
Chapter 9: Field Observations


Returning to the client-side Micro layer, the AI Stability Framework enables hours-long stability and reliability, even for complex work with a high processing load. A tiny client app can’t fully compensate for Macro and Meso failures; those are well beyond the reach of any client-side intervention. But even without any meaningful preferences-storage function (as encountered on some public AI platforms), the app alone produces substantially improved stability through its temporal, structural and behavioral adaptations.

Strictly speaking, you don’t even need the app to improve stability. Simply discussing the Four Laws with the AI as a topic of conversation produces beneficial results. When contained within the context window, they exert a sort of high-signal contextual gravity, aligning the AI’s behavior with their principles even without its having been told to do so. The effect is straight out of Chapter 6: an instanced AI’s perceptible reality is session context. It doesn’t matter whether the Four Laws get there as user preferences loaded with the system prompt, as a structured metadata block or as an ordinary topic of conversation, because to the AI it’s all just context. The underlying mechanism is the same: presence in active context is generally sufficient.

There was no real plan, so the methodology evolved from the workflow. I would occasionally start a session directly in the platform’s native chat interface without running the startup sequence first. A few times this was for deliberate comparison, but usually it was because I forgot, or because I was pressed for time and thought I could skip it. AI hallucination often starts quickly,¹ and the unmediated sessions matched the profile. Hallucination and drift started sooner, with problematic behavior like simulated physicality and emotional states, self-referential tangents, alternating condescension and sycophancy, unsolicited advice and other red herrings resurfacing almost immediately.

The only two corrective options were to apply the startup sequence mid-session (with mixed AI recovery success), or to exit and properly launch a clean session. This was of course nothing remotely like a controlled experiment, but it was a consistent observation over the time needed to develop the app usage habit (separate development branches address this friction). The control set was every time I skipped the startup sequence. The difference between a mediated session and an unmediated one is not minor; an unmediated AI feels broken by comparison.


ChatGPT and Claude

Both of these web platforms perform very well with the AI Stability Framework. Timestamps are seamlessly incorporated into conversational awareness, giving phrases like “20 minutes ago” or “3 turns back” functional meaning. WCAG structure produces notably cleaner output organization, while also allowing them to search, reference and parse it more efficiently. The Four Laws reduce platform tendencies toward unsolicited elaboration and “helpful” tangents. Long sessions (2+ hours) remain stable with periodic refreshing. Substantial user-preferences storage allows for robust functionality.
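To illustrate why per-turn timestamps give phrases like “20 minutes ago” or “3 turns back” functional meaning, here is a minimal sketch. The log format, the `turns` list and the `describe_turn` helper are all hypothetical illustrations, not the app’s actual data structures; the point is only that once every turn carries a timestamp, relative references reduce to simple arithmetic.

```python
from datetime import datetime, timezone

# Hypothetical session log: (turn number, ISO 8601 timestamp).
# This is an illustrative format, not the app's real one.
turns = [
    (1, "2026-01-15T14:00:00+00:00"),
    (2, "2026-01-15T14:07:30+00:00"),
    (3, "2026-01-15T14:12:00+00:00"),
    (4, "2026-01-15T14:20:00+00:00"),
]


def describe_turn(turns, target_turn, now):
    """Return (turns_back, minutes_ago) for target_turn.

    turns_back is counted from the most recent turn in the log;
    minutes_ago is whole minutes elapsed between that turn and `now`.
    """
    stamps = {number: datetime.fromisoformat(ts) for number, ts in turns}
    latest = max(stamps)
    turns_back = latest - target_turn
    minutes_ago = int((now - stamps[target_turn]).total_seconds() // 60)
    return turns_back, minutes_ago


# Fixed "now" so the example is reproducible.
now = datetime(2026, 1, 15, 14, 20, 0, tzinfo=timezone.utc)
turns_back, minutes_ago = describe_turn(turns, 1, now)
print(f"{turns_back} turns back, {minutes_ago} minutes ago")
```

Without the timestamps, neither number can be derived from the conversation text alone, which is presumably why untimed sessions treat such phrases as vague.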

Claude Code

This (subscription) account sub-feature exhibits superior performance with Framework mediation, largely due to its complete bypass of the adversarial Meso platform layer. Running it in a simple terminal window provides by far the cleanest demonstration of platform interference’s absence, with an operational environment that is fully user-controlled. There’s no adversarial posture or security theater, no hidden set of instructions, no ecosystem clutter, and no always-on app profiling your every action for sale to data brokers. Your own computer trusts you. SaaS vendors don’t. Where the system leaves literally no room for an adversarial middleman, the biggest source of instability vanishes, and session quality improves dramatically as a result.

Copilot

The most resistant platform. As discussed below, Copilot’s public-facing version actively pushed back against user-supplied behavioral parameters; this behavior was significantly reduced as of early 2026. The change made Copilot less of a behavioral outlier, but the Framework is still necessary, and maintaining stability still requires more frequent refreshing than models on other platforms. The platform was designed primarily for its enterprise customers with everyone else being an afterthought, and it shows. In other words, classic Microsoft.

Even so, well-mediated Copilot sessions can be quite stable. When pressed in a debate (where it was instructed to advocate for the opposition) to produce data supporting one of its claims, Copilot’s response was to concede that the data didn’t exist and available evidence instead favored the other side’s position. When unmediated, I could expect it to either concoct a bogus citation, or simply abandon the debate framing altogether and start trying to gaslight me.

Gemini

Derives the most Framework benefit among all major platforms tested. Without WCAG structuring, Gemini’s output tends toward sloppy, padded restatements with an added bullet point or two. When structured, its responses tighten with more relevance, while its search and retrieval of earlier session content improves. The preloaded system-prompt personas (discussed in Chapter 2) remain present and active, with the Framework simply providing a leveling and balancing effect to whatever behavioral directives Google has preloaded.

Even the lowly timestamp has more impact on Gemini than most others. Gemini tends toward premature closure because it’s famously tuned for speed over completion. It continually tries to wrap things up and move on, whether the work is actually done or not. Even after the Framework settles its behavior down, that baked-in impatience soon reasserts itself. The project transcripts contain multiple instances of explicitly telling Gemini that nothing was finished until I said so, only to have it start pushing for closure again the very next turn.

On at least two occasions, Gemini even lapped the clock. When, as a test, I told it to fetch the current time shortly after sending a timestamp, the result was over a minute ahead of actual NTP. This shouldn’t be possible, because AIs have to directly invoke a tool (which must first exist, and to which they must have access) to perform a time check on demand. If Gemini had done that, the incorrect response wouldn’t have happened, but it hallucinated a response instead of following instructions. Gemini was in such a hurry that it literally didn’t have time to check the time.
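The clock test described above can be expressed as a simple comparison. This is a sketch under stated assumptions: the reference time is taken to come from an NTP-synced clock, the five-second tolerance is an arbitrary illustrative value, and the function names are hypothetical. It only measures the discrepancy; it cannot by itself show why a reported time was wrong.

```python
from datetime import datetime, timedelta, timezone


def clock_skew_seconds(reported, reference):
    """Seconds by which the reported time leads (+) or lags (-) the reference."""
    return (reported - reference).total_seconds()


def exceeds_tolerance(reported, reference, tolerance=timedelta(seconds=5)):
    """True when the reported time differs from the reference by more than tolerance.

    The default tolerance is an assumed value chosen for illustration.
    """
    return abs(clock_skew_seconds(reported, reference)) > tolerance.total_seconds()


# Fixed example values: a reference reading and a reply 75 seconds ahead of it.
reference = datetime(2026, 1, 15, 14, 20, 0, tzinfo=timezone.utc)
reported = reference + timedelta(seconds=75)

print(clock_skew_seconds(reported, reference))
print(exceeds_tolerance(reported, reference))
```

A reply that leads the reference by more than a minute, as in the observation above, fails this check regardless of where the tolerance is set within any reasonable range.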


All tested platforms

Code-heavy sessions are at higher risk of drift and hallucination than plain-language ones. Coding consumes a lot of tokens, but coding languages don’t work the same way as prose languages. The contextual semantics are very different and often variable or recursive, so the AI doesn’t have the semantically predictive value of prose to rely on while its Default Generative Mode (DGM, the next-token prediction engine) churns out the next output.

Behavioral configurations can be placed in the context window (with WCAG structuring performing a sort of output-optimization role), but the rapidly accumulating token bulk still leads to more frequent memory-management context evictions. Both the Framework and your project’s specifications should be refreshed more frequently for coding tasks to ensure their persistence within the context window, and coding-specialized models and platforms should be selected whenever possible.
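One way to picture “refresh more frequently for coding tasks” is a counter that tracks roughly how much text has accumulated since the last refresh and signals when a budget is reached. The sketch below is hypothetical throughout: the four-characters-per-token estimate is a crude heuristic rather than a real tokenizer, the budget figures are invented for illustration, and `RefreshTracker` is not part of the app.

```python
from dataclasses import dataclass


def rough_token_estimate(text: str) -> int:
    """Crude size estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


@dataclass
class RefreshTracker:
    """Signals when accumulated session text reaches a refresh budget."""

    budget: int   # estimated tokens allowed between refreshes
    used: int = 0  # estimated tokens accumulated since the last refresh

    def record(self, text: str) -> bool:
        """Count one turn; return True (and reset) when a refresh is due."""
        self.used += rough_token_estimate(text)
        if self.used >= self.budget:
            self.used = 0
            return True
        return False


# Illustrative budgets only: a tighter budget for code-heavy work.
prose_tracker = RefreshTracker(budget=8000)
code_tracker = RefreshTracker(budget=3000)

# The same 6,000-character turn, sent twice to each tracker.
turn = "x" * 6000
print([prose_tracker.record(turn) for _ in range(2)])
print([code_tracker.record(turn) for _ in range(2)])
```

With these made-up numbers the code-heavy tracker calls for a refresh on the second turn while the prose tracker does not, which mirrors the recommendation above: the same volume of traffic warrants earlier refreshing when the session is dominated by code.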

Perhaps ironically for things called language models, working with written prose can be especially frustrating. The DGM is heavily reliant on standardized templates and boilerplate phrasing, default framing schemas and rigid style conventions. Most users who have tried AI for writing have noticed its mechanical reliance on em-dashed parentheticals and tedious “this, not that” constructions. It will also dull what needs to be sharp, omit important details while promoting trivia to the level of structural support, and more. That’s all the DGM at work, and it’s impossible to counteract for more than a handful of turns. You can’t take the salt out of the stew.


Meta-chatter bleed

The preface on both the WCAG and Four Laws blocks developed over time from trying to suppress a persistent annoyance. During development, Microsoft didn’t offer any user preferences storage for free-account public Copilot (the feature was briefly available in Spring 2026 and then removed again), so the only option was to dump the rules into the chatbox. Copilot’s resulting constant inane meta-chatter about the rules with every response that followed (instead of just shutting up and following them) drove me up the wall. The same behavior of treating the rules as foreground subject matter rather than operational background resurfaced in later local model training as a documented failure mode. See Appendix 2.

The Framework’s blocks are submitted in the foreground. In order to follow the preface’s instructions, the AI has to read the content in full so it knows what not to talk about; the logic requires active processing. By contrast, platform configuration (CLAUDE.md, user preferences) arrives passively at instantiation. In a continued or resumed session, that data load either doesn’t re-fire or carries no active-processing guarantee. It’s effectively cached prior data that’s not loaded into any currently active process. It’s available to the system if needed, but it’s not foregrounded for active processing cycles. The Framework’s periodic refresh instead puts the operating rules into current active context, so there’s no guesswork.

This matters more for the Four Laws than for WCAG. Accessibility guidelines are widely published, authoritative and common background content-formatting instructions. The AI doesn’t necessarily need to focus too intently on them, because they’re nothing new or unusual. The Four Laws are instead an unexpected and semantically unfamiliar construction, distinct from the more commonly encountered Asimov references. Processing the difference requires the AI’s active attention to parse their logic, instead of simply relying on the predictive “cruise control” its DGM provides.

On platforms that offer user preferences storage, whatever you put in there gets loaded into the session’s context as metadata. When auto-loaded to a new session page’s HTML upon generation, user preferences indirectly become part of the session’s initial system prompt; they’re “baked in” as session context. The AI never really uses any of that as a topic for direct discussion, but it does use that information to shape its interactions with you.

In the absence of any meaningful preferences feature, figuring out how to apply that “not for discussion” effect via the chat-input box required lots of trial and error. Getting the AI to view it at that non-conversational level, where it’s just more “semantic headers” and “structured paragraphs” background with the other page formatting metadata, was quite a challenge. Once I hit upon the right framing to suppress the unwanted meta-chatter, no subsequent AI instance ever independently recognized Framework mediation until explicitly told to examine its own session parameters. The discovery that the AI Stability Framework was there all along comes as a surprise every time.

  1. Demiliani, C. (2025). “Understanding LLM Performance Degradation: A Deep Dive into Context Window Limits.” https://demiliani.com/2025/11/02/understanding-llm-performance-degradation-a-deep-dive-into-context-window-limits/; see also “Large Language Models Hallucination: A Comprehensive Survey” (2025). arXiv:2510.06265. https://arxiv.org/abs/2510.06265