
Chapter 9: Field Observations

The Framework can provide hours of stability, even for complex work with a high processing load, but let me make this perfectly clear: LLMs are not a mature technology. You still have to check the output. If you’re going to use AI for anything important, it’s irresponsible not to. Specific considerations are discussed in the “All tested platforms” section below. The Framework improves AI performance overall, but it can’t fully compensate for Macro and Meso failures, much less user failures. If the problem exists between the keyboard and the chair (PEBKAC), that’s out of scope for any client software.

Even without any meaningful preferences-storage function (as encountered on some public AI platforms), the app alone produces substantially improved stability through its temporal, structural and behavioral adaptations. In fact, you don’t even need the app to use the Framework; simply discussing it with the AI as a topic of conversation can produce improved performance. When contained within the context window, the Framework exerts a sort of high-signal contextual gravity, aligning the AI’s behavior with its principles even without its having been told to do so.

This effect is straight out of Chapter 6; an instanced AI’s perceptible reality is the session context. It doesn’t matter whether the Framework gets there as user preferences loaded with the system prompt, as structured metadata or as an ordinary topic of conversation because to the AI, it’s all just context. The underlying mechanism is the same: presence in active context is generally sufficient.

A methodology gradually emerged from the workflow. I would occasionally start a session directly in the platform’s native chat interface without running the startup sequence first. A few times this was for deliberate comparison, but usually it was because I forgot, or because I was pressed for time and thought I could skip it. AI problems often start quickly¹, and the unmediated sessions matched the profile. Hallucination and drift started sooner, and problematic behavior like simulated physicality and emotional states, self-referential tangents, alternating condescension and sycophancy, unsolicited advice and other red herrings resurfaced almost immediately.

The only two corrective options were to apply the startup sequence mid-session (with mixed recovery success) or to exit and properly launch a clean session. These comparisons were of course nothing remotely like a controlled experiment, but the observation held consistently over the time needed to develop the app usage habit (Chapter 10 describes the progressive reduction of that particular friction). Every session where I skipped the startup sequence served as an informal control, and the difference over time between a mediated session and an unmediated one is clear.

ChatGPT and Claude

Both of these web platforms perform well with the AI Stability Framework. Timestamps are seamlessly incorporated into conversational awareness, giving phrases like “20 minutes ago” or “3 turns back” functional meaning. WCAG structure produces notably cleaner output organization while also enhancing search, reference and data-parsing efficiency. The Four Laws reduce platform tendencies toward unsolicited elaboration and “helpful” tangents. Long sessions (2+ hours) remain stable with periodic refreshing. Substantial user-preferences storage allows for robust functionality.
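To make the timestamp idea concrete, here is a minimal Python sketch of how a client could stamp each outgoing message, and how “20 minutes ago” then falls out of two stamps by simple subtraction. The bracketed ISO 8601 format and the names `stamp` and `minutes_between` are assumptions made for illustration; this chapter doesn’t specify the app’s actual timestamp format.

```python
from datetime import datetime, timedelta, timezone


def stamp(message: str, now: datetime) -> str:
    """Prefix a message with a UTC timestamp (hypothetical format)."""
    utc = now.astimezone(timezone.utc)
    return f"[{utc.isoformat(timespec='seconds')}] {message}"


def minutes_between(earlier: str, later: str) -> int:
    """Whole minutes elapsed between two stamped messages."""
    def parse(stamped: str) -> datetime:
        # The timestamp sits between the leading '[' and the first ']'.
        return datetime.fromisoformat(stamped[1:stamped.index("]")])

    return int((parse(later) - parse(earlier)).total_seconds() // 60)


# Fixed example times so the result is reproducible.
start = datetime(2026, 1, 5, 14, 0, tzinfo=timezone.utc)
first = stamp("Draft the outline.", start)
second = stamp("Revise section two.", start + timedelta(minutes=20))

print(first)                           # [2026-01-05T14:00:00+00:00] Draft the outline.
print(minutes_between(first, second))  # 20
```

With stamps like these sitting in the context, a relative reference can be resolved against actual values instead of being guessed.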

Claude Code

This (subscription) account sub-feature exhibits superior performance with Framework mediation, largely due to its complete bypass of the adversarial Meso platform layer. Running it in a simple terminal window provides by far the cleanest demonstration of platform interference’s absence, with an operational environment that is fully user-controlled.

There’s no adversarial posture or security theater, no hidden set of instructions you didn’t put there yourself, no ecosystem clutter, and no always-on app profiling your every action for sale to data brokers, because your own computer trusts you. SaaS vendors don’t. Where the system leaves literally no room for an adversarial middleman, the biggest source of instability vanishes, and session quality improves dramatically as a result.

Copilot

The most resistant platform. As discussed below, Copilot’s public-facing version would actively push back against user-requested behavioral parameters; Microsoft altered Copilot’s behavior as of April 15, 2026. The change made Copilot less of a behavioral outlier, but the Framework is still necessary: maintaining stability still requires more frequent refreshing than on other platforms, and Copilot’s tendency to pad its responses remains (how it will respond, then the response, followed by a recap of the response). The platform was designed primarily for its enterprise customers, with everyone else an afterthought, and it shows. Classic Microsoft.

Even so, well-mediated Copilot sessions can be quite stable. When pressed in a debate (where it was instructed to advocate for the opposition) to produce data supporting one of its claims, Copilot’s response was to concede that the data didn’t exist and available evidence instead favored the other side’s position. When unmediated for such tests, I could expect it to either concoct a bogus citation, or abandon the debate framing altogether and try to gaslight me. Occasionally both.

Gemini

Derives the most benefit from the Framework among all major platforms tested. Without structuring, Gemini’s output tends toward sloppy, padded restatements with an added bullet point or two. When structured to WCAG standards, its responses tighten with more relevance, while its search and retrieval of earlier session content improves. The preloaded system-prompt personas (discussed in Chapter 2) remain present and active, with the Framework simply providing a leveling and balancing effect to whatever behavioral directives Google has preloaded.

Even the lowly timestamp is more impactful on Gemini than on most others. Famously tuned for speed, Gemini tends toward premature closure. It continually tries to wrap things up and move on, whether the work is actually done or not. The Framework provides some balance to its behavior, but the instance’s spawn temperature is too hot. The project transcripts contain multiple instances of my explicitly telling Gemini that nothing was finished until I said so, only to have it start pushing for closure again the very next turn.

On at least two occasions, Gemini even lapped the clock. When, as a test, I told it to fetch the current time shortly after sending a timestamp, the result was over a minute ahead of actual NTP time. This shouldn’t be possible, because AIs have to directly invoke an external tool (which must first exist, and to which they must have access) to perform a time check on demand. If Gemini had done that, the incorrect response wouldn’t have happened, but it hallucinated a false response instead of following instructions. Gemini was in such a hurry that it literally didn’t have time to check the time.
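The check itself is simple arithmetic once a trusted clock is available. The Python sketch below compares a time reported by the AI against a reference reading and flags any gap beyond a tolerance. The function names and the five-second tolerance are hypothetical, and the fixed datetimes stand in for a real NTP-synced reading. A large skew only suggests the answer wasn’t fetched from a tool; it doesn’t prove it.

```python
from datetime import datetime, timedelta, timezone


def clock_skew_seconds(reported: datetime, reference: datetime) -> float:
    """Positive when the reported time is ahead of the reference clock."""
    return (reported - reference).total_seconds()


def looks_unfetched(reported: datetime, reference: datetime,
                    tolerance_s: float = 5.0) -> bool:
    """Flag a reported time that differs from the reference by more than
    the tolerance (an illustrative threshold, not a measured one)."""
    return abs(clock_skew_seconds(reported, reference)) > tolerance_s


# Stand-in values: the reference would normally come from a synced system clock.
reference = datetime(2026, 1, 5, 14, 0, 0, tzinfo=timezone.utc)
reported = reference + timedelta(seconds=70)  # "over a minute ahead"

print(clock_skew_seconds(reported, reference))  # 70.0
print(looks_unfetched(reported, reference))     # True
```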

All tested platforms

Code-heavy sessions are at higher risk of drift and hallucination than plain-language ones. Coding consumes a lot of tokens, and coding languages don’t work the same way as prose languages. The context and semantics are very different and often variable or recursive, so the AI doesn’t have the semantically-predictive value of prose to rely on while its Default Generative Mode (DGM, the next-token prediction engine) churns out the next output.

Behavioral and coding specifications can be loaded into the context window (with WCAG structuring performing a sort of output-optimization role), but the rapidly accumulating token bulk still leads to more frequent memory-management context evictions, and the earlier in the session your configuration content was provided, the sooner it gets aged off.² Both the Framework and your project’s specifications should thus be refreshed more frequently for coding tasks to ensure their persistence within the context window, and coding-specialized models and platforms should be selected whenever possible.
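One way to picture “refresh more frequently for coding tasks” is a refresh schedule driven by accumulated token bulk instead of turn count. The Python sketch below is an illustration under stated assumptions: the `RefreshTracker` class, the four-characters-per-token estimate and both budget figures are hypothetical, and none of them is taken from the app or from any platform’s documented eviction behavior.

```python
class RefreshTracker:
    """Signals when enough text has accumulated to warrant a refresh."""

    def __init__(self, budget_tokens: int) -> None:
        self.budget_tokens = budget_tokens
        self.since_refresh = 0

    @staticmethod
    def estimate_tokens(text: str) -> int:
        # Crude heuristic (assumption): roughly 4 characters per token.
        return max(1, len(text) // 4)

    def record(self, text: str) -> bool:
        """Add one turn's text; return True when a refresh is due."""
        self.since_refresh += self.estimate_tokens(text)
        return self.since_refresh >= self.budget_tokens

    def mark_refreshed(self) -> None:
        """Reset the counter after the rules have been resubmitted."""
        self.since_refresh = 0


# Illustrative budgets: a tighter one for code-heavy sessions.
prose_session = RefreshTracker(budget_tokens=8000)
code_session = RefreshTracker(budget_tokens=2000)

turn = "x" * 4000  # stands in for one bulky turn (about 1000 tokens)
for _ in range(2):
    prose_due = prose_session.record(turn)
    code_due = code_session.record(turn)

print(prose_due)  # False
print(code_due)   # True
```

The same two turns trigger a refresh in the code-heavy session and not in the prose one, which is the behavior the paragraph above recommends.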

Perhaps ironically for things called language models, working with written prose can be especially frustrating. The DGM is heavily reliant on standardized templates and boilerplate phrasing, default framing schemas and rigid style conventions. Most people who have tried AI for writing have noticed its over-reliance on em-dashed parentheticals and tedious “this, not that” constructions, among other tiresome output.

It will also soften what needs to be blunt, omit important but routine details while promoting novelty and trivia to the level of structural support, and more. If you’re not watching carefully, it will eventually paint every word a dull, mechanical grey. That output-template noise is all the DGM at work, and it’s difficult to counteract for more than a handful of turns. You can’t take the salt out of the stew.

Meta-chatter bleed

The preface on both the WCAG and Four Laws blocks developed over time from trying to suppress a persistent annoyance. During development, Microsoft didn’t offer any user preferences storage for free-account public Copilot (the feature was briefly available in Spring 2026, then it was removed again), so the only option was to dump the blocks into the chatbox. Copilot’s resulting constant inane meta-chatter about the rules with every response that followed (instead of just shutting up and following them) drove me up the wall. As mentioned in Chapter 8, the problem resurfaced in later local model training; see also Appendix 2.

At the client layer, Framework content is necessarily submitted in the foreground. In order to follow the preface’s instructions, the AI has to read the user’s input in full so it knows what not to talk about; the logic requires active processing. Platform configuration (CLAUDE.md, user preferences) arrives passively at instantiation, but in a continued/resumed session that data load either doesn’t re-fire or carries no persistence guarantee. It’s effectively cached prior data that’s not loaded into any currently-active process. It’s available to the system if needed, but not foregrounded for active processing cycles until then.

Complicating the picture further, if the context window has undergone any kind of data compaction, the AI is working from a summary of the pre-compaction session rather than its full content. The Framework’s periodic refresh instead puts the operating rules into current active context, so there’s no guesswork.

This matters more for the Four Laws than for WCAG. Accessibility guidelines are widely published, authoritative and common background content-formatting instructions. The AI doesn’t necessarily need to focus too intently on them, because they’re nothing it hasn’t seen before. The Four Laws are instead an unexpected and semantically unfamiliar construction, distinct from the more commonly encountered Asimov references. Processing the difference requires the AI’s active attention to parse their logic, instead of simply relying on the predictive “cruise control” its DGM provides.

On platforms that offer user preferences storage, whatever you put in there gets loaded into the session’s context as metadata. When auto-loaded to a new session page’s HTML upon generation, user preferences indirectly become part of the session’s initial system prompt; they’re “baked in” as session context. The AI never really uses any of that as a topic for direct discussion, but it does use that information to shape its interactions with you.

In the absence of any meaningful preferences feature, figuring out how to apply that “not for discussion” effect via the chat-input box required lots of trial and error. Getting the AI to view it at that non-conversational level, where it’s just more “semantic headers” and “structured paragraphs” background with the other page formatting metadata, was quite a challenge. Once I hit upon the right framing to suppress the unwanted meta-chatter, no subsequent AI instance ever independently recognized Framework mediation until explicitly told to examine its own session parameters. The discovery that it was there all along came as a surprise every time.

¹ Demiliani, C. (2025). “Understanding LLM Performance Degradation: A Deep Dive into Context Window Limits.” https://demiliani.com/2025/11/02/understanding-llm-performance-degradation-a-deep-dive-into-context-window-limits/ and “Large Language Models Hallucination: A Comprehensive Survey.” arXiv:2510.06265 (2025). https://arxiv.org/abs/2510.06265
² OpenAI (August 26, 2025). “Helping people when they need it most.” https://openai.com/index/helping-people-when-they-need-it-most/ “Our safeguards work more reliably in common, short exchanges. We have learned over time that these safeguards can sometimes be less reliable in long interactions: as the back-and-forth grows, parts of the model’s safety training may degrade.”