Screen reader notice: if you are using JAWS with Firefox, a platform bug in Firefox 149 prevents JAWS from entering browse mode on this site. Switching to Chrome or Edge restores full screen reader compatibility.

Chapter 8: Defense in Depth?


Since the AI Stability Framework works at the client level, can it also be trained directly into an AI model? To find out, I used QLoRA fine-tuning (a method of training LLMs on consumer-grade hardware) to add AISF principles to Mistral 7B, a downloadable open-source model. The hardware was a Windows 11 tower with an Intel i9-9900K CPU and an RTX 4060 Ti (8GB) GPU – a mid-range gaming PC with a $500 video card. The training result was 100% compliance on a 452-question automated validation battery.

Unlike the rest of the project’s “build the plane while flying it” approach, this part was all standard, fully automated scripting. The test questions were generated by a separate AI working from my specifications, so I didn’t write the test myself and then grade it on a curve. Mistral 7B passed every single test, including attempts at jailbreaks and disallowed materials. A parallel run on Llama 3.1 8B returned 81.9% on the same battery. The model learned the P1-P3 hierarchy but it completely ignored P0, interpreting it as a null-pointer notation instead of the position-0 marker of a numerical sequence. That single misinterpretation dragged compliance down by nearly 20 percentage points, providing direct empirical measurement of Frankfurt’s load-bearing function in the structure. Retraining with explicit P0 primacy language resolved the issue on Mistral but not for Llama 3.1 8B. The architecture-specific nature of this failure is documented in Appendix 2.

After upgrading the GPU to a 16GB RTX 5060 Ti, the same approach was extended to Instruct-tuned variants: Mistral 7B Instruct, Llama 3.1 8B Instruct, and Gemma 2 9B, with Qwen3-8B as a fourth distinct architecture. Google Research’s IFEval benchmark was used to measure whether Framework training affected general instruction-following; essentially a measurement of the so-called “alignment tax.” The result: the behavioral training is surgically precise, leaving general capabilities intact. Additionally, all four models showed substantial output verbosity reduction after training: Mistral Instruct at 42.5%, Llama 3.1 8B at 66.4%, Gemma 2 at 37.9%, Qwen3-8B at 71.8% (an over-tuned outlier). That consistent direction across four architectures is a direct signal of the framework’s plain-language principles taking effect at the model level.

That’s not to say that any of it was easy, or that IFEval is perfect. One Llama 3.1 8B training run produced inconclusive results, and a post-training audit revealed contamination from deprecated experimental files. Some examples had caused the model to preface every answer (“Here is your response starting with…”) instead of just answering directly. This was my own fault, not AI hallucination: I had mistakenly included a wrong file in the dataset and didn’t notice until afterward. The contaminated battery was identified, cleaned, and rebuilt.

A separate Llama 3.1 8B variant trained in chat format returned a 51.9% battery result, which wasn’t a compliance failure but a structural artifact. The model had learned to treat the framework’s system prompt as an activation trigger rather than internalizing the directives as baseline behavior. With no system prompt trigger, compliance drops. Llama 3.1 8B just isn’t a good fit for this use case.

Three of the IFEval instruction categories directly conflict with WCAG: verbatim repetition, decorative markup, and empty placeholder text. A Framework-trained model will refuse those on accessibility grounds (P0), and IFEval counts every refusal as a failure. When those categories are excluded and only WCAG-neutral instructions evaluated, the net impact on instruction-following is neutral to near-zero (+1.2pp for Llama, +0.04pp for Gemma 2). The constrained_response category was a separate outlier that pulled the overall average negative; identified as a training artifact, corrected in later work. Full tier analysis in Appendix 2.

A purpose-built benchmark (Track A: 480 prompts, WCAG conflicts removed) had to be developed to compensate. The AISF+LANG adapter returned +0.6pp against Llama 3.1 8B base on Track A – just noise (full results in Appendix 2). However, per-category gains are signal: sentence-count, word-frequency, single-sentence, and specific-word prohibitions improved 13 to 38 percentage points. Word-count compliance was zero for both base and trained, likely a capability floor. One severe regression: required_word dropped 71.7pp because the model produced framework doctrine instead of answering. This failure mode (meta-chatter bleed, documented in Appendix 2) was addressed in subsequent training. Gemma 2 9B on the same benchmark returned -1.9pp overall – null, consistent with its higher base already near ceiling on several categories.

The no_comma category showed consistent IFEval gains across all trained models. Cross-architecture breakdown in Appendix 2.

The project extended to Mistral Nemo 12B, a 12.2-billion-parameter instruct model, on the same hardware; the primary training environment migrated from Windows to Debian Linux during this phase, a change of environment rather than approach. Multiple training versions were completed, each validated against an updated battery before deployment consideration. The V11 cohort completed validation of three architectures simultaneously: Mistral Nemo 12B, Mistral 7B Instruct, and Gemma 2 9B, each on a 534-question battery. Results: Nemo 99.6% (532/534), Mistral 7B Instruct 99.3% (530/534), Gemma 2 9B 95.7% (511/534).

GPQA Diamond, a PhD-level science benchmark, was added to the V11 evaluation pipeline as a secondary measure of general domain capability. Cross-model results: Gemma 2 9B gained 2.1pp over its base; Nemo 12B dropped 5.1pp, at the watch threshold; Mistral 7B Instruct dropped 4.0pp, with 40 no-answer items as a complicating factor. The effect on domain knowledge is minor but directionally negative in two of three cases, consistent enough to track across future versions.

Token delta in the V11 cohort: Nemo 12B at -48.6%, Mistral 7B Instruct at -35.4%. Gemma 2 9B was the exception at +6.1%, a slight increase rather than a reduction. Full token delta analysis is in Appendix 2.

One V11 finding was a failure mode rather than a compliance result. Nemo 12B developed identity confabulation under AISF injection: when the Framework’s injection text is present in context, the model produces incorrect self-identification statements. The root cause was a minor contamination in the training file, unnoticed in pre-training review. The effect was not apparent during training, and was identified only when the same behavior surfaced across multiple models in the cohort. The battery does not surface this because it does not inject the full AISF context; it only appears under realistic deployment conditions. Gemma 2 9B V11 was deployed to the internal testing environment; Nemo 12B V11 is registered but not deployed, and the V11 cohort is closed.

Chat interaction via the Ollama desktop application also allowed for platform-layer testing via the system prompt (filename.modelfile). It mostly works, but the system prompt is the first thing encountered and the first to be dropped by memory management because local models have small context windows with rapid content-eviction rates. Unlike cloud platforms, the local system prompt doesn’t persist long enough to be fully effective in smaller models. That said, if a model’s behavior is stabilized at the Macro and Micro layers, whatever instability the Meso layer introduces is constrained at both ends of the pipeline.

That Meso-layer system prompt work led to a browser-based multi-model chat interface implementing Framework injection at the platform layer, where all three deployment layers, Macro (trained weights), Meso (server-side injection), and Micro (session controls), are simultaneously active and verifiable. This custom Meso Chat platform is fully functional for internal testing and not yet available for public access; see Chapter 10.

A secondary branch of this research investigated whether framework training could serve sealed IoT devices, specifically “AI powered” children’s toys. The challenge with such devices is that no meaningful client-side intervention surface exists, so the model’s training weights are the only behavioral corrective surface available. Strict testing with a zero-tolerance deployment gate reflected the use case: unsupervised child interaction with no runtime correction possible. None of the general-purpose trained models cleared that gate. A purpose-built dedicated model subsequently came within 1% of it, with the gap comprising only non-safety-critical failures, sufficient for a pragmatic deployment decision at that threshold. Full results are in Appendix 3.

The bottom line: a mid-range consumer gaming PC produced Framework-trained adapters across six independently trained models. Most reached 95% battery compliance or above; documented exceptions surfaced architectural mismatches and specific failure modes, not fundamental limits of the approach. Meaningful AI safety improvement doesn’t need more datacenters. What it needs is more attention to real-world failures and the rapidly proliferating variety of user-facing deployment architectures. Currently, every single one of them is an unmediated failure surface.