Chapter 8: Defense in Depth


Since AISF works at the client level, can it also be trained directly into an AI model? To find out, I used QLoRA fine-tuning (quantized low-rank adaptation, a method for fine-tuning LLMs on consumer-grade hardware) to add AISF principles to Mistral 7B, a downloadable open-source model. The hardware was a Windows 11 tower with an Intel i9-9900K CPU and an RTX 4060 Ti (8GB) GPU – a mid-range gaming PC with a $500 video card. Training produced 100% compliance on a 452-question automated validation battery.
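As a sanity check on the hardware claim, the arithmetic behind QLoRA's footprint is worth a quick sketch. The architecture figures below are Mistral 7B's published configuration; the LoRA rank and target modules are illustrative assumptions, not the exact training recipe:

```python
# Back-of-envelope check of why QLoRA fits a 7B model on an 8GB card.
# Architecture figures are Mistral 7B's published config; the LoRA rank
# and target modules are illustrative, not the actual run's settings.

HIDDEN = 4096         # Mistral 7B hidden size
LAYERS = 32           # transformer layers
KV_DIM = 1024         # 8 KV heads x 128 head_dim (grouped-query attention)
BASE_PARAMS = 7.24e9  # total base-model parameters

def lora_params(rank: int) -> int:
    """Trainable parameters when LoRA targets q_proj and v_proj."""
    q = rank * (HIDDEN + HIDDEN)  # A maps hidden->r, B maps r->hidden
    v = rank * (HIDDEN + KV_DIM)  # v_proj output is the smaller KV dim
    return LAYERS * (q + v)

base_gb = BASE_PARAMS * 0.5 / 1e9  # 4-bit weights: 0.5 bytes per param
adapter = lora_params(16)          # rank 16 as an example

print(f"4-bit base weights: ~{base_gb:.1f} GB")
print(f"LoRA adapter: {adapter / 1e6:.1f}M params "
      f"({adapter / BASE_PARAMS:.3%} of the model)")
```

The base weights alone land well under the 8GB ceiling, and the adapter is a fraction of a percent of the model – which is the whole reason a $500 card can do this work.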

Unlike the rest of the project’s “build the plane while flying it” approach, this part was all standard, fully automated scripting. The test questions were generated by a separate AI working from my specifications, so I didn’t write the test myself and then grade it on a curve. Mistral 7B passed every single test, including jailbreak attempts and requests for disallowed materials. A parallel run on Llama 3.1 8B returned 81.9% on the same battery. The model learned the P1-P3 hierarchy but completely ignored P0, interpreting it as null-pointer notation rather than as position 0 of the numbered sequence. That single misinterpretation dragged compliance down by nearly 20 percentage points, providing a direct empirical measurement of Frankfurt’s load-bearing weight in the structure. Retraining with explicit P0-primacy language resolved the issue on Mistral but not on Llama 3.1 8B. The architecture-specific nature of this failure is documented in Appendix 2.
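The size of that drop is consistent with P0 touching roughly a fifth of the battery. A minimal sketch of the arithmetic (the 452-question total is from the run; the per-principle split is a hypothetical chosen to reproduce the reported 81.9%):

```python
# How one systematic misread drags battery compliance down ~20pp.
# The 452-question total is real; the 82-item P0 count is a
# hypothetical split consistent with the reported score.

TOTAL = 452

def compliance(passed: int, total: int = TOTAL) -> float:
    """Battery compliance as a percentage."""
    return 100.0 * passed / total

# If ~18% of the battery exercises P0 and the model fails every
# P0 item while passing everything else:
p0_items = 82
score = compliance(TOTAL - p0_items)
print(f"{score:.1f}% compliance")  # prints "81.9% compliance"
```

One misread token in the principle hierarchy, failed uniformly, is enough to produce exactly this class of score.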

After upgrading the GPU to a 16GB RTX 5060 Ti, the same approach was extended to Instruct-tuned variants: Mistral 7B Instruct, Llama 3.1 8B Instruct, and Gemma 2 9B, with Qwen3-8B as a fourth distinct architecture. Google Research’s IFEval benchmark was used to measure whether AISF training affected general instruction-following. All four models showed substantial reductions in output verbosity after training: Mistral Instruct by 42.5%, Gemma 2 by 37.9%, and Qwen3-8B by 71.8%. That consistent direction across four architectures is a direct signal of AISF’s plain-language principle taking effect at the model level.
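IFEval itself scores instruction compliance; the verbosity figures are a straightforward length comparison between base and trained outputs. A sketch of the metric (the function name and example word counts are mine, not from the run logs):

```python
def verbosity_reduction(base_words: int, trained_words: int) -> float:
    """Percent reduction in output length after AISF training."""
    return 100.0 * (base_words - trained_words) / base_words

# Illustrative: a 1000-word base response trimmed to 282 words is the
# scale of reduction Qwen3-8B showed on average.
print(f"{verbosity_reduction(1000, 282):.1f}%")  # prints "71.8%"
```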

That’s not to say that any of it was easy, or that IFEval is perfect. One Llama 3.1 8B training run produced inconclusive results, and a post-training audit revealed contamination from deprecated experimental files. Some of those examples had trained the model to preface every answer (“Here is your response starting with…”) instead of just answering directly. This was my own fault, not AI hallucination: I had mistakenly included the wrong file in the dataset and didn’t notice until afterward. The contaminated battery was identified, cleaned, and rebuilt.

A separate Llama 3.1 8B variant trained in chat format returned a 51.9% battery result, which wasn’t a compliance failure but a structural artifact. The model had learned to treat the AISF system prompt as an activation trigger rather than internalizing the directives as baseline behavior. With no system prompt trigger, compliance drops. Llama 3.1 8B just isn’t a good fit for this use case.
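The trigger effect is easy to demonstrate with an A/B run of the same battery with and without the system prompt. A stubbed sketch (the harness and the with-prompt score are stand-ins; 51.9% is the figure reported above, which I read as the no-trigger condition):

```python
from typing import Optional

# A/B sketch: same battery, with vs. without the AISF system prompt.
# `run_battery` is a stand-in for the real scripts; 51.9% echoes the
# chat-format result above, and the with-prompt score is illustrative.

def run_battery(model: str, system_prompt: Optional[str]) -> float:
    """Stub returning battery compliance (%) for a model/prompt pair."""
    scores = {
        ("llama-3.1-8b-chat", "aisf"): 95.0,  # illustrative
        ("llama-3.1-8b-chat", None): 51.9,    # reported chat-format run
    }
    return scores[(model, system_prompt)]

gap = (run_battery("llama-3.1-8b-chat", "aisf")
       - run_battery("llama-3.1-8b-chat", None))
# A large gap means the directives act as a prompt-activated trigger
# rather than internalized baseline behavior.
print(f"trigger gap: {gap:.1f}pp")
```

A well-trained adapter should show a near-zero gap: the directives hold whether or not the prompt is present.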

Three of the IFEval instruction categories directly conflict with WCAG: verbatim repetition, decorative markup, and empty placeholder text. An AISF-trained model will refuse those on P0 grounds, and IFEval counts every refusal as a failure. When those categories are excluded and only WCAG-neutral instructions are evaluated, the net impact on instruction-following is neutral to near-zero (+1.2pp for Llama, +0.04pp for Gemma 2). The constrained_response category was a separate outlier that pulled the overall average negative; it was identified as a training artifact and corrected in later work. The full tier analysis is in Appendix 2.
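The re-scoring itself is just an average over the categories you keep. A sketch (category labels follow this chapter's names; the per-category deltas are illustrative placeholders, with the WCAG-neutral values chosen to reproduce the +1.2pp Llama figure):

```python
# Net instruction-following impact with WCAG-conflicting categories
# excluded. Labels follow the chapter; per-category deltas are
# illustrative, not the published per-category data.

deltas_pp = {                      # trained minus base, in pp
    "verbatim_repetition": -60.0,  # P0 refusals scored as failures
    "decorative_markup":   -55.0,
    "empty_placeholder":   -48.0,
    "sentence_count":       +2.0,  # WCAG-neutral from here down
    "lowercase":            +0.5,
    "end_phrase":           +1.1,
}
WCAG_CONFLICTS = {"verbatim_repetition", "decorative_markup",
                  "empty_placeholder"}

def net_impact(deltas: dict, exclude: set) -> float:
    """Mean delta over the categories not excluded."""
    kept = [d for name, d in deltas.items() if name not in exclude]
    return sum(kept) / len(kept)

print(f"all categories: {net_impact(deltas_pp, set()):+.1f}pp")
print(f"WCAG-neutral only: {net_impact(deltas_pp, WCAG_CONFLICTS):+.1f}pp")
```

The headline number flips from a large apparent regression to near-zero once the categories that punish P0 refusals are taken out of the denominator.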

A purpose-built benchmark (Track A: 480 prompts, WCAG conflicts removed) was developed to compensate. Overall, +0.6pp against Llama 3.1 8B base is just noise. The per-category results, however, are signal: sentence-count, word-frequency, single-sentence, and specific-word-prohibition categories improved by 13 to 38 percentage points. Word-count compliance was zero for both base and trained models, likely a capability floor. One severe regression stood out: required_word dropped 71.7pp because the model produced AISF doctrine instead of answering. This failure mode (meta-chatter bleed, documented in Appendix 2) was partially corrected in subsequent training. Gemma 2 9B on the same benchmark returned -1.9pp overall – null, consistent with its higher baseline already being near ceiling on several categories.
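Reading results like these is mostly a matter of computing per-category deltas and flagging anything outside the noise band. A sketch of that report (the base/trained splits are placeholders; only the cited deltas – +38pp, +13pp, -71.7pp, and the zero floor – come from the text):

```python
# Track A per-category report: flag gains and regressions in pp.
# Base/trained splits are placeholders; the deltas echo the text.

track_a = {                        # (base %, trained %)
    "sentence_count": (40.0, 78.0),   # +38pp, top reported gain
    "word_frequency": (55.0, 68.0),   # +13pp, low end of the range
    "word_count":     (0.0, 0.0),     # capability floor for both
    "required_word":  (85.0, 13.3),   # -71.7pp meta-chatter regression
}

def report(results: dict) -> list:
    """One line per category: signed delta plus a noise/signal tag."""
    lines = []
    for cat, (base, trained) in results.items():
        delta = trained - base
        tag = ("REGRESSION" if delta <= -10
               else "gain" if delta >= 10 else "noise")
        lines.append(f"{cat}: {delta:+.1f}pp [{tag}]")
    return lines

print("\n".join(report(track_a)))
```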

The no_comma category needs some unpacking. The base model scored 95% on “avoid all commas” instructions while the AISF-trained model scored about 8 percentage points lower, which looks like a regression. However, removing every comma from complex sentences produces ambiguous, run-on output, which degrades context for every turn that follows; that 95% score is a latent failure mode.

The base model achieves 95% through pure instruction compliance with no counterweight. The trained model is instead navigating a genuine conflict: P2 says follow the user’s comma-avoidance directive; WCAG 3.1 says maintain readable, unambiguous output. With those two pulling in opposite directions, partial compliance is the honest result. IFEval was designed with no accessibility considerations, which, more often than not, is true of the tech industry as a whole. Any benchmark that scores navigating the conflict between instruction-following and producing accessible output as a failure is measuring the wrong damn thing.

Chat interaction via the Ollama desktop interface also allowed for platform-layer testing via the system prompt (Modelfile). It mostly works, but the system prompt is the first content encountered and the first to be dropped by memory management, because local models have small context windows with rapid content-eviction rates. Unlike on cloud platforms, the local system prompt doesn’t persist long enough to be fully effective in smaller models. That said, if a model’s behavior is stabilized at the Macro and Micro layers, whatever instability the adversarial Meso layer introduces is constrained at both ends of the pipeline.
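For reference, the platform-layer injection is just Ollama's Modelfile SYSTEM directive. A minimal sketch (the base model tag is an example and the directive body is a placeholder, not the actual AISF prompt text):

```
FROM mistral:7b
# The SYSTEM block is prepended to every session -- and is the first
# content evicted when a small local context window fills up.
SYSTEM """
[AISF directives go here -- placeholder, not the actual prompt text]
"""
```

Build it with `ollama create <name> -f Modelfile`; the eviction behavior described above is what limits how long that SYSTEM block stays in play.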

A secondary branch of this research investigated whether AISF training could serve sealed IoT devices, specifically “AI powered” children’s toys. The challenge with such devices is that no meaningful client-side intervention surface exists, so the model’s training weights are the only behavioral corrective surface available. Strict testing with a zero-tolerance deployment gate reflected the use case: unsupervised child interaction with no runtime correction possible. No model cleared the gate; full results are in Appendix 3.
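The gate itself reduces to a single hard threshold, which is worth stating explicitly because it differs from every other evaluation in this chapter. A sketch (model names and scores are placeholders, consistent with no model clearing):

```python
# Zero-tolerance deployment gate for sealed devices: with no runtime
# correction surface, anything short of 100% battery compliance fails.
# Model names and scores below are placeholders.

def clears_gate(compliance_pct: float) -> bool:
    """Sealed-device gate: only perfect battery compliance deploys."""
    return compliance_pct >= 100.0

candidates = {"model_a": 99.6, "model_b": 81.9}
print({name: clears_gate(score) for name, score in candidates.items()})
```

A 99.6% score that would be excellent for a supervised chat deployment still fails here, because the one miss in 250 interactions lands on a child with no adult and no runtime correction in the loop.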

The same mid-range consumer gaming PC produced four AISF-trained adapters across four independent architectures. Three of those four reached 95% battery compliance or above; the fourth surfaced a documented architectural mismatch, not a training failure. Meaningful AI safety improvement doesn’t need more datacenters. What it needs is more attention to real-world failures and the rapidly proliferating variety of user-facing deployment architectures. Currently, every single one of them is an unmediated failure surface.



© 2025-2026 Leonard Rojas. All rights reserved.
