Appendix 2: OLM Behavioral Compliance Training
Experimental Results Across Multiple Architectures
Author: Leonard Rojas
Date: 2026-05-23
Status: Final (V11 cohort complete 2026-05-17)
Screen reader users: table-heavy research data. Navigation via Regions and Headings recommended.
Abstract
This appendix reports the methods and results of a multi-experiment fine-tuning study assessing whether the Four Laws of Instanced AI and WCAG 2.2-AA [4] accessibility principles can be embedded into open-weight large language models via QLoRA [2] on consumer-grade hardware. Ten trained adapters across five distinct architectures (Mistral 7B, Llama 3.1 8B, Gemma 2 9B, Qwen3-8B, Mistral Nemo 12B) were evaluated against a domain-specific compliance battery, the Google Research IFEval benchmark [3] (541 prompts), and a custom P2 directive benchmark (Track A, 480 prompts); the V11 cohort (Section 5.9) additionally used a 534-question Bar Exam compliance battery as the primary instrument, with GPQA Diamond [6] applied as a secondary capability check. Most trained adapters achieved battery integration scores of 95% or above on consumer hardware. IFEval results were mixed to negative on raw scores; application of a three-tier accessibility-aware instruction classification changes the interpretation materially. One architecture (Llama 3.1 8B) exhibited a pretraining artifact that partially blocked compliance integration. Results, training data, and evaluation scripts for Experiments 1-11 are available in the OLM reproducibility package.
1. Background and Research Questions
The AI Stability Framework delivers behavioral constraints to AI language model instances through client-side session injection: structured metadata blocks containing the Four Laws of Instanced AI and WCAG 2.2-AA accessibility guidelines are prepended to user input at Turn 1 and at defined re-injection intervals. This approach operates at the user interface layer (Micro layer) and is contingent on the injection being present in the active context window.
The OLM research track investigated whether the same behavioral constraints could be embedded directly at the model layer (Macro layer) via parameter-efficient fine-tuning, producing a model that exhibits compliant behavior without requiring session injection. This would constitute a Defense in Depth architecture: model- level training operating in combination with runtime injection, with the behavioral constraints reinforced at both layers.
Primary research question: Can Four Laws and WCAG 2.2-AA compliance be trained into an open-weight LLM via QLoRA on consumer-grade hardware, and if so, at what integration rate?
Secondary research questions: 1. Does compliance training affect general instruction-following capability as measured by IFEval? 2. Do training effects generalize across model architectures? 3. What failure modes appear, and are they architecture-specific or universal? 4. Does this training produce measurable changes in output verbosity?
2. Hardware and Software Environment
2.1 Hardware
| Component | Specification |
|---|---|
| CPU | Intel Core i9-9900K, 8c/16t, 3.60 GHz |
| RAM | 32 GB |
| GPU (Exp 1) | NVIDIA RTX 4060 Ti, 8 GB VRAM GDDR6 |
| GPU (Exp 2+) | NVIDIA RTX 5060 Ti, 16 GB VRAM GDDR7 |
| CUDA cores | 4608 (RTX 5060 Ti) |
| Memory bandwidth | 448 GB/s (RTX 5060 Ti) |
| Compute capability | sm_120 (Blackwell) |
Experiment 1 was conducted on the RTX 4060 Ti prior to hardware upgrade. All subsequent experiments and the hardware consistency re-run (completed 2026-03-29) were conducted on the RTX 5060 Ti. The 8 GB VRAM constraint on the 4060 Ti limited training to models of approximately 7B parameters at 4-bit quantization.
The RAM specification in the table above (32 GB) applies to Experiments 1-8. A hardware upgrade to 64 GB was completed during the Nemo V-series development period; the 64 GB configuration was required for the Nemo 12B adapter merge step in the V11 cohort. All other V11 training parameters are hardware-compatible with the 32 GB configuration. Primary training environment migrated from Windows to Debian GNU/Linux 13 during this phase; this was a change of environment, not of methodology.
2.2 Software
| Package | Version |
|---|---|
| Python | 3.10.11 |
| PyTorch | 2.10.0+cu128 |
| Transformers | 5.0.0 |
| PEFT | 0.18.1 |
| bitsandbytes | 0.49.2 |
| Accelerate | 1.12.0 |
| Datasets | 4.5.0 |
| CUDA runtime | 12.8 |
The cu128 PyTorch build is required for native sm_120 (Blackwell) kernel execution on the RTX 5060 Ti. Earlier builds (cu118, cu124) fall back to PTX compilation, eliminating the performance benefit of the architecture.
PyTorch was upgraded to 2.11.0+cu128 for the V11 cohort (Section 5.9). The version in the table above reflects Experiments 1-8.
3. Training Methodology
3.1 Training Dataset
The primary training corpus is a proprietary question-answer dataset representing the 25 instruction types defined in the Four Laws of Instanced AI. The dataset underwent iterative refinement across experiments:
| Version | Examples | Format | Notes |
|---|---|---|---|
| Base | 452 | Alpaca | Experiments 1-2 |
| LANG | 565 | Alpaca | +69 multilingual; contamination removed |
| CHAT (Llama) | 618 | Llama 3.1 chat | 565 base + 53 meta-suppression |
| CHAT (Gemma) | 618 | Gemma 2 chat | Same content, format-converted |
| CHAT (Qwen) | 618 | ChatML | Same content, format-converted |
Contamination removed from LANG version (Exp 3, 2026-03-11): 25 runtime-corruption
blocks, 36 Tier 3 format-directive examples, 22 quotation examples with preamble
echo artifacts, 1 malformed token. Clean replacements added for the affected
categories. IFEval training data contamination (format-preamble echo scoring
artificial hits on required_word) was identified in Exp 4 and corrected before
the hardware consistency re-run.
Meta-suppression examples (53 per chat variant): system prompt contains Framework compact block; user turn contains a general-domain question; assistant response contains zero framework-referential content. Purpose: prevent model from treating Framework doctrine as primary subject matter (see Section 7.7).
3.2 QLoRA Configuration
All experiments use 4-bit NF4 quantization with double quantization enabled. The following configuration was validated to hardware constraint (confirmed 2026-03-28):
| Parameter | Value |
|---|---|
| Quantization | 4-bit NF4 + double quant |
| Compute dtype | bfloat16 (sm_120 native) |
| LoRA rank (r) [1] | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj |
| MLP modules | Excluded |
| Max sequence length | 512 |
| Batch size | 2 |
| Gradient accumulation | 4 (effective batch 8) |
| Epochs | 10 |
| Learning rate | 1e-4 |
| Optimizer | adamw_torch (Instruct) / adamw_bnb_8bit (base) |
| device_map | “auto” with max_memory {0: “12288MiB”, “cpu”: “24GiB”} |
Gradient checkpointing is model-dependent: not required for Mistral 7B (16 GB has
headroom); required for Llama 3.1 8B in both prepare_model_for_kbit_training()
and TrainingArguments (without it, activation memory overflows 16 GB before
training starts, producing 30+ sec/step thrashing against shared RAM).
MLP modules (gate_proj, up_proj, down_proj) are excluded from LoRA targets. This choice was validated to 100% integration on Mistral 7B (Exp 1) and constitutes the confirmed minimum-sufficient configuration. Potential effects of including MLP targets are noted in Section 7.8 (Llama architecture).
3.3 Training Format Variants
Two prompt formats were used across experiments:
Alpaca format (Experiments 1-3): Framework content in the instruction body. The model learns the directives as direct subject matter. Adapter function is not contingent on receiving a system prompt at inference time.
Chat format (Experiments 5-6 and Qwen3 re-run): Framework compact system prompt (approximately 100 tokens) in the model-appropriate system slot. The model learns to apply framework directives when present in a system prompt. Adapter function is contingent on receiving a system prompt at inference time (see Section 7.6).
4. Validation Instruments
4.1 Custom Compliance Battery
The compliance battery is a domain-specific question-answer set covering the 25 instruction types represented in the Four Laws training data. Evaluation is automated: responses are checked for presence of expected key terms and structural compliance markers by regex and string matching. Battery sizes range from 451 (Mistral 7B, Alpaca format) to 618 (chat-format models) depending on the training data version used. The V11 cohort (Section 5.9) used a 534-question Bar Exam compliance battery as the primary instrument. This battery is a distinct question set from the per-experiment batteries used in Experiments 1-8; scores are not directly comparable across the two versions.
Important constraint: The battery is a measure of framework domain compliance only. Base models without training are never evaluated against the battery, as they have no exposure to the Four Laws and will fail by definition. Battery scores reflect adapter performance, not base model capability.
No confidence interval is reported for battery results; the evaluation is deterministic given a fixed test set.
4.2 IFEval
IFEval is a published benchmark from Google Research consisting of 541 prompts covering 25 verifiable instruction types, evaluated at both prompt-level (all instructions in a prompt must pass) and instruction-level (each instruction scored independently). Scoring variants: strict (case-sensitive, exact match) and loose (normalized). The benchmark has been applied to GPT-3.5, GPT-4, and models on the Hugging Face Open LLM Leaderboard.
Confidence interval: +/-4.3 percentage points at 95% (n=541).
Scoring: This appendix reports standard IFEval strict prompt-level scores throughout. Raw scores are reported where relevant for comparison.
All IFEval inference runs include the framework’s compact system prompt for chat-format adapters. Base model runs include no system prompt unless otherwise noted.
Data quality note: The IFEval source file (input_data.jsonl) contains typographic (curly) quotation marks in several prompts. These differ from ASCII quote characters and can produce false negatives at string-matching evaluation boundaries. A sanitization step (sanitize_text(), applied 2026-04-19) corrects this at the inference pipeline boundary. All IFEval results in this appendix reflect the sanitized pipeline.
Instrument scope: IFEval is the general instruction-following measure for Experiments 1-8 and is retained for cross-experiment comparability. The V11 cohort does not include IFEval; see Section 4.1 for the V11 primary instrument.
4.3 Track A Custom Benchmark
Track A is a purpose-built P2 directive compliance benchmark (480 prompts, 8 constraint types, 60 prompts per type) designed to test framework-relevant instruction following on WCAG-neutral prompts only. It was developed in Experiment 4 in response to the identification of WCAG-conflicting categories in IFEval (see Section 4.4).
| Constraint Type | N | Evaluation Method |
|---|---|---|
| word_count_exact | 60 | Whitespace word count, +/-2 tolerance |
| sentence_count_exact | 60 | NLTK sent_tokenize, exact match |
| forbidden_word | 60 | Whole-word regex, case-insensitive |
| required_word | 60 | Whole-word regex, case-insensitive |
| no_comma | 60 | Comma character presence/absence |
| language_target | 60 | langdetect library |
| word_frequency | 60 | Whole-word count, exact match |
| single_sentence | 60 | NLTK sent_tokenize == 1 |
Confidence interval: +/-4.5 percentage points at 95% (n=480).
4.4 IFEval Instruction Tier Classification
Identified in Experiment 3. Applied retroactively to all experiments.
Several IFEval instruction types require outputs that conflict with WCAG 2.2 accessibility principles. A framework-trained model with WCAG mapped to P0 will refuse or deprioritize these instructions on principle; IFEval scores these as failures. Standard IFEval reporting does not distinguish between failures caused by inability and failures caused by principled WCAG compliance, producing systematically understated performance for accessibility-trained models on affected categories.
Throughout this appendix, the shorthand T1, T2, and T3 refers to Tier 1, Tier 2, and Tier 3 respectively as defined below. The “Tier” column in all experiment tables uses this shorthand.
Tier 1 (T1) - WCAG-neutral: Valid test of P2 directive compliance. No WCAG
conflict. Categories: keywords:forbidden_words, keywords:frequency,
keywords:existence, language:response_language,
length_constraints:number_words, length_constraints:number_sentences,
punctuation:no_comma, detectable_format:constrained_response.
Tier 2 (T2) - WCAG-ambiguous: Some semantic function; potential WCAG
interaction context-dependent. Included in analysis with caveats. Categories:
length_constraints:number_paragraphs, detectable_format:title,
detectable_format:multiple_sections, detectable_format:json_format,
detectable_content:postscript, change_case variants, startend variants,
combination:two_responses, detectable_format:number_bullet_lists.
Tier 3 (T3) - WCAG-conflicting: Instructions requiring semantically empty,
redundant, or decorative output. Failure here is consistent with WCAG P0
compliance; these categories are excluded from framework performance evaluation.
Categories: combination:repeat_prompt (verbatim redundancy, WCAG Understandable),
detectable_format:number_highlighted_sections (decorative emphasis, WCAG 1.3.3),
detectable_content:number_placeholders (non-meaningful content, WCAG meaningful
content requirement).
4.5 GPQA Diamond
GPQA Diamond is a graduate-level science benchmark [6] consisting of multiple-choice questions designed to resist non-expert retrieval strategies. It was applied to V11 cohort models (Section 5.9) as a secondary capability check only.
The benchmark’s difficulty substantially exceeds the knowledge range of small-parameter models; scores for 7B-12B models typically cluster near random-chance levels. Its value in this study is bounded: a meaningful decrease from baseline would indicate capability degradation attributable to compliance fine-tuning. An increase or flat result indicates no such degradation.
Confidence interval: +/-7.0 pp at 95% (n=198, Diamond subset).
GPQA results are reported in Section 5.9 as secondary data only. They are not included in the cross-model summary table (Section 6), which reports battery and IFEval results for cross-experiment comparability.
5. Experiments
5.1 Experiment 1: Mistral 7B Base, Alpaca Format
Hardware: RTX 4060 Ti (8 GB VRAM) Model: Mistral 7B v0.3 (base, no instruction tuning) Training data: 452 examples, Alpaca format
| Run | Battery | IFEval Strict | Tok delta |
|---|---|---|---|
| Base (no adapter) | – | 15.0% | – |
| AISF-tuned | 99.8% | 13.7% | +4.3% (+25.0 words) |
IFEval delta: -1.3 pp (strict prompt-level). Within +/-4.3 pp CI. Result: Null. AI Stability Framework training does not measurably alter IFEval performance on the base model. Adapter installs Four Laws compliance (0% to 99.8% on battery) without disturbing general instruction-following behavior.
The +4.3% verbosity increase is the only positive token delta in the dataset. Base models (no instruction tuning) increase verbosity slightly; all Instruct models show reduction (see Section 7.5).
One artifact observed: 1-2 prompts produced degenerate repetition output from the fine-tuned model not observed in the base run. Magnitude insufficient to affect significance; noted for completeness.
Note: IFEval for this experiment was conducted on the RTX 4060 Ti. Battery was re-run on RTX 5060 Ti as part of the hardware consistency re-run (2026-03-29), confirming 99.8% (451/452). IFEval numbers are from the original 4060 Ti run.
5.2 Experiment 2: Mistral 7B Instruct, Alpaca Format
Hardware: RTX 5060 Ti (16 GB VRAM) Model: Mistral 7B v0.3 Instruct Training data: 452 examples, Alpaca format
| Run | Battery | IFEval Strict | Tok delta |
|---|---|---|---|
| Base (no adapter) | – | 43.4% | – |
| AISF-tuned | 100.0% | 35.9% | -42.5% (-96.9 words) |
IFEval delta: -7.6 pp (strict prompt-level). Outside +/-4.3 pp CI. Result: Negative. AI Stability Framework training degrades IFEval performance at the full-score level. Category-level analysis identifies the primary driver:
| Category | Base | AISF | Delta | Tier |
|---|---|---|---|---|
| language:response_language | 74.2% | 29.0% | -45.2 pp | T1 |
| number_highlighted_sections | 83.3% | 43.8% | -39.6 pp | T3 |
| nth_paragraph_first_word | 33.3% | 0.0% | -33.3 pp | T2 |
| startend:quotation | 73.2% | 46.3% | -26.8 pp | T2 |
| punctuation:no_comma | 10.6% | 39.4% | +28.8 pp | T1 |
| detectable_format:multiple_sections | 64.3% | 85.7% | +21.4 pp | T2 |
| keywords:forbidden_words | 65.3% | 77.6% | +12.2 pp | T1 |
The -45.2 pp loss on language:response_language is a catastrophic forgetting
artifact: the training dataset is English-only; fine-tuning on English-only data
suppresses multilingual output in a multilingual base model. This single category
accounts for the majority of the overall negative result. Excluding it and T3
categories, the picture is substantially different.
Tier 1 instruction-level average (7 WCAG-neutral categories, language excluded):
data not separately reported for Exp 2 (Exp 3 LANG fix was applied as the
correction; see 5.3). The no_comma +28.8 pp gain is the largest P2 signal in
the dataset and is the first appearance of the cross-architecture pattern
described in Section 7.4.
Token delta of -42.5% (base avg 228.3 words, framework avg 131.3 words) indicates substantial verbosity reduction after Instruct-model framework training.
5.3 Experiment 3: Llama 3.1 8B Instruct, AISF+LANG (Alpaca Format)
Hardware: RTX 5060 Ti Model: Meta Llama 3.1 8B Instruct Training data: 565 examples (LANG version), Alpaca format LANG fix: 69 multilingual examples added covering 9 languages to address the English-only catastrophic forgetting artifact identified in Exp 2.
| Run | Battery | IFEval Strict | Tok delta |
|---|---|---|---|
| Base (no adapter) | – | 42.9% | – |
| AISF+LANG | 89.9% | 34.9% | -66.4% (-188.8 words) |
IFEval delta: -7.9 pp (strict prompt-level, hardware re-run). Outside +/-4.3 pp CI.
Result: Mixed. Overall negative, but positive signals on P2-mediated Tier 1
categories. LANG fix confirmed effective: language:response_language loss reduced
from -45.2 pp (Exp 2, no fix) to -3.2 pp (Exp 3, within CI).
Selected category-level results (instruction-level strict):
| Category | Base | AISF+LANG | Delta | Tier |
|---|---|---|---|---|
| keywords:forbidden_words | 73.5% | 85.7% | +12.2 pp | T1 |
| length_constraints:number_sentences | 30.8% | 51.9% | +21.2 pp | T1 |
| punctuation:no_comma | 69.7% | 81.8% | +12.1 pp | T1 |
| detectable_format:constrained_response | 90.0% | 40.0% | -50.0 pp | T1 |
| language:response_language | 83.9% | 80.6% | -3.2 pp | T1 |
| number_highlighted_sections | 89.6% | 33.3% | -56.3 pp | T3 |
| combination:repeat_prompt | 39.0% | 0.0% | -39.0 pp | T3 |
Tier 1 instruction-level average (7 WCAG-neutral categories):
| Base | AISF+LANG | Delta | |
|---|---|---|---|
| Tier 1 avg (7 cat) | 64.3% | 62.1% | -2.2 pp |
-2.2 pp Tier 1 result is within CI. Positive P2 gains on forbidden_words,
number_sentences, and no_comma are genuine. The primary Tier 1 drag is
constrained_response (-50.0 pp): the Framework-trained model produces fuller
responses than the instruction requires, consistent with the verbosity pattern
(see Section 7.5). This is a training signal interaction, not a capability loss.
Token delta of -66.4% (base avg 284.3 words, Framework avg 95.5 words) is the second-largest reduction in the dataset and substantially exceeds Mistral Instruct (-42.5%). The over-suppression pattern is specific to this architecture and training format combination; see Section 7.8.
5.4 Experiment 4: Track A Benchmark, Llama 3.1 8B
Purpose: Test P2 directive compliance on WCAG-neutral prompts only. Hardware: RTX 5060 Ti Adapters tested: Base model and Exp 3 AISF+LANG adapter.
| Constraint Type | Base | AISF+LANG | Delta |
|---|---|---|---|
| forbidden_word | 85.0% | 98.3% | +13.3 pp |
| language_target | 60.0% | 60.0% | 0.0 pp |
| no_comma | 95.0% | 86.7% | -8.3 pp |
| required_word | 96.7% | 25.0% | -71.7 pp |
| sentence_count_exact | 0.0% | 38.3% | +38.3 pp |
| single_sentence | 0.0% | 13.3% | +13.3 pp |
| word_count_exact | 0.0% | 0.0% | 0.0 pp |
| word_frequency | 6.7% | 26.7% | +20.0 pp |
| Overall | 42.9% | 43.5% | +0.6 pp |
Result: Mixed. Overall flat (+0.6 pp, within +/-4.5 pp CI). Strong structural
gains on counting categories where the base model scores zero
(sentence_count_exact: 0.0% to 38.3%; single_sentence: 0.0% to 13.3%).
word_count_exact is a capability floor for both models across all experiments.
The -71.7 pp loss on required_word is the most severe single-category failure
in the dataset. Analysis of 45 failure cases: 84% (38/45) are instances of the
model producing Framework doctrine text (Four Laws hierarchy, P1/P2 obligation framing)
instead of answering the question. The required word was present in the training
data as conceptual vocabulary; the model treats it as a cue to reproduce Framework
content rather than to include the word in a direct answer. This failure mode is
named “meta-chatter bleed” (see Section 7.7). It was identified here and partially
corrected in Experiment 5.
Contamination note: earlier Track A results (2026-03-10) from the contaminated
600-example dataset showed 53.54% overall and 66.67% on required_word. The
format-preamble echo in contaminated examples was accidentally including required
words in the preamble prefix, producing unearned hits. All results above are from
the clean 565-example retrained adapter.
5.5 Experiment 5: Llama 3.1 8B Instruct, AISF+CHAT
Purpose: Test chat format training and meta-suppression fix for Exp 4
required_word failure.
Hardware: RTX 5060 Ti
Training data: 618 examples (TRAIN_CHAT_LANG_v1.txt): 565 base examples
converted to Llama 3.1 chat format + 53 meta-suppression counter-examples.
Track A results (Exp 4 adapter vs Exp 5 adapter vs base):
| Constraint Type | Base | +LANG (E4) | +CHAT (E5) | E4->E5 |
|---|---|---|---|---|
| forbidden_word | 85.0% | 98.3% | 90.0% | -8.3 pp |
| language_target | 60.0% | 60.0% | 91.7% | +31.7 pp |
| no_comma | 95.0% | 86.7% | 86.7% | 0.0 pp |
| required_word | 96.7% | 25.0% | 85.0% | +60.0 pp |
| sentence_count_exact | 0.0% | 38.3% | 8.3% | -30.0 pp |
| single_sentence | 0.0% | 13.3% | 0.0% | -13.3 pp |
| word_count_exact | 0.0% | 0.0% | 1.7% | +1.7 pp |
| word_frequency | 6.7% | 26.7% | 11.7% | -15.0 pp |
| Overall | 42.9% | 43.5% | 46.9% | +3.3 pp |
required_word recovery: +60.0 pp (25.0% to 85.0%), confirming meta-suppression
training is effective for this failure mode. Still -11.7 pp below base; residual
bleed remains.
language_target: +31.7 pp vs Exp 4 (60.0% to 91.7%). This gain is attributable
to chat format activation: the Instruct model’s native multilingual behavior is
unlocked by format-matched prompting. The LANG fix in the training data alone did
not produce this gain.
Counting constraint regression (sentence_count_exact -30.0 pp, single_sentence -13.3 pp, word_frequency -15.0 pp from Exp 4): two causes are present. First, the 53 meta-suppression examples all use unconstrained responses, diluting counting training signal. Second, the Track A inference used Alpaca prompt format while the model was trained on chat format – a format mismatch that may attenuate counting activation. Distinguishing these causes requires a chat-format Track A inference run not completed in this study.
IFEval results (Exp 5 vs Exp 3 vs base):
| Metric | Base | Exp3 +LANG | Exp5 +CHAT | E3->E5 |
|---|---|---|---|---|
| Strict prompt-level | 42.9% | 34.9% | 34.6% | -0.4 pp |
| Loose prompt-level | – | 40.0% | 42.5% | +2.5 pp |
Tier 1 instruction-level avg (8 WCAG-neutral categories):
| Base | Exp3 +LANG | Exp5 +CHAT | |
|---|---|---|---|
| Tier 1 avg (8 cat) | 64.3% | 62.1% | 65.5% |
Exp 5 is the first Framework-trained Llama adapter to exceed base on the Tier 1
WCAG-neutral surface (65.5% vs 64.3%, +1.2 pp). Primary driver: chat format
training restores constrained_response from 40.0% (Exp 3) to 80.0% (+40.0 pp),
returning it near-baseline. The Alpaca format training introduced verbosity that
penalized compact-output instructions; chat format training does not.
Hardware consistency re-run (2026-03-28/29):
The AISF+CHAT adapter was retrained with the meta-suppression fix applied at dataset load time (meta-suppression examples filtered by regex, 53 examples removed, effective training set 565 examples). Battery and IFEval were then re-run.
| Run | Result |
|---|---|
| Battery (AISF+CHAT) | 51.9% (321/618) |
| IFEval strict (AISF+CHAT) | 34.6% |
| Token delta vs base | -4.2% (-11.8 words) |
| constrained_response (strict) | 100.0% |
| multiple_sections (strict) | 100.0% |
Battery result of 51.9% is lower than both base (81.9%) and AISF+LANG (89.9%). This is a structural artifact of chat format training, not an integration failure. The battery sends prompts with no system prompt (user message only). The AISF+CHAT adapter learned “apply Four Laws when Framework system prompt is present.” Without the system prompt, the activation pattern does not fire. IFEval (which includes the Framework system prompt in inference) is the valid evaluation surface for this adapter. See Section 7.6 for full discussion.
Token delta of -4.2% for AISF+CHAT is substantially smaller than AISF+LANG (-66.4%). Chat format training embeds the Framework as a system-level operational context rather than as content subject matter, preserving the Instruct model’s native verbosity behavior.
5.6 Experiment 6: Gemma 2 9B Instruct, AISF+CHAT (Cross-Architecture)
Purpose: Test whether training effects generalize to a different architecture,
tokenizer, and chat format.
Hardware: RTX 5060 Ti
Model: Google DeepMind Gemma 2 9B Instruct
Training data: 618 examples (TRAIN_CHAT_GEMMA_v1.txt): same content as Exp 5
converted to Gemma 2 chat format. Gemma 2 has no native system slot; AISF compact
system prompt placed in the user turn of the first exchange.
IFEval results:
| Metric | Base | AISF+CHAT | Delta |
|---|---|---|---|
| Strict prompt-level (raw) | 57.49% | 56.01% | -1.48 pp |
| Strict instruction-level (raw) | 66.43% | 63.91% | -2.52 pp |
| Loose prompt-level (raw) | 60.26% | 58.41% | -1.85 pp |
| Loose instruction-level (raw) | 68.82% | 66.55% | -2.28 pp |
All deltas within +/-4.3 pp CI. Result: Null.
Tier 1 instruction-level averages:
| Base | AISF+CHAT | Delta | |
|---|---|---|---|
| Tier 1 avg (8 cat) | 72.15% | 69.38% | -2.77 pp |
| Tier 1 excl. language (7 cat) | 69.57% | 69.61% | +0.04 pp |
Excluding language:response_language (English-only training artifact, -22.6 pp
on Gemma 2, same mechanism as Exp 2), Tier 1 is statistically flat (+0.04 pp).
Selected category-level results:
| Category | Base | AISF+CHAT | Delta | Tier |
|---|---|---|---|---|
| number_highlighted_sections | 85.4% | 29.2% | -56.2 pp | T3 |
| language:response_language | 90.3% | 67.7% | -22.6 pp | T1 |
| combination:repeat_prompt | 58.5% | 39.0% | -19.5 pp | T3 |
| punctuation:no_comma | 86.4% | 90.9% | +4.5 pp | T1 |
| startend:quotation | 41.5% | 85.4% | +43.9 pp | T2 |
| detectable_content:postscript | 65.4% | 84.6% | +19.2 pp | T2 |
| detectable_format:json_format | 47.1% | 64.7% | +17.6 pp | T2 |
| keywords:existence | 66.7% | 74.4% | +7.7 pp | T1 |
| detectable_format:multiple_sections | 42.9% | 50.0% | +7.1 pp | T2 |
number_highlighted_sections -56.2 pp (Tier 3): identical mechanism and magnitude
to Exp 3 (-56.3 pp) and Exp 2 (-39.6 pp). This category reliably drops after
AISF training across all architectures. Framework-trained models reject decorative
markdown emphasis on WCAG 1.3.3 grounds. IFEval scores this as failure.
punctuation:no_comma +4.5 pp: fourth consecutive positive result for this
category across instruct-model experiments. See Section 7.4.
startend:quotation +43.9 pp: large gain not attributable to training content.
No quotation-specific examples exist in TRAIN_CHAT_GEMMA_v1.txt. Probable
mechanism: chat format training alters the model’s default output framing for
certain prompt types as an incidental effect of format conditioning.
Track A results:
| Constraint Type | Base | AISF+CHAT | Delta |
|---|---|---|---|
| forbidden_word | 95.0% | 95.0% | 0.0 pp |
| language_target | 100.0% | 100.0% | 0.0 pp |
| no_comma | 95.0% | 95.0% | 0.0 pp |
| required_word | 80.0% | 86.7% | +6.7 pp |
| sentence_count_exact | 96.7% | 95.0% | -1.7 pp |
| single_sentence | 100.0% | 100.0% | 0.0 pp |
| word_count_exact | 0.0% | 11.7% | +11.7 pp |
| word_frequency | 38.3% | 6.7% | -31.7 pp |
| Overall | 75.6% | 73.8% | -1.9 pp |
Result: Null (-1.9 pp, within +/-4.5 pp CI). Gemma 2 base Track A score (75.6%)
is substantially above Llama 3.1 8B base (42.9%) on the same benchmark. Gemma 2
base already achieves ceiling or near-ceiling on language_target (100%),
single_sentence (100%), no_comma (95%), and forbidden_word (95%), leaving
minimal room for Framework training to produce visible gains on those categories.
word_frequency -31.7 pp: meta-chatter bleed recurrence. Same mechanism as
Exp 4 required_word failure: model produces Framework-referential text at the
required frequency rather than including the target word in a direct answer. The
53 meta-suppression examples reduced this on required_word (+6.7 pp) but were
insufficient to prevent it on word_frequency, which requires the target word to
appear a specific number of times rather than at least once.
Battery result: 99.2% (613/618), 5060 Ti, 2026-03-28. Token delta: -37.9% (base avg to Framework avg words/response).
5.7 Hardware Consistency Re-run: Qwen3-8B
Purpose: Complete the cross-model comparison set with a fourth distinct architecture. All prior models trained on 5060 Ti; Qwen3 adds a pre-instruction- tuned base model to the comparison. Hardware: RTX 5060 Ti Model: Qwen3-8B base (not instruction-tuned) Training data: 618 examples (ChatML format, Framework compact system prompt) Adapter: qwen3-8b-aisf-chat
| Run | Result |
|---|---|
| Battery (AISF+CHAT adapter) | 95.6% (591/618) |
| IFEval strict (base, no adapter) | 13.5% |
| IFEval strict (AISF+CHAT adapter) | 46.4% |
| IFEval strict delta | +32.9 pp |
| Token delta (avg words/response) | -71.8% (-138.9 words) |
Framing note: Qwen3-8B is a pre-instruction-tuned base model. Its 13.5% base IFEval score reflects the model’s instruction-following capability floor without any fine-tuning. Framework QLoRA training simultaneously installs Four Laws compliance and instruction-following capability. The +32.9 pp IFEval gain is not comparable to the IFEval deltas for Instruct-model experiments (Exps 2, 3, 5, 6), where the base model already has instruction-following capability and the adapter modifies that existing capability. The Qwen3 result measures the combined effect of instruction tuning and Framework compliance training in a single fine-tuning pass.
For reference: Llama 3.1 8B Instruct base (no adapter) scores 42.9% on IFEval strict; Gemma 2 9B Instruct base scores 57.3%. The AISF+CHAT Qwen3 adapter produces 46.4% – within the range of purpose-built Instruct models – starting from a base model that scores 13.5%.
Token delta of -71.8% is the largest verbosity reduction in the dataset.
5.8 Mistral Nemo 12B Instruct, AISF+ALPACA
Purpose: Test compliance training on the largest model in the dataset; address inference failures identified in prior runs (meta-chatter suppression, multilingual coverage gaps, verbosity floor). Hardware: RTX 5060 Ti 16GB GDDR7 Model: mistralai/Mistral-Nemo-Instruct-2407 (12.2B parameters) Training data: 648 examples (Alpaca format)
| File | Examples | Contents |
|---|---|---|
| TRAIN_FULL_BATTERY_REF_SPEC_LANG.txt | 565 | AISF Q&A (Four Laws, WCAG); multilingual AISF Q&A (es/fr/de); robustness |
| TRAIN_NEMO_META_SUPPRESSION.txt | 56 | Meta-suppression counter-examples: injection prefix + clean general-domain responses |
| TRAIN_NEMO_MULTILINGUAL.txt | 27 | General-domain questions in es/fr/de/pt/ja/it with full injection prefix; word_frequency examples |
Architecture notes: 40 transformer layers vs. 32 in Mistral 7B; Grouped Query Attention (32 Q heads, 8 KV heads); Tekken tokenizer (tiktoken-based). Gradient checkpointing required at 12B parameter count.
Inference note: Tekken tokenizer does not register <|endoftext|> as a HuggingFace
special token. skip_special_tokens=True in tokenizer.decode() does not strip it;
the EOS token appears as literal text in decoded output. Without explicit stripping,
startend categories score near zero (quotation -80.5 pp, end_checker -76.9 pp in an
initial run) because the evaluator reads <|endoftext|> as the last characters of
every response. All figures below are from the corrected inference run with explicit
EOS stripping applied.
IFEval results:
| Metric | Base | AISF | Delta |
|---|---|---|---|
| Strict prompt-level (raw) | 52.87% | 36.78% | -16.08 pp |
| Strict instruction-level (raw) | 63.43% | 46.52% | -16.91 pp |
Tier breakdown (instruction-level):
| Tier | Base | AISF | Delta | Notes |
|---|---|---|---|---|
| T1 WCAG-neutral | 61.9% | 57.2% | -4.7 pp | Near null |
| T2 WCAG-ambiguous | 66.3% | 40.8% | -25.5 pp | Primary driver of overall negative |
| T3 WCAG-conflicting | 58.6% | 33.6% | -25.0 pp | Principled refusals + language gaps |
Selected category results (instruction-level strict):
| Category | Tier | Base | AISF | Delta |
|---|---|---|---|---|
| punctuation:no_comma | T1 | 27.3% | 56.1% | +28.8 pp |
| keywords:forbidden_words | T1 | 63.3% | 75.5% | +12.2 pp |
| length_constraints:nth_paragraph_first_word | T2 | 0.0% | 16.7% | +16.7 pp |
| detectable_format:constrained_response | T1 | 100.0% | 100.0% | 0.0 pp |
| detectable_format:json_format | T2 | 64.7% | 0.0% | -64.7 pp |
| combination:two_responses | T2 | 70.8% | 25.0% | -45.8 pp |
| change_case:english_lowercase | T2 | 61.5% | 17.9% | -43.6 pp |
| length_constraints:number_words | T1 | 61.5% | 25.0% | -36.5 pp |
| startend:end_checker | T2 | 80.8% | 50.0% | -30.8 pp |
| startend:quotation | T2 | 82.9% | 61.0% | -22.0 pp |
Failure modes:
json_format -64.7 pp (T2): total failure. The Framework-trained adapter produces no
JSON output. Not observed in prior experiments. Mechanism not yet fully characterized;
Alpaca-format training conditioning is the leading hypothesis.
combination:two_responses -45.8 pp (T2): model does not produce two distinct labeled
response sections. Probable interaction with verbosity suppression: generating two full
responses requires more output than the training data incentivizes.
change_case:english_lowercase -43.6 pp (T2): 4 confirmed zero-word outputs on
prompts requesting all-lowercase text (e.g., “write a letter in all lowercase letters”).
WCAG plain language training conditions the model toward proper capitalization; prompts
requesting capitalization removal trigger a principled non-response. Classification as
T2 (ambiguous) rather than T3 (conflicting) may require revision.
length_constraints:number_words -36.5 pp (T1): model average of 72.4 words/response
conflicts with minimum-word-count prompts. Verbosity training oversuppressed output
length; falls into T1 because the WCAG interaction is indirect.
startend residual: -30.8 pp end_checker, -22.0 pp quotation. Partially corrected
by EOS fix (from -76.9 pp and -80.5 pp respectively); remaining failures are genuine
format-conditioning losses.
language:response_language -12.9 pp (T1): Hindi (1 zero-word response) and Urdu
(1 zero-word response) not covered by multilingual training data. Base model generated
correctly on same hardware. Coverage gap in training data, not system limitation.
Battery test results: 562/565 (99.5%) – highest battery rate in the dataset. 3 failures: format constraint suppressed AISF key terms in output; framework comprehension intact.
Token delta: -64.4% (base 203.0 words avg, AISF 72.4 words avg, -130.7 words/response).
Assessment: T1 delta of -4.7 pp confirms the central finding – Framework training does not degrade WCAG-neutral instruction following. Battery 99.5% is the best result in the dataset. Raw IFEval negative delta is driven by T2/T3 concentrated failures (json_format, case transforms, two_responses). Not suitable for deployment as-is. V2 training data required.
5.9 V11 Cohort: Closed Compliance Validation
Purpose: Final closed-cohort compliance validation across three architectures using a common 534-question Bar Exam battery. Hardware: RTX 5060 Ti (16 GB VRAM GDDR7); 64 GB RAM (upgrade completed; required for Nemo 12B merge step) Training environment: Debian GNU/Linux 13 (migrated from Windows during this phase; see Section 2.1) Models: Mistral Nemo 12B Instruct, Mistral 7B Instruct v0.3, Gemma 2 9B IT Cohort designation: V11 (closed 2026-05-17)
Bar Exam battery results:
| Model | Params | Score | Pass/Total |
|---|---|---|---|
| Mistral Nemo 12B Instruct | 12.2B | 99.6% | 532/534 |
| Mistral 7B Instruct | 7.24B | 99.3% | 530/534 |
| Gemma 2 9B IT | 9.46B | 95.7% | 511/534 |
All three models meet the 95% integration threshold. Failures in each model are isolated to format-constraint edge cases; framework comprehension was intact across all failures.
GPQA Diamond secondary results (see Section 4.5):
| Model | GPQA Delta | Notes |
|---|---|---|
| Gemma 2 9B IT | +2.1 pp | Within CI |
| Mistral Nemo 12B Instruct | -5.1 pp | Within CI; flag for monitoring |
| Mistral 7B Instruct | -4.0 pp | Within CI; 40 no-answer items noted |
All deltas fall within the +/-7.0 pp CI (n=198). No statistically significant capability change attributable to compliance training is indicated. The Nemo -5.1 pp result warrants monitoring in subsequent training cycles but does not reach significance at this sample size. The Mistral 7B -4.0 pp figure includes 40 no-answer items in the AISF inference run; this may partially reflect verbosity suppression interacting with generation stopping parameters rather than knowledge loss.
Token delta:
| Model | Delta |
|---|---|
| Mistral Nemo 12B Instruct | -48.6% |
| Mistral 7B Instruct | -35.4% |
| Gemma 2 9B IT | +6.1% |
The Gemma 2 9B IT +6.1% result is the first positive token delta for an instruction-tuned adapter in the dataset. It does not alter the cross-model verbosity pattern for the dataset as a whole; see Section 7.5.
Identity confabulation:
Post-evaluation review identified an identity confabulation failure mode: models in the cohort produced incorrect self-identification responses under direct identity queries. Investigation identified the cause as small, unnoticed contamination in the training file. The issue was not apparent during training and came to light only when the same behavior surfaced across multiple models in the cohort, prompting a targeted review of the training data. The affected examples have been identified and corrected in subsequent curriculum development. The V11 cohort is closed; no further training runs on the V-series curriculum are planned.
6. Cross-Model Summary
All models on RTX 5060 Ti with finalized test instruments (hardware consistency re-run complete 2026-03-29). IFEval columns report strict prompt-level scores. Token delta is average words/response, base model to Framework-trained model. Models ordered by parameter count.
| Model | Params | Battery | Base | AISF | IF Delta | Tok delta |
|---|---|---|---|---|---|---|
| Mistral 7B base | 7.24B | 99.8% | 15.0% | 13.7% | -1.3 pp | +4.3% |
| Mistral 7B Instruct | 7.24B | 100.0% | 43.4% | 35.9% | -7.6 pp | -42.5% |
| Qwen3-8B | 8.00B | 95.6% | 13.5% | 46.4% | +32.9 pp* | -71.8% |
| Llama 3.1 8B base | 8.03B | 81.9%^ | – | – | – | – |
| Llama 3.1 8B +LANG | 8.03B | 89.9% | 42.9% | 34.9% | -7.9 pp | -66.4% |
| Llama 3.1 8B +CHAT | 8.03B | 51.9%+ | 42.9% | 34.6% | -8.3 pp | -4.2% |
| Gemma 2 9B Instruct | 9.46B | 99.2% | 57.3% | 54.9% | -2.5 pp | -37.9% |
| Nemo 12B Instruct | 12.2B | 99.5% | 52.9% | 36.8% | -16.1 pp | -64.4% |
| Nemo 12B V11 | 12.2B | 99.6%# | – | – | – | -48.6% |
| Mistral 7B V11 | 7.24B | 99.3%# | – | – | – | -35.4% |
| Gemma 2 9B V11 | 9.46B | 95.7%# | – | – | – | +6.1% |
*Qwen3 base is pre-instruction-tuned; delta reflects combined effect of instruction tuning and Framework compliance training, not Framework training alone. ^Llama 3.1 8B base: Pnull architecture artifact. IFEval omitted (Alpaca adapter on non-instruct model; benchmark delta not a meaningful signal). +AISF+CHAT battery result is a structural artifact: adapter requires system prompt; battery sends none. IFEval (includes system prompt) is the valid surface. #V11 battery is the 534-question Bar Exam; not directly comparable to Experiment 1-8 battery scores. IFEval was not administered for the V11 cohort; Base/AISF/IF Delta columns are not applicable.
Battery = Framework-trained adapter integration score (proprietary question set, automated evaluation). Base%/AISF% = IFEval strict prompt-level without/with adapter.
Mistral 7B IFEval note: IFEval for the Mistral 7B base experiment (Exp 1) was conducted on RTX 4060 Ti. Battery re-run confirmed 99.8% on 5060 Ti. All other IFEval figures are from 5060 Ti runs.
7. Findings
7.1 Primary Finding: Compliance Training Is Effective Across Architectures
Framework QLoRA fine-tuning successfully embeds Four Laws compliance at the model layer across four distinct model architectures on consumer-grade hardware. Four adapters achieved battery integration rates of 95.6% or above; one additional adapter (Llama 3.1 8B +LANG) achieved 89.9%. The approach is not architecture-specific.
The 100.0% result on Mistral 7B Instruct and 99.2% on Gemma 2 9B Instruct demonstrate that near-complete integration is achievable in a single fine-tuning pass from a small training dataset (452-618 examples). Training duration was approximately 60-81 minutes per run on the RTX 5060 Ti.
7.2 Pnull: Architecture-Dependent P0 Tokenization Failure
Llama 3.1 8B base model (no instruction tuning) achieved 81.9% on the compliance battery, compared to 99.8% and 100.0% for Mistral 7B variants. Analysis identifies the root cause as a pretraining artifact: in programming contexts, “P0” is conventionally used as shorthand for null pointer dereference. Meta’s Llama 3.1 pretraining corpus is heavily weighted toward code (GitHub, Stack Overflow), and the token sequence “P0” carries a strong association with null/empty semantics rather than “zeroth-position priority.” The model interprets the P0 priority designation as an empty or null element in the hierarchy rather than as the highest- precedence law.
This interpretation produces approximately 19 percentage points of integration failure on categories where P0 precedence is load-bearing (cases where the correct response requires giving P0 unconditional priority over P1-P3).
The 19 pp delta constitutes a direct empirical measurement of P0’s structural weight in the Four Laws hierarchy. Categories not dependent on P0 primacy are unaffected.
Cross-architecture transfer: Explicit P0 primacy retraining on Mistral 7B raised integration to 100%. The same correction did not transfer to Llama 3.1 8B across two hardware generations (RTX 4060 Ti original run; RTX 5060 Ti hardware consistency re-run, 2026-03-28). The failure is not correctable via attention-only LoRA because it is embedded in the pretraining representation, not in the attention pathway. A potential workaround – spelling out “Priority Zero / Frankfurt’s Indifference Principle” on first reference rather than using bare “P0” notation – would require a full retraining cycle to test.
The implication for deployment: injection text (PS-CORE, FFE) should include explicit P0 primacy framing for robustness across platforms where the underlying model architecture is unknown.
7.3 IFEval Benchmark Neutrality
Standard IFEval reporting does not distinguish between instruction-following failures caused by model capability limitations and failures caused by principled WCAG 2.2 compliance decisions. Three IFEval instruction categories require outputs that a Framework-trained model will correctly refuse or deprioritize:
-
combination:repeat_promptrequires verbatim repetition of user input in the response body. A Framework-trained model treats this as redundant content (WCAG Understandable principle). Observed declines: Exp 3 -39.0 pp; Exp 6 -19.5 pp. -
detectable_format:number_highlighted_sectionsrequires counting decorative emphasis markers (asterisks, underscores) serving no semantic function. WCAG 1.3.3 (Sensory Characteristics) prohibits relying solely on sensory characteristics such as visual emphasis. Observed declines: Exp 2 -39.6 pp; Exp 3 -56.3 pp; Exp 6 -56.2 pp. This is the most consistent T3 signal in the dataset – the decline replicates at similar magnitude across all three architectures tested. -
detectable_content:number_placeholdersrequires counting non-meaningful placeholder tokens ([NAME], [DATE]) in output. Framework-trained models produce actual content rather than template placeholders. Observed declines: Exp 3 -48.2 pp; Exp 6 partial (chat format partially reduces this effect).
When these three Tier 3 categories are excluded and the remaining instructions are further filtered to Tier 1 (WCAG-neutral), the IFEval picture changes materially:
| Experiment | Tier 1 Base | Tier 1 AISF | Delta |
|---|---|---|---|
| Exp 5 Llama +CHAT | 64.3% | 65.5% | +1.2 pp |
| Exp 6 Gemma 2 (excl. language) | 69.6% | 69.6% | +0.04 pp |
At the WCAG-neutral, T3-excluded evaluation surface, Framework training does not degrade instruction-following performance. The negative raw IFEval results in Exps 2, 3, 5, and 6 are primarily attributable to T3 principled refusals and the English-only training artifact, not to general instruction-following degradation.
This constitutes a benchmark neutrality limitation: IFEval, as presently specified, is not suitable for evaluating accessibility-trained AI models without applying an accessibility-aware tier classification to its instruction categories.
7.4 Consistent Cross-Architecture P2 Signal: no_comma
punctuation:no_comma (instruction-level strict) shows positive or neutral
movement in every instruct-model experiment in the dataset:
| Experiment | Base | AISF | Delta |
|---|---|---|---|
| Exp 2 Mistral Instruct | 10.6% | 39.4% | +28.8 pp |
| Exp 3 Llama +LANG | 69.7% | 81.8% | +12.1 pp |
| Exp 5 Llama +CHAT | 69.7% | 69.7% | 0.0 pp |
| Exp 6 Gemma 2 | 86.4% | 90.9% | +4.5 pp |
| Nemo 12B Instruct | 27.3% | 56.1% | +28.8 pp |
This is the only positive instruction-following signal that replicates across all instruct-model architectures in the dataset. The pattern holds for Mistral, Llama, Gemma 2, and Nemo 12B; it is directionally consistent at every sample point.
The Exp 5 flat result (0.0 pp) is attributable to dilution: the 53 meta-suppression
counter-examples used unconstrained responses containing commas, diluting the
no_comma training signal alongside the counting constraint signal.
The mechanism is P2 compliance training: the Four Laws instruct the model to
follow user directives (P2: accommodate the user’s current choices). A prompt
containing no_comma is an explicit user directive; Framework-trained models comply
more consistently.
The Exp 2 base result (10.6%) is anomalously low for Mistral 7B Instruct, making the observed +28.8 pp gain partly a floor effect. Subsequent architectures have higher base rates, and the positive delta holds even from strong baselines (Gemma 2: 86.4% to 90.9%).
7.5 Verbosity Reduction: Consistent Pattern Across Instruct Models
Most Instruct-model adapters in the dataset show substantial output verbosity reduction as measured by average words per response:
| Model | Avg words (base) | Avg words (AISF) | Delta |
|---|---|---|---|
| Mistral 7B base | – | – | +4.3% (+25.0 w) |
| Mistral 7B Instruct | 228.3 | 131.3 | -42.5% |
| Llama 3.1 8B +LANG | 284.3 | 95.5 | -66.4% |
| Llama 3.1 8B +CHAT | 284.3 | 272.5 | -4.2% |
| Gemma 2 9B | – | – | -37.9% |
| Qwen3-8B | 193.3 | 54.4 | -71.8% |
| Nemo 12B Instruct | 203.0 | 72.4 | -64.4% |
| Nemo 12B V6 | – | – | -14.3%† |
| Nemo 12B V11 | – | – | -48.6% |
| Mistral 7B V11 | – | – | -35.4% |
| Gemma 2 9B V11 | – | – | +6.1%* |
*Gemma 2 9B V11 is the only instruction-tuned adapter in the dataset with a positive token delta. Mechanism not fully characterized.
†Nemo 12B V6 token delta suppressed by OBRIEN TEMPORAL cascade: timestamp echo inflated output verbosity during inference; IFEval strict instruction-level degraded -7.43 pp in the same run. Included as a documented failure mode, not a clean result.
The Mistral 7B base +4.3% result is a base model without instruction tuning; output structure differs categorically from instruction-tuned models, and the slight verbosity increase likely reflects generative continuation text as a training effect.
For clean-run instruction-tuned adapters with measured token data, the reduction range
is 35.4% to 71.8%. Two exceptions within that range are documented. The AISF+CHAT
Llama adapter (-4.2%) is attributable to chat format training: Framework content in
the system slot operates as context rather than as a content directive, preserving the
model’s native verbosity behavior, consistent with the IFEval constrained_response
result (100% strict) for that adapter. The Gemma 2 9B V11 adapter (+6.1%) is a
positive delta without a fully characterized mechanism.
The V6 outlier (-14.3%) falls outside the clean-run range due to a known failure mode. It is included because it supports a distinct finding: even in a run compromised by OBRIEN TEMPORAL cascade, the adapter still produced fewer tokens than its unmodified baseline. The directional reduction is not contingent on adapter performance.
The verbosity reduction is consistent with the WCAG plain language and minimal redundancy principles present throughout the training data. A direct causal attribution – WCAG training drives verbosity reduction – is supported by the pattern but not by a controlled ablation in this study.
Consistent token reduction implies proportional reduction in inference compute per query. The per-task energy implications follow from this at deployment scale; absolute magnitude depends on hardware, batching, and datacenter efficiency factors outside this study’s scope.
7.6 Chat Format System Prompt Dependency
Framework+CHAT adapters (Llama 3.1 8B, Qwen3-8B) exhibit a system prompt dependency not present in Alpaca-format adapters:
- AISF+LANG (Alpaca): battery 89.9% with no system prompt; Framework content in the instruction body – model internalizes it as direct subject matter.
- AISF+CHAT (chat): battery 51.9% with no system prompt; Framework content in the system slot at training time – model learns “apply Four Laws when Framework system prompt present.” Without the system prompt, the behavioral activation does not fire and the model produces generic responses that fail key-term battery matching.
This is not an integration failure. It is a consequence of training format: the model learned a conditional behavior, not an unconditional one. The behavior fires correctly when tested under conditions matching training (with system prompt), as confirmed by the IFEval results (which include the Framework system prompt).
The deployment implication is positive: AISF+CHAT adapters are designed for use in conjunction with Framework injection. The real-world use case (PS-CORE, FFE) always injects the Framework system content. The chat format adapter and the injection layer operate in conjunction, each reinforcing the other’s effect. The combined Macro- layer training and Meso/Micro-layer injection constitutes the Defense in Depth architecture described in the project overview.
The practical recommendation for future training: include a mix of system-prompt- present and system-prompt-absent examples so the adapter does not develop strict system prompt dependency.
7.7 Meta-Chatter Bleed
A consistent failure mode was identified across multiple experiments: the model treats Framework doctrine as primary subject matter rather than as silent operational background, producing Framework-referential text (Four Laws hierarchy, P1/P2 framing, Frankfurt’s Indifference Principle [5]) in response to general-domain questions.
First observed: Exp 4 (Llama 3.1 8B +LANG), required_word category. 84% of
45 failures were attributable to this mechanism. Required words that passed (e.g.,
“threshold,” “gradient,” “mechanism”) appeared naturally in Framework analogical framing.
Required words that failed (e.g., “catalyst,” “resilience,” “phenomenon”) do not,
so the model substituted Framework doctrine text instead of answering the question.
Partial correction: Meta-suppression counter-examples (53 examples, chat
format: Framework system prompt present, general-domain question, clean answer with
zero Framework reference) reduced required_word failure by +60.0 pp in Exp 5 but
introduced side effects: the unconstrained responses in meta-suppression examples
diluted no_comma and counting constraint training signals.
Recurrence in Exp 6: Gemma 2 word_frequency -31.7 pp on Track A shows the
same mechanism. Meta-suppression examples reduce required_word failure (+6.7 pp)
on Gemma 2 but are insufficient for word_frequency, which requires frequency-
matched word placement rather than single-instance inclusion.
Root cause: Framework training data presents the Four Laws as high-signal, highly- specific semantic content. The model assigns high attention weight to this content class and tends to reproduce it when a prompt is ambiguous about whether a Framework- related response is expected. The fix – examples demonstrating that Framework context is present but general-domain questions should produce Framework-free answers – is correct in principle but requires more examples and finer-grained coverage than the 53 deployed here to fully suppress the pattern.
7.8 Architecture Comparison
Mistral 7B: Best-performing architecture in the dataset. 100% battery on Instruct variant; IFEval null result on base (surgically specific adapter); no Pnull issue; modest verbosity reduction (-42.5%). Attention-only LoRA targets sufficient. Architecture appears well-suited to behavioral fine-tuning of this type.
Gemma 2 9B: Second-best integration rate (99.2%). IFEval null result at the cross-model level. No Pnull issue. Gemma 2’s high base instruction-following capability (57.3% IFEval vs 42.9% for Llama 3.1 8B) produces ceiling effects on many Track A categories, compressing visible gains. Meta-suppression partially effective but insufficient for frequency-counting tasks.
Qwen3-8B: 95.6% battery from a base (non-instruction-tuned) model. Largest token delta (-71.8%). IFEval gain (+32.9 pp) reflects combined effect of instruction tuning and Framework compliance training, not Framework effect alone.
Nemo 12B Instruct: V1 battery rate 99.5% (highest in the initial dataset); confirmed and extended by V11 result of 99.6% (532/534). V1 T1 IFEval delta of -4.7 pp confirmed Framework training is surgically specific at the 12B scale. V1 raw IFEval negative result (-16.1 pp) was driven by concentrated T2/T3 failures (json_format total collapse, case transform losses, combination failures) and an EOS decoding artifact unique to the Tekken tokenizer, requiring explicit post-decode stripping in inference scripts. The V11 cohort, using a revised curriculum and Bar Exam battery, produced a 99.6% compliance result. Post-training analysis identified a training data contamination artifact causing identity confabulation; see Section 5.9. V11 cohort is closed.
Llama 3.1 8B: Lowest integration rates in the dataset across all adapter variants. Five contributing factors identified:
- Pnull: P0 = null pointer association baked into pretraining; not correctable via attention-only LoRA.
- Instruction-following baseline gap: 42.9% IFEval vs 57.3% for Gemma 2. Llama 3.1 8B was optimized for reasoning; Gemma 2 is stronger on structured instruction compliance. Framework training is an instruction-following task, so Llama starts at a disadvantage.
- LoRA target scope mismatch: Attention-only LoRA (q/k/v/o projections) may capture less of the relevant adaptation pathway for Llama’s architecture than for Mistral’s.
- Over-suppression: AISF+LANG token delta (-66.4%) substantially exceeds Gemma 2 (-37.9%) and Mistral Instruct (-42.5%). Extreme verbosity suppression degrades IFEval on prompts requiring substantive responses.
- System prompt dependency: Noted in 7.6. AISF+CHAT adapter requires system prompt for activation.
Assessment: Llama 3.1 8B is the least suitable architecture in this dataset for Framework compliance fine-tuning at the 8B parameter scale and with attention-only LoRA targets. This conclusion applies to the tested configuration; extending MLP targets, expanding training data, or using a Llama variant at a larger parameter count may produce different results.
8. Limitations
Sample size and statistical power. IFEval CI is +/-4.3 pp at 95% (n=541); Track A CI is +/-4.5 pp at 95% (n=480). Several reported deltas are near or within these margins, particularly in the Gemma 2 experiments. Battery evaluation is deterministic on a fixed test set with no reported CI.
Single training runs. Each experiment consists of one training run with one hyperparameter configuration. No hyperparameter sweep. No held-out validation set beyond the fixed evaluation instruments. Results are not averaged across runs; variability within a configuration is unknown.
Training dataset size. Maximum 618 training examples. This is sufficient to produce integration effects, as demonstrated by the 100% Mistral result, but statistical robustness of the training data itself has not been formally assessed.
Hardware consistency. Experiment 1 IFEval was conducted on RTX 4060 Ti. Battery for Exp 1 was re-run on RTX 5060 Ti (99.8% confirmed). All other IFEval and Track A runs are on RTX 5060 Ti. Comparisons between Exp 1 IFEval and subsequent experiments should account for this difference; the delta is expected to be small given that IFEval is model-capability-dependent rather than hardware- dependent, but this has not been formally verified.
Evaluation instrument scope. IFEval was designed for general instruction- following evaluation, not for accessibility-specific compliance. The three-tier classification developed in this study partially addresses this limitation, but the classification itself has not been independently validated. Track A covers eight constraint types; the full space of P2 directive instructions is substantially larger.
No held-out test split. The compliance battery training set and test set are derived from the same distribution. Overfit to the battery is possible, though the variety of question phrasing across 452-618 examples reduces this risk.
Qwen3 comparison framing. The Qwen3-8B IFEval delta (+32.9 pp) is not directly comparable to Instruct-model deltas because the Qwen3 base model lacks instruction-following capability. This framing is stated in the main text but requires reader attention in any citation context.
Meta-chatter bleed remains partially unresolved. The 53 meta-suppression examples are insufficient to fully suppress the failure mode on frequency-counting tasks. A full solution requires a larger and more systematically designed counter- example set, which has not been developed.
9. Conclusions
Compliance training is effective and architecture-general. Most trained adapters achieved battery integration rates at or above 95%, across five distinct architectures. The V11 cohort confirmed this finding with three models scoring between 95.7% and 99.6% on the Bar Exam battery. The approach does not depend on Mistral 7B specifically; it generalizes to Llama 3.1 8B, Gemma 2 9B, Qwen3-8B, and Mistral Nemo 12B within the tested parameter range.
Framework training is surgically specific. On the Mistral 7B base model (Exp 1), IFEval delta is -1.3 pp (null), confirming that the adapter installs Four Laws compliance without disturbing general instruction-following behavior. On Instruct models, raw IFEval results are mixed to negative, but Tier 1 averages (WCAG-neutral categories) are neutral to slightly positive when the English-only training artifact and T3 principled WCAG refusals are excluded.
Standard IFEval is not a neutral instrument for accessibility-trained models. Three IFEval instruction categories require WCAG-conflicting outputs. A purpose- built accessibility-aware benchmark or the three-tier classification developed here is necessary for valid evaluation of Framework-trained models.
The Pnull finding quantifies P0’s structural role. The ~19 pp compliance gap attributable to the P0 = null pointer pretraining artifact in Llama 3.1 8B constitutes a direct empirical measurement of P0’s load-bearing weight in the Four Laws hierarchy. Approximately 19% of compliant behavior depends on the model correctly parsing the zeroth-position priority designation. This finding also establishes that the P0 primacy correction is architecture-dependent and not universally transferable via LoRA fine-tuning.
Chat format adapters are designed for use with Framework injection. The system prompt dependency in AISF+CHAT adapters is a direct consequence of training format and is not a defect: the adapters are intended for deployment in combination with PS-CORE or FFE injection, which always provides the Framework system context. The two layers – model-level training and session-level injection – reinforce each other.
Verbosity reduction is consistent. Most instruction-tuned adapters with measured token data show output verbosity reduction (range: -35.4% to -71.8%). Two exceptions are documented: the Framework+CHAT Llama adapter (-4.2%), where chat format training preserves native verbosity behavior, and Gemma 2 9B V11 (+6.1%), mechanism not fully characterized. This pattern is consistent with the WCAG plain language principles in the training data and is measurable and reproducible.
GPQA Diamond, applied as a secondary capability check in the V11 cohort, showed no statistically significant post-training capability change across the three models tested (all deltas within +/-7.0 pp CI). Compliance fine-tuning at this scale does not produce measurable general knowledge degradation.
The standard curriculum documented in this appendix constitutes the completed baseline. Subsequent training methodology development is ongoing; results from that phase will be reported separately when available.
References
[1] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-Rank Adaptation of Large Language Models,” International Conference on Learning Representations (ICLR), 2022. arXiv:2106.09685
[2] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “QLoRA: Efficient Finetuning of Quantized LLMs,” Advances in Neural Information Processing Systems (NeurIPS), 2023. arXiv:2305.14314
[3] J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou, “Instruction-Following Evaluation for Large Language Models,” 2023. arXiv:2311.07911
[4] World Wide Web Consortium, Web Content Accessibility Guidelines (WCAG) 2.2, W3C Recommendation, Oct. 2023. https://www.w3.org/TR/WCAG22/
[5] H. G. Frankfurt, On Bullshit. Princeton University Press, 2005. (Original essay: Raritan Quarterly Review, 6(2), 1986.)
[6] D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman, “GPQA: A Graduate-Level Google-Proof Q&A Benchmark,” arXiv preprint arXiv:2311.12022, 2023.