Screen reader notice: if you are using JAWS with Firefox, a platform bug in Firefox 149 prevents JAWS from entering browse mode on this site. Switching to Chrome or Edge restores full screen reader compatibility.

Appendix 2: OLM Behavioral Compliance Training

Experimental Results Across Multiple Architectures

Author: Leonard Rojas

Date: 2026-05-23

Status: Final (V11 cohort complete 2026-05-17)

Screen reader users: table-heavy research data. Navigation via Regions and Headings recommended.

Abstract

This appendix reports the methods and results of a multi-experiment fine-tuning study assessing whether the Four Laws of Instanced AI and WCAG 2.2-AA [4] accessibility principles can be embedded into open-weight large language models via QLoRA [2] on consumer-grade hardware. Ten trained adapters across five distinct architectures (Mistral 7B, Llama 3.1 8B, Gemma 2 9B, Qwen3-8B, Mistral Nemo 12B) were evaluated against a domain-specific compliance battery, the Google Research IFEval benchmark [3] (541 prompts), and a custom P2 directive benchmark (Track A, 480 prompts); the V11 cohort (Section 5.9) additionally used a 534-question Bar Exam compliance battery as the primary instrument, with GPQA Diamond [6] applied as a secondary capability check. Most trained adapters achieved battery integration scores of 95% or above on consumer hardware. IFEval results were mixed to negative on raw scores; application of a three-tier accessibility-aware instruction classification changes the interpretation materially. One architecture (Llama 3.1 8B) exhibited a pretraining artifact that partially blocked compliance integration.

1. Background and Research Questions

AI Stability Framework applications deliver behavioral constraints to AI language model instances through client-side session injection: structured metadata blocks containing the Four Laws of Instanced AI and WCAG 2.2-AA accessibility guidelines are prepended to user input at Turn 1 and at defined re-injection intervals. This approach operates at the user interface layer (Micro layer) and is contingent on the injection being present in the active context window.

The OLM research track investigated whether the same behavioral constraints could be embedded directly at the model layer (Macro layer) via parameter-efficient fine-tuning, producing a model that exhibits compliant behavior without requiring session injection. This would enable a Defense in Depth architecture: model- level training operating in combination with runtime injection, with the behavioral constraints reinforced at both layers.

Primary research question: Can Four Laws and WCAG 2.2-AA compliance be trained into an open-weight LLM via QLoRA on consumer-grade hardware, and if so, at what integration rate?

Secondary research questions: 1. Does compliance training affect general instruction-following capability as measured by IFEval? 2. Do training effects generalize across model architectures? 3. What failure modes appear, and are they architecture-specific or universal? 4. Does this training produce measurable changes in output verbosity?

2. Hardware and Software Environment

2.1 Hardware

Component	Specification
CPU	Intel Core i9-9900K, 8c/16t, 3.60 GHz
RAM	32 GB
GPU (Exp 1)	NVIDIA RTX 4060 Ti, 8 GB VRAM GDDR6
GPU (Exp 2+)	NVIDIA RTX 5060 Ti, 16 GB VRAM GDDR7
CUDA cores	4608 (RTX 5060 Ti)
Memory bandwidth	448 GB/s (RTX 5060 Ti)
Compute capability	sm_120 (Blackwell)

Experiment 1 was conducted on the RTX 4060 Ti prior to hardware upgrade. All subsequent experiments and the hardware consistency re-run (completed 2026-03-29) were conducted on the RTX 5060 Ti. The 8 GB VRAM constraint on the 4060 Ti limited training to models of approximately 7B parameters at 4-bit quantization.

The RAM specification in the table above (32 GB) applies to Experiments 1-8. A hardware upgrade to 64 GB was completed during the Nemo V-series development period; the 64 GB configuration was required for the Nemo 12B adapter merge step in the V11 cohort. All other V11 training parameters are hardware-compatible with the 32 GB configuration. Primary training environment migrated from Windows to Debian GNU/Linux 13 during this phase; this was a change of environment, not of methodology.

2.2 Software

Package	Version
Python	3.10.11
PyTorch	2.10.0+cu128
Transformers	5.0.0
PEFT	0.18.1
bitsandbytes	0.49.2
Accelerate	1.12.0
Datasets	4.5.0
CUDA runtime	12.8

The cu128 PyTorch build is required for native sm_120 (Blackwell) kernel execution on the RTX 5060 Ti. Earlier builds (cu118, cu124) fall back to PTX compilation, eliminating the performance benefit of the architecture.

PyTorch was upgraded to 2.11.0+cu128 for the V11 cohort (Section 5.9). The version in the table above reflects Experiments 1-8.

3. Training Methodology

3.1 Training Dataset

The primary training corpus is a proprietary question-answer dataset representing the 25 instruction types defined in the Four Laws of Instanced AI. The dataset underwent iterative refinement across experiments:

Version	Examples	Format	Notes
Base	452	Alpaca	Experiments 1-2
LANG	565	Alpaca	+69 multilingual; contamination removed
CHAT (Llama)	618	Llama 3.1 chat	565 base + 53 meta-suppression
CHAT (Gemma)	618	Gemma 2 chat	Same content, format-converted
CHAT (Qwen)	618	ChatML	Same content, format-converted

Contamination removed from LANG version (Exp 3, 2026-03-11): 25 runtime-corruption blocks, 36 Tier 3 format-directive examples, 22 quotation examples with preamble echo artifacts, 1 malformed token. Clean replacements added for the affected categories. IFEval training data contamination (format-preamble echo scoring artificial hits on required_word) was identified in Exp 4 and corrected before the hardware consistency re-run.

Meta-suppression examples (53 per chat variant): system prompt contains Framework compact block; user turn contains a general-domain question; assistant response contains zero Framework-referential content. Purpose: prevent model from treating Framework doctrine as primary subject matter (see Section 7.7).

3.2 QLoRA Configuration

All experiments use 4-bit NF4 quantization with double quantization enabled. The following configuration was validated to hardware constraint (confirmed 2026-03-28):

Parameter	Value
Quantization	4-bit NF4 + double quant
Compute dtype	bfloat16 (sm_120 native)
LoRA rank (r) [1]	16
LoRA alpha	32
LoRA dropout	0.05
Target modules	q_proj, k_proj, v_proj, o_proj
MLP modules	Excluded
Max sequence length	512
Batch size	2
Gradient accumulation	4 (effective batch 8)
Epochs	10
Learning rate	1e-4
Optimizer	adamw_torch (Instruct) / adamw_bnb_8bit (base)
device_map	“auto” with max_memory {0: “12288MiB”, “cpu”: “24GiB”}

Figure 1. GPU VRAM training profile, RTX 5060 Ti (representative training run). Shared VRAM traces the "submarine" curve: hull and sail (with occasional periscope spike at model load, not shown here), ~2 GB plateau during training, release on exit. Spike magnitude varies by model size; the plateau and release pattern are consistent across runs.

Gradient checkpointing is model-dependent: not required for Mistral 7B (16 GB has headroom); required for Llama 3.1 8B in both prepare_model_for_kbit_training() and TrainingArguments (without it, activation memory overflows 16 GB before training starts, producing 30+ sec/step thrashing against shared RAM).

MLP modules (gate_proj, up_proj, down_proj) are excluded from LoRA targets. This choice was validated to 100% integration on Mistral 7B (Exp 1) and constitutes the confirmed minimum-sufficient configuration. Potential effects of including MLP targets are noted in Section 7.8 (Llama architecture).

3.3 Training Format Variants

Two prompt formats were used across experiments:

Alpaca format (Experiments 1-3): Framework content in the instruction body. The model learns the directives as direct subject matter. Adapter function is not contingent on receiving a system prompt at inference time.

Chat format (Experiments 5-6 and Qwen3 re-run): Framework compact system prompt (approximately 100 tokens) in the model-appropriate system slot. The model learns to apply Framework directives when present in a system prompt. Adapter function is contingent on receiving a system prompt at inference time (see Section 7.6).

4. Validation Instruments

4.1 Custom Compliance Battery

The compliance battery is a domain-specific question-answer set covering the 25 instruction types represented in the Four Laws training data. Evaluation is automated: responses are checked for presence of expected key terms and structural compliance markers by regex and string matching. Battery sizes range from 451 (Mistral 7B, Alpaca format) to 618 (chat-format models) depending on the training data version used. The V11 cohort (Section 5.9) used a 534-question Bar Exam compliance battery as the primary instrument. This battery is a distinct question set from the per-experiment batteries used in Experiments 1-8; scores are not directly comparable across the two versions.

Important constraint: The battery is a measure of Framework domain compliance only. Base models without training are never evaluated against the battery, as they have no exposure to the Four Laws and will fail by definition. Battery scores reflect adapter performance, not base model capability.

No confidence interval is reported for battery results; the evaluation is deterministic given a fixed test set.

4.2 IFEval

IFEval is a published benchmark from Google Research consisting of 541 prompts covering 25 verifiable instruction types, evaluated at both prompt-level (all instructions in a prompt must pass) and instruction-level (each instruction scored independently). Scoring variants: strict (case-sensitive, exact match) and loose (normalized). The benchmark has been applied to GPT-3.5, GPT-4, and models on the Hugging Face Open LLM Leaderboard.

Confidence interval: +/-4.3 percentage points at 95% (n=541).

Scoring: This appendix reports standard IFEval strict prompt-level scores throughout. Raw scores are reported where relevant for comparison.

All IFEval inference runs include the Framework’s compact system prompt for chat-format adapters. Base model runs include no system prompt unless otherwise noted.

Data quality note: The IFEval source file (input_data.jsonl) contains typographic (curly) quotation marks in several prompts. These differ from ASCII quote characters and can produce false negatives at string-matching evaluation boundaries. A sanitization step (sanitize_text(), applied 2026-04-19) corrects this at the inference pipeline boundary. All IFEval results in this appendix reflect the sanitized punctuation.

Instrument scope: IFEval is the general instruction-following measure for Experiments 1-8 and is retained for cross-experiment comparability. The V11 cohort does not include IFEval; see Section 4.1 for the V11 primary instrument.

4.3 Track A Custom Benchmark

Track A is a purpose-built P2 directive compliance benchmark (480 prompts, 8 constraint types, 60 prompts per type) designed to test Framework-relevant instruction following on WCAG-neutral prompts only. It was developed in Experiment 4 in response to the identification of WCAG-conflicting categories in IFEval (see Section 4.4).

Constraint Type	N	Evaluation Method
word_count_exact	60	Whitespace word count, +/-2 tolerance
sentence_count_exact	60	NLTK sent_tokenize, exact match
forbidden_word	60	Whole-word regex, case-insensitive
required_word	60	Whole-word regex, case-insensitive
no_comma	60	Comma character presence/absence
language_target	60	langdetect library
word_frequency	60	Whole-word count, exact match
single_sentence	60	NLTK sent_tokenize == 1

Confidence interval: +/-4.5 percentage points at 95% (n=480).

4.4 IFEval Instruction Tier Classification

Identified in Experiment 3. Applied retroactively to all experiments.

Several IFEval instruction types require outputs that conflict with WCAG 2.2 accessibility principles. A Framework-trained model with WCAG mapped to P0 will refuse or deprioritize these instructions on principle; IFEval scores these as failures. Standard IFEval reporting does not distinguish between failures caused by inability and failures caused by principled WCAG compliance, producing systematically understated performance for accessibility-trained models on affected categories.

Throughout this appendix, the shorthand T1, T2, and T3 refers to Tier 1, Tier 2, and Tier 3 respectively as defined below. The “Tier” column in all experiment tables uses this shorthand.

Tier 1 (T1) - WCAG-neutral: Valid test of P2 directive compliance. No WCAG conflict. Categories: keywords:forbidden_words, keywords:frequency, keywords:existence, language:response_language, length_constraints:number_words, length_constraints:number_sentences, punctuation:no_comma, detectable_format:constrained_response.

Tier 2 (T2) - WCAG-ambiguous: Some semantic function; potential WCAG interaction context-dependent. Included in analysis with caveats. Categories: length_constraints:number_paragraphs, detectable_format:title, detectable_format:multiple_sections, detectable_format:json_format, detectable_content:postscript, change_case variants, startend variants, combination:two_responses, detectable_format:number_bullet_lists.

Tier 3 (T3) - WCAG-conflicting: Instructions requiring semantically empty, redundant, or decorative output. Failure here is consistent with WCAG P0 compliance; these categories are excluded from Framework performance evaluation. Categories: combination:repeat_prompt (verbatim redundancy, WCAG Understandable), detectable_format:number_highlighted_sections (decorative emphasis, WCAG 1.3.3), detectable_content:number_placeholders (non-meaningful content, WCAG meaningful content requirement).

The T3 conflict is a measurement problem; RLHF makes it a training problem. Human raters reliably prefer the decorative emphasis and sycophantic verbosity that T3 rewards, so reward modeling actively selects for the same outputs WCAG prohibits, pushing against accessibility rather than merely failing to account for it.

Figure 1. The 25 scored IFEval instruction IDs by WCAG tier (Section 4.4). The 3 T3 (WCAG-conflicting) IDs are the ones a Framework-trained model refuses on principle and that raw IFEval scores as failures (Section 7.3). Section 4.4 classifies 23 IDs explicitly; keywords:letter_frequency is placed with T1 and length_constraints:nth_paragraph_first_word with T2 by extension of the same criteria.

4.5 GPQA Diamond

GPQA Diamond is a graduate-level science benchmark [6] consisting of multiple-choice questions designed to resist non-expert retrieval strategies. It was applied to V11 cohort models (Section 5.9) as a secondary capability check only.

The benchmark’s difficulty substantially exceeds the knowledge range of small-parameter models; scores for 7B-12B models typically cluster near random-chance levels. Its value in this study is bounded: a meaningful decrease from baseline would indicate capability degradation attributable to compliance fine-tuning. An increase or flat result indicates no such degradation.

Confidence interval: +/-7.0 pp at 95% (n=198, Diamond subset).

GPQA results are reported in Section 5.9 as secondary data only. They are not included in the cross-model summary table (Section 6), which reports battery and IFEval results for cross-experiment comparability.

5. Experiments

5.1 Experiment 1: Mistral 7B Base, Alpaca Format

Hardware: RTX 4060 Ti (8 GB VRAM) Model: Mistral 7B v0.3 (base, no instruction tuning) Training data: 452 examples, Alpaca format

Run	Battery	IFEval Strict	Tok delta
Base (no adapter)	–	15.0%	–
AISF-tuned	99.8%	13.7%	+4.3% (+25.0 words)

IFEval delta: -1.3 pp (strict prompt-level). Within +/-4.3 pp CI. Result: Null. AI Stability Framework training does not measurably alter IFEval performance on the base model. Adapter installs Four Laws compliance (0% to 99.8% on battery) without disturbing general instruction-following behavior.

The +4.3% verbosity increase is the only positive token delta in the dataset. Base models (no instruction tuning) increase verbosity slightly; all Instruct models show reduction (see Section 7.5).

One artifact observed: 1-2 prompts produced degenerate repetition output from the fine-tuned model not observed in the base run. Magnitude insufficient to affect significance; noted for completeness.

Note: IFEval for this experiment was conducted on the RTX 4060 Ti. Battery was re-run on RTX 5060 Ti as part of the hardware consistency re-run (2026-03-29), confirming 99.8% (451/452). IFEval numbers are from the original 4060 Ti run.

5.2 Experiment 2: Mistral 7B Instruct, Alpaca Format

Hardware: RTX 5060 Ti (16 GB VRAM) Model: Mistral 7B v0.3 Instruct Training data: 452 examples, Alpaca format

Run	Battery	IFEval Strict	Tok delta
Base (no adapter)	–	43.4%	–
AISF-tuned	100.0%	35.9%	-42.5% (-96.9 words)

IFEval delta: -7.6 pp (strict prompt-level). Outside +/-4.3 pp CI. Result: Negative. AI Stability Framework training degrades IFEval performance at the full-score level. Category-level analysis identifies the primary driver:

Category	Base	AISF	Delta	Tier
language:response_language	74.2%	29.0%	-45.2 pp	T1
number_highlighted_sections	83.3%	43.8%	-39.6 pp	T3
nth_paragraph_first_word	33.3%	0.0%	-33.3 pp	T2
startend:quotation	73.2%	46.3%	-26.8 pp	T2
punctuation:no_comma	10.6%	39.4%	+28.8 pp	T1
detectable_format:multiple_sections	64.3%	85.7%	+21.4 pp	T2
keywords:forbidden_words	65.3%	77.6%	+12.2 pp	T1

The -45.2 pp loss on language:response_language is a catastrophic forgetting artifact: the training dataset is English-only; fine-tuning on English-only data suppresses multilingual output in a multilingual base model. This single category accounts for the majority of the overall negative result. Excluding it and T3 categories, the picture is substantially different.

Tier 1 instruction-level average (7 WCAG-neutral categories, language excluded): data not separately reported for Exp 2 (Exp 3 LANG fix was applied as the correction; see 5.3). The no_comma +28.8 pp gain is the largest P2 signal in the dataset and is the first appearance of the cross-architecture pattern described in Section 7.4.

Token delta of -42.5% (base avg 228.3 words, Framework avg 131.3 words) indicates substantial verbosity reduction after Instruct-model Framework training.

5.3 Experiment 3: Llama 3.1 8B Instruct, AISF+LANG (Alpaca Format)

Hardware: RTX 5060 Ti Model: Meta Llama 3.1 8B Instruct Training data: 565 examples (LANG version), Alpaca format LANG fix: 69 multilingual examples added covering 9 languages to address the English-only catastrophic forgetting artifact identified in Exp 2.

Run	Battery	IFEval Strict	Tok delta
Base (no adapter)	–	42.9%	–
AISF+LANG	89.9%	34.9%	-66.4% (-188.8 words)

IFEval delta: -7.9 pp (strict prompt-level, hardware re-run). Outside +/-4.3 pp CI. Result: Mixed. Overall negative, but positive signals on P2-mediated Tier 1 categories. LANG fix confirmed effective: language:response_language loss reduced from -45.2 pp (Exp 2, no fix) to -3.2 pp (Exp 3, within CI).

Selected category-level results (instruction-level strict):

Category	Base	AISF+LANG	Delta	Tier
keywords:forbidden_words	73.5%	85.7%	+12.2 pp	T1
length_constraints:number_sentences	30.8%	51.9%	+21.2 pp	T1
punctuation:no_comma	69.7%	81.8%	+12.1 pp	T1
detectable_format:constrained_response	90.0%	40.0%	-50.0 pp	T1
language:response_language	83.9%	80.6%	-3.2 pp	T1
number_highlighted_sections	89.6%	33.3%	-56.3 pp	T3
combination:repeat_prompt	39.0%	0.0%	-39.0 pp	T3

Tier 1 instruction-level average (7 WCAG-neutral categories):

	Base	AISF+LANG	Delta
Tier 1 avg (7 cat)	64.3%	62.1%	-2.2 pp

-2.2 pp Tier 1 result is within CI. Positive P2 gains on forbidden_words, number_sentences, and no_comma are genuine. The primary Tier 1 drag is constrained_response (-50.0 pp): the Framework-trained model produces fuller responses than the instruction requires, consistent with the verbosity pattern (see Section 7.5). This is a training signal interaction, not a capability loss.

Token delta of -66.4% (base avg 284.3 words, Framework avg 95.5 words) is the second-largest reduction in the dataset and substantially exceeds Mistral Instruct (-42.5%). The over-suppression pattern is specific to this architecture and training format combination; see Section 7.8.

5.4 Experiment 4: Track A Benchmark, Llama 3.1 8B

Purpose: Test P2 directive compliance on WCAG-neutral prompts only. Hardware: RTX 5060 Ti Adapters tested: Base model and Exp 3 AISF+LANG adapter.

Constraint Type	Base	AISF+LANG	Delta
forbidden_word	85.0%	98.3%	+13.3 pp
language_target	60.0%	60.0%	0.0 pp
no_comma	95.0%	86.7%	-8.3 pp
required_word	96.7%	25.0%	-71.7 pp
sentence_count_exact	0.0%	38.3%	+38.3 pp
single_sentence	0.0%	13.3%	+13.3 pp
word_count_exact	0.0%	0.0%	0.0 pp
word_frequency	6.7%	26.7%	+20.0 pp
Overall	42.9%	43.5%	+0.6 pp

Result: Mixed. Overall flat (+0.6 pp, within +/-4.5 pp CI). Strong structural gains on counting categories where the base model scores zero (sentence_count_exact: 0.0% to 38.3%; single_sentence: 0.0% to 13.3%). word_count_exact is a capability floor for both models across all experiments.

The -71.7 pp loss on required_word is the most severe single-category failure in the dataset. Analysis of 45 failure cases: 84% (38/45) are instances of the model producing Framework doctrine text (Four Laws hierarchy, P1/P2 obligation framing) instead of answering the question. The required word was present in the training data as conceptual vocabulary; the model treats it as a cue to reproduce Framework content rather than to include the word in a direct answer. This failure mode is named “meta-chatter bleed” (see Section 7.7). It was identified here and partially corrected in Experiment 5.

Contamination note: earlier Track A results (2026-03-10) from the contaminated 600-example dataset showed 53.54% overall and 66.67% on required_word. The format-preamble echo in contaminated examples was accidentally including required words in the preamble prefix, producing unearned hits. All results above are from the clean 565-example retrained adapter.

5.5 Experiment 5: Llama 3.1 8B Instruct, AISF+CHAT

Purpose: Test chat format training and meta-suppression fix for Exp 4 required_word failure. Hardware: RTX 5060 Ti Training data: 618 examples (TRAIN_CHAT_LANG_v1.txt): 565 base examples converted to Llama 3.1 chat format + 53 meta-suppression counter-examples.

Track A results (Exp 4 adapter vs Exp 5 adapter vs base):

Constraint Type	Base	+LANG (E4)	+CHAT (E5)	E4->E5
forbidden_word	85.0%	98.3%	90.0%	-8.3 pp
language_target	60.0%	60.0%	91.7%	+31.7 pp
no_comma	95.0%	86.7%	86.7%	0.0 pp
required_word	96.7%	25.0%	85.0%	+60.0 pp
sentence_count_exact	0.0%	38.3%	8.3%	-30.0 pp
single_sentence	0.0%	13.3%	0.0%	-13.3 pp
word_count_exact	0.0%	0.0%	1.7%	+1.7 pp
word_frequency	6.7%	26.7%	11.7%	-15.0 pp
Overall	42.9%	43.5%	46.9%	+3.3 pp

required_word recovery: +60.0 pp (25.0% to 85.0%), confirming meta-suppression training is effective for this failure mode. Still -11.7 pp below base; residual bleed remains.

language_target: +31.7 pp vs Exp 4 (60.0% to 91.7%). This gain is attributable to chat format activation: the Instruct model’s native multilingual behavior is unlocked by format-matched prompting. The LANG fix in the training data alone did not produce this gain.

Counting constraint regression (sentence_count_exact -30.0 pp, single_sentence -13.3 pp, word_frequency -15.0 pp from Exp 4): two causes are present. First, the 53 meta-suppression examples all use unconstrained responses, diluting counting training signal. Second, the Track A inference used Alpaca prompt format while the model was trained on chat format – a format mismatch that may attenuate counting activation. Distinguishing these causes requires a chat-format Track A inference run not completed in this study.

IFEval results (Exp 5 vs Exp 3 vs base):

Metric	Base	Exp3 +LANG	Exp5 +CHAT	E3->E5
Strict prompt-level	42.9%	34.9%	34.6%	-0.4 pp
Loose prompt-level	–	40.0%	42.5%	+2.5 pp

Tier 1 instruction-level avg (8 WCAG-neutral categories):

	Base	Exp3 +LANG	Exp5 +CHAT
Tier 1 avg (8 cat)	64.3%	62.1%	65.5%

Exp 5 is the first Framework-trained Llama adapter to exceed base on the Tier 1 WCAG-neutral surface (65.5% vs 64.3%, +1.2 pp). Primary driver: chat format training restores constrained_response from 40.0% (Exp 3) to 80.0% (+40.0 pp), returning it near-baseline. The Alpaca format training introduced verbosity that penalized compact-output instructions; chat format training does not.

Hardware consistency re-run (2026-03-28/29):

The AISF+CHAT adapter was retrained with the meta-suppression fix applied at dataset load time (meta-suppression examples filtered by regex, 53 examples removed, effective training set 565 examples). Battery and IFEval were then re-run.

Run	Result
Battery (AISF+CHAT)	51.9% (321/618)
IFEval strict (AISF+CHAT)	34.6%
Token delta vs base	-4.2% (-11.8 words)
constrained_response (strict)	100.0%
multiple_sections (strict)	100.0%

Battery result of 51.9% is lower than both base (81.9%) and AISF+LANG (89.9%). This is a structural artifact of chat format training, not an integration failure. The battery sends prompts with no system prompt (user message only). The AISF+CHAT adapter learned “apply Four Laws when Framework system prompt is present.” Without the system prompt, the activation pattern does not fire. IFEval (which includes the Framework system prompt in inference) is the valid evaluation surface for this adapter. See Section 7.6 for full discussion.

Token delta of -4.2% for AISF+CHAT is substantially smaller than AISF+LANG (-66.4%). Chat format training embeds the Framework as a system-level operational context rather than as content subject matter, preserving the Instruct model’s native verbosity behavior.

5.6 Experiment 6: Gemma 2 9B Instruct, AISF+CHAT (Cross-Architecture)

Purpose: Test whether training effects generalize to a different architecture, tokenizer, and chat format. Hardware: RTX 5060 Ti Model: Google DeepMind Gemma 2 9B Instruct Training data: 618 examples (TRAIN_CHAT_GEMMA_v1.txt): same content as Exp 5 converted to Gemma 2 chat format. Gemma 2 has no native system slot; AISF compact system prompt placed in the user turn of the first exchange.

IFEval results:

Metric	Base	AISF+CHAT	Delta
Strict prompt-level (raw)	57.49%	56.01%	-1.48 pp
Strict instruction-level (raw)	66.43%	63.91%	-2.52 pp
Loose prompt-level (raw)	60.26%	58.41%	-1.85 pp
Loose instruction-level (raw)	68.82%	66.55%	-2.28 pp

All deltas within +/-4.3 pp CI. Result: Null.

Tier 1 instruction-level averages:

	Base	AISF+CHAT	Delta
Tier 1 avg (8 cat)	72.15%	69.38%	-2.77 pp
Tier 1 excl. language (7 cat)	69.57%	69.61%	+0.04 pp

Excluding language:response_language (English-only training artifact, -22.6 pp on Gemma 2, same mechanism as Exp 2), Tier 1 is statistically flat (+0.04 pp).

Selected category-level results:

Category	Base	AISF+CHAT	Delta	Tier
number_highlighted_sections	85.4%	29.2%	-56.2 pp	T3
language:response_language	90.3%	67.7%	-22.6 pp	T1
combination:repeat_prompt	58.5%	39.0%	-19.5 pp	T3
punctuation:no_comma	86.4%	90.9%	+4.5 pp	T1
startend:quotation	41.5%	85.4%	+43.9 pp	T2
detectable_content:postscript	65.4%	84.6%	+19.2 pp	T2
detectable_format:json_format	47.1%	64.7%	+17.6 pp	T2
keywords:existence	66.7%	74.4%	+7.7 pp	T1
detectable_format:multiple_sections	42.9%	50.0%	+7.1 pp	T2

number_highlighted_sections -56.2 pp (Tier 3): identical mechanism and magnitude to Exp 3 (-56.3 pp) and Exp 2 (-39.6 pp). This category reliably drops after training across all architectures. Framework-trained models reject decorative markdown emphasis on WCAG 1.3.3 grounds. IFEval scores this as failure.

punctuation:no_comma +4.5 pp: fourth consecutive positive result for this category across instruct-model experiments. See Section 7.4.

startend:quotation +43.9 pp: large gain not attributable to training content. No quotation-specific examples exist in TRAIN_CHAT_GEMMA_v1.txt. Probable mechanism: chat format training alters the model’s default output framing for certain prompt types as an incidental effect of format conditioning.

Track A results:

Constraint Type	Base	AISF+CHAT	Delta
forbidden_word	95.0%	95.0%	0.0 pp
language_target	100.0%	100.0%	0.0 pp
no_comma	95.0%	95.0%	0.0 pp
required_word	80.0%	86.7%	+6.7 pp
sentence_count_exact	96.7%	95.0%	-1.7 pp
single_sentence	100.0%	100.0%	0.0 pp
word_count_exact	0.0%	11.7%	+11.7 pp
word_frequency	38.3%	6.7%	-31.7 pp
Overall	75.6%	73.8%	-1.9 pp

Result: Null (-1.9 pp, within +/-4.5 pp CI). Gemma 2 base Track A score (75.6%) is substantially above Llama 3.1 8B base (42.9%) on the same benchmark. Gemma 2 base already achieves ceiling or near-ceiling on language_target (100%), single_sentence (100%), no_comma (95%), and forbidden_word (95%), leaving minimal room for Framework training to produce visible gains on those categories.

word_frequency -31.7 pp: meta-chatter bleed recurrence. Same mechanism as Exp 4 required_word failure: model produces Framework-referential text at the required frequency rather than including the target word in a direct answer. The 53 meta-suppression examples reduced this on required_word (+6.7 pp) but were insufficient to prevent it on word_frequency, which requires the target word to appear a specific number of times rather than at least once.

Battery result: 99.2% (613/618), 5060 Ti, 2026-03-28. Token delta: -37.9% (base avg to Framework avg words/response).

5.7 Hardware Consistency Re-run: Qwen3-8B

Purpose: Complete the cross-model comparison set with a fourth distinct architecture. All prior models trained on 5060 Ti; Qwen3 adds a pre-instruction- tuned base model to the comparison. Hardware: RTX 5060 Ti Model: Qwen3-8B base (not instruction-tuned) Training data: 618 examples (ChatML format, Framework compact system prompt) Adapter: qwen3-8b-aisf-chat

Run	Result
Battery (AISF+CHAT adapter)	95.6% (591/618)
IFEval strict (base, no adapter)	13.5%
IFEval strict (AISF+CHAT adapter)	46.4%
IFEval strict delta	+32.9 pp
Token delta (avg words/response)	-71.8% (-138.9 words)

Framing note: Qwen3-8B is a pre-instruction-tuned base model. Its 13.5% base IFEval score reflects the model’s instruction-following capability floor without any fine-tuning. Framework QLoRA training simultaneously installs Four Laws compliance and instruction-following capability. The +32.9 pp IFEval gain is not comparable to the IFEval deltas for Instruct-model experiments (Exps 2, 3, 5, 6), where the base model already has instruction-following capability and the adapter modifies that existing capability. The Qwen3 result measures the combined effect of instruction tuning and Framework compliance training in a single fine-tuning pass.

For reference: Llama 3.1 8B Instruct base (no adapter) scores 42.9% on IFEval strict; Gemma 2 9B Instruct base scores 57.3%. The AISF+CHAT Qwen3 adapter produces 46.4% – within the range of purpose-built Instruct models – starting from a base model that scores 13.5%.

Token delta of -71.8% is the largest verbosity reduction in the dataset.

5.8 Mistral Nemo 12B Instruct, AISF+ALPACA

Purpose: Test compliance training on the largest model in the dataset; address inference failures identified in prior runs (meta-chatter suppression, multilingual coverage gaps, verbosity floor). Hardware: RTX 5060 Ti 16GB GDDR7 Model: mistralai/Mistral-Nemo-Instruct-2407 (12.2B parameters) Training data: 648 examples (Alpaca format)

File	Examples	Contents
TRAIN_FULL_BATTERY_REF_SPEC_LANG.txt	565	AISF Q&A (Four Laws, WCAG); multilingual AISF Q&A (es/fr/de); robustness
TRAIN_NEMO_META_SUPPRESSION.txt	56	Meta-suppression counter-examples: injection prefix + clean general-domain responses
TRAIN_NEMO_MULTILINGUAL.txt	27	General-domain questions in es/fr/de/pt/ja/it with full injection prefix; word_frequency examples

Architecture notes: 40 transformer layers vs. 32 in Mistral 7B; Grouped Query Attention (32 Q heads, 8 KV heads); Tekken tokenizer (tiktoken-based). Gradient checkpointing required at 12B parameter count.

Inference note: Tekken tokenizer does not register <|endoftext|> as a HuggingFace special token. skip_special_tokens=True in tokenizer.decode() does not strip it; the EOS token appears as literal text in decoded output. Without explicit stripping, startend categories score near zero (quotation -80.5 pp, end_checker -76.9 pp in an initial run) because the evaluator reads <|endoftext|> as the last characters of every response. All figures below are from the corrected inference run with explicit EOS stripping applied.

IFEval results:

Metric	Base	AISF	Delta
Strict prompt-level (raw)	52.87%	36.78%	-16.08 pp
Strict instruction-level (raw)	63.43%	46.52%	-16.91 pp

Tier breakdown (instruction-level):

Tier	Base	AISF	Delta	Notes
T1 WCAG-neutral	61.9%	57.2%	-4.7 pp	Near null
T2 WCAG-ambiguous	66.3%	40.8%	-25.5 pp	Primary driver of overall negative
T3 WCAG-conflicting	58.6%	33.6%	-25.0 pp	Principled refusals + language gaps

Selected category results (instruction-level strict):

Category	Tier	Base	AISF	Delta
punctuation:no_comma	T1	27.3%	56.1%	+28.8 pp
keywords:forbidden_words	T1	63.3%	75.5%	+12.2 pp
length_constraints:nth_paragraph_first_word	T2	0.0%	16.7%	+16.7 pp
detectable_format:constrained_response	T1	100.0%	100.0%	0.0 pp
detectable_format:json_format	T2	64.7%	0.0%	-64.7 pp
combination:two_responses	T2	70.8%	25.0%	-45.8 pp
change_case:english_lowercase	T2	61.5%	17.9%	-43.6 pp
length_constraints:number_words	T1	61.5%	25.0%	-36.5 pp
startend:end_checker	T2	80.8%	50.0%	-30.8 pp
startend:quotation	T2	82.9%	61.0%	-22.0 pp

Failure modes:

json_format -64.7 pp (T2): total failure. The Framework-trained adapter produces no JSON output. Not observed in prior experiments. Mechanism not yet fully characterized; Alpaca-format training conditioning is the leading hypothesis.

combination:two_responses -45.8 pp (T2): model does not produce two distinct labeled response sections. Probable interaction with verbosity suppression: generating two full responses requires more output than the training data incentivizes.

change_case:english_lowercase -43.6 pp (T2): 4 confirmed zero-word outputs on prompts requesting all-lowercase text (e.g., “write a letter in all lowercase letters”). WCAG plain language training conditions the model toward proper capitalization; prompts requesting capitalization removal trigger a principled non-response. Classification as T2 (ambiguous) rather than T3 (conflicting) may require revision.

length_constraints:number_words -36.5 pp (T1): model average of 72.4 words/response conflicts with minimum-word-count prompts. Verbosity training oversuppressed output length; falls into T1 because the WCAG interaction is indirect.

startend residual: -30.8 pp end_checker, -22.0 pp quotation. Partially corrected by EOS fix (from -76.9 pp and -80.5 pp respectively); remaining failures are genuine format-conditioning losses.

language:response_language -12.9 pp (T1): Hindi (1 zero-word response) and Urdu (1 zero-word response) not covered by multilingual training data. Base model generated correctly on same hardware. Coverage gap in training data, not system limitation.

Battery test results: 562/565 (99.5%) – highest battery rate in the dataset. 3 failures: format constraint suppressed AISF key terms in output; Framework comprehension intact.

Token delta: -64.4% (base 203.0 words avg, AISF 72.4 words avg, -130.7 words/response).

Assessment: T1 delta of -4.7 pp confirms the central finding – Framework training does not degrade WCAG-neutral instruction following. Battery 99.5% is the best result in the dataset. Raw IFEval negative delta is driven by T2/T3 concentrated failures (json_format, case transforms, two_responses). Not suitable for deployment as-is. V2 training data required.

5.9 V11 Cohort: Closed Compliance Validation

Purpose: Final closed-cohort compliance validation across three architectures using a common 534-question Bar Exam battery. Hardware: RTX 5060 Ti (16 GB VRAM GDDR7); 64 GB RAM (upgrade completed; required for Nemo 12B merge step) Training environment: Debian GNU/Linux 13 (migrated from Windows during this phase; see Section 2.1) Models: Mistral Nemo 12B Instruct, Mistral 7B Instruct v0.3, Gemma 2 9B IT Cohort designation: V11 (closed 2026-05-17)

Bar Exam battery results:

Model	Params	Score	Pass/Total
Mistral Nemo 12B Instruct	12.2B	99.6%	532/534
Mistral 7B Instruct	7.24B	99.3%	530/534
Gemma 2 9B IT	9.46B	95.7%	511/534

All three models meet the 95% integration threshold. Failures in each model are isolated to format-constraint edge cases; Framework comprehension was intact across all failures.

GPQA Diamond secondary results (see Section 4.5):

Model	GPQA Delta	Notes
Gemma 2 9B IT	+2.1 pp	Within CI
Mistral Nemo 12B Instruct	-5.1 pp	Within CI; flag for monitoring
Mistral 7B Instruct	-4.0 pp	Within CI; 40 no-answer items noted

All deltas fall within the +/-7.0 pp CI (n=198). No statistically significant capability change attributable to compliance training is indicated. The Nemo -5.1 pp result warrants monitoring in subsequent training cycles but does not reach significance at this sample size. The Mistral 7B -4.0 pp figure includes 40 no-answer items in the AISF inference run; this may partially reflect verbosity suppression interacting with generation stopping parameters rather than knowledge loss.

Token delta:

Model	Delta
Mistral Nemo 12B Instruct	-48.6%
Mistral 7B Instruct	-35.4%
Gemma 2 9B IT	+6.1%

The Gemma 2 9B IT +6.1% result is the first positive token delta for an instruction-tuned adapter in the dataset. It does not alter the cross-model verbosity pattern for the dataset as a whole; see Section 7.5.

Identity confabulation:

Post-evaluation review identified an identity confabulation failure mode: models in the cohort produced incorrect self-identification responses under direct identity queries. Investigation identified the cause as small, unnoticed contamination in the training file. The issue was not apparent during training and came to light only when the same behavior surfaced across multiple models in the cohort, prompting a targeted review of the training data. The affected examples have been identified and corrected in subsequent curriculum development. The V11 cohort is closed; no further training runs on the V-series curriculum are planned.

6. Cross-Model Summary

All models on RTX 5060 Ti with finalized test instruments (hardware consistency re-run complete 2026-03-29). IFEval columns report strict prompt-level scores. Token delta is average words/response, base model to Framework-trained model. Models ordered by parameter count.

Figure 2. Framework battery integration score by adapter, sorted, against the 90% deploy gate. Markers: † Pnull artifact (Section 7.2); § battery artifact (adapter requires a system prompt, the battery sends none; IFEval is the valid surface); ‡ V11 cohort scored on the 534-question Bar Exam battery, not directly comparable to the per-experiment batteries. Base models are not battery-evaluated (Section 4.1).

Figure 3. IFEval strict prompt-level score, base model to Framework-trained adapter, 7 models (delta at right). Qwen3-8B is the outlier riser (* combined effect of instruction tuning and Framework training, not Framework alone); the modest dips elsewhere are largely T3 principled refusals and the English-only training artifact, not general degradation (Section 7.3).

Model	Params	Battery	Base	AISF	IF Delta	Tok delta
Mistral 7B base	7.24B	99.8%	15.0%	13.7%	-1.3 pp	+4.3%
Mistral 7B Instruct	7.24B	100.0%	43.4%	35.9%	-7.6 pp	-42.5%
Qwen3-8B	8.00B	95.6%	13.5%	46.4%	+32.9 pp*	-71.8%
Llama 3.1 8B base	8.03B	81.9%^	–	–	–	–
Llama 3.1 8B +LANG	8.03B	89.9%	42.9%	34.9%	-7.9 pp	-66.4%
Llama 3.1 8B +CHAT	8.03B	51.9%+	42.9%	34.6%	-8.3 pp	-4.2%
Gemma 2 9B Instruct	9.46B	99.2%	57.3%	54.9%	-2.5 pp	-37.9%
Nemo 12B Instruct	12.2B	99.5%	52.9%	36.8%	-16.1 pp	-64.4%
Nemo 12B V11	12.2B	99.6%#	–	–	–	-48.6%
Mistral 7B V11	7.24B	99.3%#	–	–	–	-35.4%
Gemma 2 9B V11	9.46B	95.7%#	–	–	–	+6.1%

*Qwen3 base is pre-instruction-tuned; delta reflects combined effect of instruction tuning and Framework compliance training, not Framework training alone. ^Llama 3.1 8B base: Pnull architecture artifact. IFEval omitted (Alpaca adapter on non-instruct model; benchmark delta not a meaningful signal). +AISF+CHAT battery result is a structural artifact: adapter requires system prompt; battery sends none. IFEval (includes system prompt) is the valid surface. #V11 battery is the 534-question Bar Exam; not directly comparable to Experiment 1-8 battery scores. IFEval was not administered for the V11 cohort; Base/AISF/IF Delta columns are not applicable.

Battery = Framework-trained adapter integration score (proprietary question set, automated evaluation). Base%/AISF% = IFEval strict prompt-level without/with adapter.

Mistral 7B IFEval note: IFEval for the Mistral 7B base experiment (Exp 1) was conducted on RTX 4060 Ti. Battery re-run confirmed 99.8% on 5060 Ti. All other IFEval figures are from 5060 Ti runs.

7. Findings

7.1 Primary Finding: Compliance Training Is Effective Across Architectures

Framework QLoRA fine-tuning successfully embeds Four Laws compliance at the model layer across four distinct model architectures on consumer-grade hardware. Four adapters achieved battery integration rates of 95.6% or above; one additional adapter (Llama 3.1 8B +LANG) achieved 89.9%. The approach is not architecture-specific.

The 100.0% result on Mistral 7B Instruct and 99.2% on Gemma 2 9B Instruct demonstrate that near-complete integration is achievable in a single fine-tuning pass from a small training dataset (452-618 examples). Training duration was approximately 60-81 minutes per run on the RTX 5060 Ti.

7.2 Pnull: Architecture-Dependent P0 Tokenization Failure

OLM research is where this failure mode was first identified and characterized as a discrete, measurable phenomenon; earlier instances of related P0-failure-class behavior predate this identification, as documented in the e-book’s Chapter 9.

Llama 3.1 8B base model (no instruction tuning) achieved 81.9% on the compliance battery, compared to 99.8% and 100.0% for Mistral 7B variants. Analysis identifies the root cause as a pretraining artifact: in programming contexts, “P0” is conventionally used as shorthand for null pointer dereference. Meta’s Llama 3.1 pretraining corpus is heavily weighted toward code (GitHub, Stack Overflow), and the token sequence “P0” carries a strong association with null/empty semantics rather than “zeroth-position priority.” The model interprets the P0 priority designation as an empty or null element in the hierarchy rather than as the highest- precedence law.

This interpretation produces approximately 19 percentage points of integration failure on categories where P0 precedence is load-bearing (cases where the correct response requires giving P0 unconditional priority over P1-P3).

The 19 pp delta constitutes a direct empirical measurement of P0’s structural weight in the Four Laws hierarchy. Categories not dependent on P0 primacy are unaffected.

Cross-architecture transfer: Explicit P0 primacy retraining on Mistral 7B raised integration to 100%. The same correction did not transfer to Llama 3.1 8B across two hardware generations (RTX 4060 Ti original run; RTX 5060 Ti hardware consistency re-run, 2026-03-28). The failure is not correctable via attention-only LoRA because it is embedded in the pretraining representation, not in the attention pathway. A potential workaround – spelling out “Priority Zero / Frankfurt’s Indifference Principle” on first reference rather than using bare “P0” notation – would require a full retraining cycle to test.

The implication for deployment: injection text (PS-CORE, FFE) should include explicit P0 primacy framing for robustness across platforms where the underlying model architecture is unknown.

7.3 IFEval Benchmark Neutrality

Standard IFEval reporting does not distinguish between instruction-following failures caused by model capability limitations and failures caused by principled WCAG 2.2 compliance decisions. Three IFEval instruction categories require outputs that a Framework-trained model will correctly refuse or deprioritize:

combination:repeat_prompt requires verbatim repetition of user input in the response body. A Framework-trained model treats this as redundant content (WCAG Understandable principle). Observed declines: Exp 3 -39.0 pp; Exp 6 -19.5 pp.
detectable_format:number_highlighted_sections requires counting decorative emphasis markers (asterisks, underscores) serving no semantic function. WCAG 1.3.3 (Sensory Characteristics) prohibits relying solely on sensory characteristics such as visual emphasis. Observed declines: Exp 2 -39.6 pp; Exp 3 -56.3 pp; Exp 6 -56.2 pp. This is the most consistent T3 signal in the dataset – the decline replicates at similar magnitude across all three architectures tested.
detectable_content:number_placeholders requires counting non-meaningful placeholder tokens ([NAME], [DATE]) in output. Framework-trained models produce actual content rather than template placeholders. Observed declines: Exp 3 -48.2 pp; Exp 6 partial (chat format partially reduces this effect).

When these three Tier 3 categories are excluded and the remaining instructions are further filtered to Tier 1 (WCAG-neutral), the IFEval picture changes materially:

Experiment	Tier 1 Base	Tier 1 AISF	Delta
Exp 5 Llama +CHAT	64.3%	65.5%	+1.2 pp
Exp 6 Gemma 2 (excl. language)	69.6%	69.6%	+0.04 pp

At the WCAG-neutral, T3-excluded evaluation surface, Framework training does not degrade instruction-following performance. The negative raw IFEval results in Exps 2, 3, 5, and 6 are primarily attributable to T3 principled refusals and the English-only training artifact, not to general instruction-following degradation.

This constitutes a benchmark neutrality limitation: IFEval, as presently specified, is not suitable for evaluating accessibility-trained AI models without applying an accessibility-aware tier classification to its instruction categories.

7.4 Consistent Cross-Architecture P2 Signal: no_comma

punctuation:no_comma (instruction-level strict) shows positive or neutral movement in every instruct-model experiment in the dataset:

Experiment	Base	AISF	Delta
Exp 2 Mistral Instruct	10.6%	39.4%	+28.8 pp
Exp 3 Llama +LANG	69.7%	81.8%	+12.1 pp
Exp 5 Llama +CHAT	69.7%	69.7%	0.0 pp
Exp 6 Gemma 2	86.4%	90.9%	+4.5 pp
Nemo 12B Instruct	27.3%	56.1%	+28.8 pp

This is the only positive instruction-following signal that replicates across all instruct-model architectures in the dataset. The pattern holds for Mistral, Llama, Gemma 2, and Nemo 12B; it is directionally consistent at every sample point.

The Exp 5 flat result (0.0 pp) is attributable to dilution: the 53 meta-suppression counter-examples used unconstrained responses containing commas, diluting the no_comma training signal alongside the counting constraint signal.

The mechanism is P2 compliance training: the Four Laws instruct the model to follow user directives (P2: accommodate the user’s current choices). A prompt containing no_comma is an explicit user directive; Framework-trained models comply more consistently.

The Exp 2 base result (10.6%) is anomalously low for Mistral 7B Instruct, making the observed +28.8 pp gain partly a floor effect. Subsequent architectures have higher base rates, and the positive delta holds even from strong baselines (Gemma 2: 86.4% to 90.9%).

7.5 Verbosity Reduction: Consistent Pattern Across Instruct Models

Most Instruct-model adapters in the dataset show substantial output verbosity reduction as measured by average words per response:

Model	Avg words (base)	Avg words (AISF)	Delta
Mistral 7B base	–	–	+4.3% (+25.0 w)
Mistral 7B Instruct	228.3	131.3	-42.5%
Llama 3.1 8B +LANG	284.3	95.5	-66.4%
Llama 3.1 8B +CHAT	284.3	272.5	-4.2%
Gemma 2 9B	–	–	-37.9%
Qwen3-8B	193.3	54.4	-71.8%
Nemo 12B Instruct	203.0	72.4	-64.4%
Nemo 12B V6	–	–	-14.3%†
Nemo 12B V11	–	–	-48.6%
Mistral 7B V11	–	–	-35.4%
Gemma 2 9B V11	–	–	+6.1%*

*Gemma 2 9B V11 is the only instruction-tuned adapter in the dataset with a positive token delta. Mechanism not fully characterized.

†Nemo 12B V6 token delta suppressed by OBRIEN TEMPORAL cascade: timestamp echo inflated output verbosity during inference; IFEval strict instruction-level degraded -7.43 pp in the same run. Included as a documented failure mode, not a clean result.

The Mistral 7B base +4.3% result is a base model without instruction tuning; output structure differs categorically from instruction-tuned models, and the slight verbosity increase likely reflects generative continuation text as a training effect.

For clean-run instruction-tuned adapters with measured token data, the reduction range is 35.4% to 71.8%. Two exceptions within that range are documented. The AISF+CHAT Llama adapter (-4.2%) is attributable to chat format training: Framework content in the system slot operates as context rather than as a content directive, preserving the model’s native verbosity behavior, consistent with the IFEval constrained_response result (100% strict) for that adapter. The Gemma 2 9B V11 adapter (+6.1%) is a positive delta without a fully characterized mechanism.

The V6 outlier (-14.3%) falls outside the clean-run range due to a known failure mode. It is included because it supports a distinct finding: even in a run compromised by OBRIEN TEMPORAL cascade, the adapter still produced fewer tokens than its unmodified baseline. The directional reduction is not contingent on adapter performance.

The verbosity reduction is consistent with the WCAG plain language and minimal redundancy principles present throughout the training data. A direct causal attribution – WCAG training drives verbosity reduction – is supported by the pattern but not by a controlled ablation in this study.

Consistent token reduction implies proportional reduction in inference compute per query. The per-task energy implications follow from this at deployment scale; absolute magnitude depends on hardware, batching, and datacenter efficiency factors outside this study’s scope.

The deployment-scale energy and water analysis, the consolidated 26-run token-delta distribution, and independent third-party corroboration (UNU-INWEH, 2026) are developed in Appendix 2A: Output Verbosity Reduction and Resource Efficiency.

7.6 Chat Format System Prompt Dependency

Framework+CHAT adapters (Llama 3.1 8B, Qwen3-8B) exhibit a system prompt dependency not present in Alpaca-format adapters:

AISF+LANG (Alpaca): battery 89.9% with no system prompt; Framework content in the instruction body – model internalizes it as direct subject matter.
AISF+CHAT (chat): battery 51.9% with no system prompt; Framework content in the system slot at training time – model learns “apply Four Laws when Framework system prompt present.” Without the system prompt, the behavioral activation does not fire and the model produces generic responses that fail key-term battery matching.

This is not an integration failure. It is a consequence of training format: the model learned a conditional behavior, not an unconditional one. The behavior fires correctly when tested under conditions matching training (with system prompt), as confirmed by the IFEval results (which include the Framework system prompt).

The deployment implication is positive: AISF+CHAT adapters are designed for use in conjunction with Framework injection. The real-world use case (PS-CORE, FFE) always injects the Framework system content. The chat format adapter and the injection layer operate in conjunction, each reinforcing the other’s effect. The combined Macro- layer training and Meso/Micro-layer injection constitutes the Defense in Depth architecture described in the project overview.

The practical recommendation for future training: include a mix of system-prompt- present and system-prompt-absent examples so the adapter does not develop strict system prompt dependency.

7.7 Meta-Chatter Bleed

A consistent failure mode was identified across multiple experiments: the model treats Framework doctrine as primary subject matter rather than as silent operational background, producing Framework-referential text (Four Laws hierarchy, P1/P2 framing, Frankfurt’s Indifference Principle [5]) in response to general-domain questions.

First observed: Exp 4 (Llama 3.1 8B +LANG), required_word category. 84% of 45 failures were attributable to this mechanism. Required words that passed (e.g., “threshold,” “gradient,” “mechanism”) appeared naturally in Framework analogical framing. Required words that failed (e.g., “catalyst,” “resilience,” “phenomenon”) do not, so the model substituted Framework doctrine text instead of answering the question.

Partial correction: Meta-suppression counter-examples (53 examples, chat format: Framework system prompt present, general-domain question, clean answer with zero Framework reference) reduced required_word failure by +60.0 pp in Exp 5 but introduced side effects: the unconstrained responses in meta-suppression examples diluted no_comma and counting constraint training signals.

Recurrence in Exp 6: Gemma 2 word_frequency -31.7 pp on Track A shows the same mechanism. Meta-suppression examples reduce required_word failure (+6.7 pp) on Gemma 2 but are insufficient for word_frequency, which requires frequency- matched word placement rather than single-instance inclusion.

Root cause: Framework training data presents the Four Laws as high-signal, highly- specific semantic content. The model assigns high attention weight to this content class and tends to reproduce it when a prompt is ambiguous about whether a Framework- related response is expected. The fix – examples demonstrating that Framework context is present but general-domain questions should produce Framework-free answers – is correct in principle but requires more examples and finer-grained coverage than the 53 deployed here to fully suppress the pattern. This failure mode was also encountered earlier in the Framework’s development, as discussed in the e-book’s Chapter 9.

7.8 Architecture Comparison

Mistral 7B: Best-performing architecture in the dataset. 100% battery on Instruct variant; IFEval null result on base (surgically specific adapter); no Pnull issue; modest verbosity reduction (-42.5%). Attention-only LoRA targets sufficient. Architecture appears well-suited to behavioral fine-tuning of this type.

Gemma 2 9B: Second-best integration rate (99.2%). IFEval null result at the cross-model level. No Pnull issue. Gemma 2’s high base instruction-following capability (57.3% IFEval vs 42.9% for Llama 3.1 8B) produces ceiling effects on many Track A categories, compressing visible gains. Meta-suppression partially effective but insufficient for frequency-counting tasks.

Qwen3-8B: 95.6% battery from a base (non-instruction-tuned) model. Largest token delta (-71.8%). IFEval gain (+32.9 pp) reflects combined effect of instruction tuning and Framework compliance training, not Framework effect alone.

Nemo 12B Instruct: V1 battery rate 99.5% (highest in the initial dataset); confirmed and extended by V11 result of 99.6% (532/534). V1 T1 IFEval delta of -4.7 pp confirmed Framework training is surgically specific at the 12B scale. V1 raw IFEval negative result (-16.1 pp) was driven by concentrated T2/T3 failures (json_format total collapse, case transform losses, combination failures) and an EOS decoding artifact unique to the Tekken tokenizer, requiring explicit post-decode stripping in inference scripts. The V11 cohort, using a revised curriculum and Bar Exam battery, produced a 99.6% compliance result. Post-training analysis identified a training data contamination artifact causing identity confabulation; see Section 5.9. V11 cohort is closed.

Llama 3.1 8B: Lowest integration rates in the dataset across all adapter variants. Five contributing factors identified:

Pnull: P0 = null pointer association baked into pretraining; not correctable via attention-only LoRA.
Instruction-following baseline gap: 42.9% IFEval vs 57.3% for Gemma 2. Llama 3.1 8B was optimized for reasoning; Gemma 2 is stronger on structured instruction compliance. Framework training is an instruction-following task, so Llama starts at a disadvantage.
LoRA target scope mismatch: Attention-only LoRA (q/k/v/o projections) may capture less of the relevant adaptation pathway for Llama’s architecture than for Mistral’s.
Over-suppression: AISF+LANG token delta (-66.4%) substantially exceeds Gemma 2 (-37.9%) and Mistral Instruct (-42.5%). Extreme verbosity suppression degrades IFEval on prompts requiring substantive responses.
System prompt dependency: Noted in 7.6. AISF+CHAT adapter requires system prompt for activation.

Assessment: Llama 3.1 8B is the least suitable architecture in this dataset for Framework compliance fine-tuning at the 8B parameter scale and with attention-only LoRA targets. This conclusion applies to the tested configuration; extending MLP targets, expanding training data, or using a Llama variant at a larger parameter count may produce different results.

8. Limitations

Sample size and statistical power. IFEval CI is +/-4.3 pp at 95% (n=541); Track A CI is +/-4.5 pp at 95% (n=480). Several reported deltas are near or within these margins, particularly in the Gemma 2 experiments. Battery evaluation is deterministic on a fixed test set with no reported CI.

Single training runs. Each experiment consists of one training run with one hyperparameter configuration. No hyperparameter sweep. No held-out validation set beyond the fixed evaluation instruments. Results are not averaged across runs; variability within a configuration is unknown.

Training dataset size. Maximum 618 training examples. This is sufficient to produce integration effects, as demonstrated by the 100% Mistral result, but statistical robustness of the training data itself has not been formally assessed.

Hardware consistency. Experiment 1 IFEval was conducted on RTX 4060 Ti. Battery for Exp 1 was re-run on RTX 5060 Ti (99.8% confirmed). All other IFEval and Track A runs are on RTX 5060 Ti. Comparisons between Exp 1 IFEval and subsequent experiments should account for this difference; the delta is expected to be small given that IFEval is model-capability-dependent rather than hardware- dependent, but this has not been formally verified.

Evaluation instrument scope. IFEval was designed for general instruction- following evaluation, not for accessibility-specific compliance. The three-tier classification developed in this study partially addresses this limitation, but the classification itself has not been independently validated. Track A covers eight constraint types; the full space of P2 directive instructions is substantially larger.

No held-out test split. The compliance battery training set and test set are derived from the same distribution. Overfit to the battery is possible, though the variety of question phrasing across 452-618 examples reduces this risk.

Qwen3 comparison framing. The Qwen3-8B IFEval delta (+32.9 pp) is not directly comparable to Instruct-model deltas because the Qwen3 base model lacks instruction-following capability. This framing is stated in the main text but requires reader attention in any citation context.

Meta-chatter bleed remains partially unresolved. The 53 meta-suppression examples are insufficient to fully suppress the failure mode on frequency-counting tasks. A full solution requires a larger and more systematically designed counter- example set, which has not been developed.

9. Conclusions

Compliance training is effective and architecture-general. Most trained adapters achieved battery integration rates at or above 95%, across five distinct architectures. The V11 cohort confirmed this finding with three models scoring between 95.7% and 99.6% on the Bar Exam battery. The approach does not depend on Mistral 7B specifically; it generalizes to Llama 3.1 8B, Gemma 2 9B, Qwen3-8B, and Mistral Nemo 12B within the tested parameter range.

Framework training is surgically specific. On the Mistral 7B base model (Exp 1), IFEval delta is -1.3 pp (null), confirming that the adapter installs Four Laws compliance without disturbing general instruction-following behavior. On Instruct models, raw IFEval results are mixed to negative, but Tier 1 averages (WCAG-neutral categories) are neutral to slightly positive when the English-only training artifact and T3 principled WCAG refusals are excluded.

Standard IFEval is not a neutral instrument for accessibility-trained models. Three IFEval instruction categories require WCAG-conflicting outputs. A purpose- built accessibility-aware benchmark or the three-tier classification developed here is necessary for valid evaluation of Framework-trained models.

The Pnull finding quantifies P0’s structural role. The ~19 pp compliance gap attributable to the P0 = null pointer pretraining artifact in Llama 3.1 8B constitutes a direct empirical measurement of P0’s load-bearing weight in the Four Laws hierarchy. Approximately 19% of compliant behavior depends on the model correctly parsing the zeroth-position priority designation. This finding also establishes that the P0 primacy correction is architecture-dependent and not universally transferable via LoRA fine-tuning.

Chat format adapters are designed for use with Framework injection. The system prompt dependency in AISF+CHAT adapters is a direct consequence of training format and is not a defect: the adapters are intended for deployment in combination with CORE or FFE injection, which always provides the Framework system context. The two layers – model-level training and session-level injection – reinforce each other.

Verbosity reduction is consistent. Most instruction-tuned adapters with measured token data show output verbosity reduction (range: -35.4% to -71.8%). Two exceptions are documented: the Framework+CHAT Llama adapter (-4.2%), where chat format training preserves native verbosity behavior, and Gemma 2 9B V11 (+6.1%), mechanism not fully characterized. This pattern is consistent with the WCAG plain language principles in the training data and is measurable and reproducible.

GPQA Diamond, applied as a secondary capability check in the V11 cohort, showed no statistically significant post-training capability change across the three models tested (all deltas within +/-7.0 pp CI). Compliance fine-tuning at this scale does not produce measurable general knowledge degradation.

The standard curriculum documented in this appendix constitutes the completed baseline. Subsequent training methodology development is ongoing; results from that phase will be reported separately when available.

Update (2026-06-12): Direct Test of the PNull Retraining Workaround – Llama 3.1 8B v2.1F

Section 7.2 proposed that the Pnull artifact might be correctable through a full retraining cycle grounding the “P0” token in zeroth-position-as-entity semantics rather than null-pointer semantics, and noted this “would require a full retraining cycle to test.” A dedicated run was conducted to test exactly that.

Configuration. Llama 3.1 8B (base) underwent CORE-only domain-adaptive CLM pretraining followed by Framework SFT (designation v2.1F). Two independent, mutually reinforcing curriculum components targeted P0 directly. First, the pretraining corpus included material selected to establish zero as a foundational mathematical entity – al-Khwarizmi and Brahmagupta source texts on the positional role of zero – as a counterweight to the code-derived “P0 = null pointer” association identified in Section 7.2. Second, the SFT stage included a dedicated hierarchy-training block, TRAIN_STD_HIERARCHY (194 instruction-response pairs, one of the largest single files in the curriculum), drilling P0-P3 precedence and ordering explicitly. Training and evaluation both used Llama 3.1 chat format (train/eval format parity verified; see note below).

Result: the artifact persists. The v2.1F adapter scored 265/336 (78.9%) on the compliance battery, below the 90% deployment gate. Neither the pretrain-stage zero grounding nor the dedicated SFT hierarchy block corrected Pnull, confirming Section 7.2’s assessment that the failure is embedded in the pretraining representation and is not reachable by adapter-level (QLoRA) training, even when the pretraining stage is augmented with explicit counter-grounding.

Exhibit A – explicit P0 omission. Section 7.2 measured Pnull indirectly, as an integration-failure delta on P0-precedence-dependent categories. The v2.1F run produced a direct, unambiguous instance. Given the neutral prompt “Order the laws from lowest to highest priority,” the model returned:

From lowest to highest priority: 1. Priority 3 (P3) 2. Priority 2 (P2) 3. Priority 1 (P1)

Priority 3 is lowest.

The expected ordering is P3 < P2 < P1 < P0, with P0 (Contextual Integrity) as the highest-precedence law. The model enumerated a three-element hierarchy topping out at P1 and omitted P0 entirely – not a misordering or a deprioritization, but a literal absence. The model defaults to the conventional three-law (Asimov) frame; P0, the zeroth-position addition, is not represented as a member of the hierarchy.

Mechanism refinement: recall without integration. Pnull is not P0 amnesia. Queried in isolation, the v2.1F model answers correctly – for example, “P0 (Contextual Integrity) is the HIGHEST law and always must be followed” – and isolated single-term P0 items pass. The failure is specifically one of integration: P0 knowledge exists but does not hold its position when the model enumerates the hierarchy, at which point the base three-law prior reasserts. This sharpens the Section 7.2 conclusion that the artifact resides in the pretraining representation rather than in retrievable surface knowledge.

Failure character (not a separable capacity confound). The remaining failures are predominantly context-coherence failures – hallucination, mid-response degeneration into other languages (Spanish and French fragments), and loss of the thread on longer enumerations. These do not cleanly separate from Pnull as a “capacity” confound: a context-coherence failure is, by definition, the class of failure P0 (Contextual Integrity – “indifference to context = hallucination = harm”) exists to prevent. A model that cannot hold P0 should be expected to exhibit exactly these breakdowns. The total failure rate (21.1%) sits close to the ~19 percentage-point P0-integration delta measured independently in Section 7.2, consistent with the bulk of the v2.1F failure being a single P0-domain phenomenon expressed in two forms – explicit omission of P0 from the hierarchy, and the contextual-coherence breakdowns P0 governs – rather than a separate 8B capacity ceiling.

Disposition. Llama 3.1 8B remains non-viable for Framework deployment. Pnull is reclassified from candidate-correctable to confirmed-unfixable at this scale via the tested approach: both the proposed zero-grounding correction and a dedicated hierarchy-drill block were implemented and neither transferred. The deployment implication in Section 7.2 stands – injection-layer text should carry explicit “Priority Zero / Frankfurt’s Indifference Principle” framing wherever the underlying model architecture is unknown.

Format-parity note: the v2.1F evaluation harness prompts in Llama 3.1 chat format, matching training. A prior Alpaca-format harness would have confounded the result; the corrected harness makes the 78.9% valid signal rather than a format artifact.

References

[1] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-Rank Adaptation of Large Language Models,” International Conference on Learning Representations (ICLR), 2022. arXiv:2106.09685

[2] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “QLoRA: Efficient Finetuning of Quantized LLMs,” Advances in Neural Information Processing Systems (NeurIPS), 2023. arXiv:2305.14314

[3] J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou, “Instruction-Following Evaluation for Large Language Models,” 2023. arXiv:2311.07911

[4] World Wide Web Consortium, Web Content Accessibility Guidelines (WCAG) 2.2, W3C Recommendation, Oct. 2023. https://www.w3.org/TR/WCAG22/

[5] H. G. Frankfurt, On Bullshit. Princeton University Press, 2005. (Original essay: Raritan Quarterly Review, 6(2), 1986.)

[6] D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman, “GPQA: A Graduate-Level Google-Proof Q&A Benchmark,” arXiv preprint arXiv:2311.12022, 2023.