The base model through the IT-trained NLA
Idea #11: inject gemma-3-27b-pt activations into the frozen NLA trained on -it — billed as the readability-radius test at its far end. The radius held; the billing didn't.
- date: 2026-06-12
- idea: #11, readability radius
- substrate: canonical 1000 sites, UFW en 100k–100.2k, seed 42
- compute: ~8 min under the GPU baton (extract → decode → score)
Setup
gemma-3-27b-it, the model the NLA pair was trained to read, is an SFT+RLHF finetune of gemma-3-27b-pt. The pt/it pair is the largest natural finetune pair available without training anything, so swapping pt in as the activation source was queued as the decisive test of the readability radius: REPORT_02 §4b (“reads through structured drift”) predicted readable, while additive geometry was expected to predict unreadable.
Mechanics: pt’s L41 states were extracted at the exact 1000 (doc, position) sites of the canonical eval set, then decoded by the frozen AV (greedy) and scored through the frozen AR. The site list is asserted equal to the original meta_1000.json, which simultaneously verifies pt/it tokenizer identity. One caveat ships with every number: injection L2-renormalizes all vectors to 60000, so the swap tests direction only. Norms were a non-issue anyway (mean pt/it norm ratio 1.005).
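The renormalization caveat can be illustrated with a minimal sketch. Only the target norm of 60000 comes from the report; the function names and toy vectors are hypothetical, and the real injection code may differ.

```python
import math

TARGET_NORM = 60000.0  # injection norm stated in the report


def l2_norm(vec):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def l2_renormalize(vec, target_norm=TARGET_NORM):
    """Rescale vec to the target L2 norm; its direction is unchanged."""
    norm = l2_norm(vec)
    if norm == 0.0:
        raise ValueError("cannot renormalize a zero vector")
    scale = target_norm / norm
    return [x * scale for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (l2_norm(a) * l2_norm(b))


# Toy vectors (hypothetical): same direction, magnitudes differing by 10x.
short_vec = [3.0, 4.0, 0.0]
long_vec = [30.0, 40.0, 0.0]

short_injected = l2_renormalize(short_vec)
long_injected = l2_renormalize(long_vec)
```

After renormalization the two toy vectors are indistinguishable, which is why any norm difference between pt and it cannot show up in the scores.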
The geometry surprise
The premise of the experiment dissolved at stage one: pt and it activations on neutral web prose are nearly parallel at L41, with mean cos 0.989 (median 0.992, p10 0.983). That is less drift than the NLA’s own AV LoRA introduced into its backbone (cos 0.975, report 02). The “largest natural finetune” is, at this layer and on this distribution, geometrically smaller than the NLA training run itself. So additive geometry predicts 0.541, comfortably readable, not the sub-zero FVE the idea list advertised.
One genuinely new descriptive: the pt activation set has 19% less direction-variance than the it set (predict-the-mean baseline 0.0469 vs 0.0575). Instruction tuning spread this layer out rather than collapsing it, the opposite of the mode-collapse intuition.
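A sketch of how the two descriptives in this section could be computed: the per-site cosine summary and the predict-the-mean baseline. The report does not spell out its exact definitions, so this assumes the baseline is the mean squared Euclidean distance between unit-normalized vectors and their mean direction; all function names and toy inputs are hypothetical.

```python
import math
import statistics


def unit(vec):
    """Return vec scaled to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def cosine_summary(pt_vecs, it_vecs):
    """Mean, median and 10th percentile of per-site cos(pt, it)."""
    cosines = [cosine(p, i) for p, i in zip(pt_vecs, it_vecs)]
    return {
        "mean": statistics.mean(cosines),
        "median": statistics.median(cosines),
        # quantiles(n=10) returns the 9 decile cut points; [0] is p10.
        "p10": statistics.quantiles(cosines, n=10)[0],
    }


def mean_baseline_mse(vecs):
    """Assumed baseline: error of predicting the mean direction for every site."""
    units = [unit(v) for v in vecs]
    dim = len(units[0])
    mean_dir = [sum(u[k] for u in units) / len(units) for k in range(dim)]
    return statistics.mean(
        sum((u[k] - mean_dir[k]) ** 2 for k in range(dim)) for u in units
    )


# Toy sets (hypothetical): one collapsed onto a single direction, one spread out.
tight_set = [[1.0, 0.0], [2.0, 0.0]]
spread_set = [[1.0, 0.0], [0.0, 1.0]]

tight_baseline = mean_baseline_mse(tight_set)
spread_baseline = mean_baseline_mse(spread_set)
self_summary = cosine_summary(spread_set, spread_set)
```

Under this definition a set with more directional spread has a larger baseline, which is the sense in which 0.0575 (it) against 0.0469 (pt) reads as the it set being more spread out.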
Transfer cells
| activation source | cos to it gold | FVE vs own gold | % of ceiling |
|---|---|---|---|
| gemma-3-27b-it (self, report 01) | 1.000 | 0.774 | 100% |
| gemma-3-27b-pt — this experiment | 0.989 | 0.627 | 81% |
| NLA-AV LoRA backbone (report 02) | 0.975 | 0.423 | 55% |
| NLA-AR backbone (report 02) | — | 0.807 | 104% |
Readability falls much faster than cosine: a 1.1% drop in cosine (1 − cos = 0.011) costs 19% of ceiling; a 2.5% drop costs 45%. Direction of drift matters too: the AR's own backbone scores 104% of ceiling, above both it itself and pt.
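The percentages above can be re-derived from the table. The sketch below assumes that drift means 1 − cos and that the last column is FVE divided by the self-transfer FVE of 0.774; the variable names are illustrative. The angle in degrees is added only for orientation and does not appear in the report.

```python
import math

CEILING_FVE = 0.774  # gemma-3-27b-it self-transfer FVE (report 01)

# (cos to it gold, FVE vs own gold), copied from the table.
rows = {
    "gemma-3-27b-pt": (0.989, 0.627),
    "NLA-AV LoRA backbone": (0.975, 0.423),
}


def cosine_deficit(cos_value):
    """Drift expressed as 1 - cos."""
    return 1.0 - cos_value


def ceiling_fraction(fve, ceiling=CEILING_FVE):
    """FVE as a fraction of the self-transfer ceiling."""
    return fve / ceiling


summary = {}
for name, (cos_value, fve) in rows.items():
    summary[name] = {
        "deficit_pct": 100.0 * cosine_deficit(cos_value),
        "angle_deg": math.degrees(math.acos(cos_value)),
        "ceiling_pct": 100.0 * ceiling_fraction(fve),
        "ceiling_lost_pct": 100.0 * (1.0 - ceiling_fraction(fve)),
    }

for name, stats in summary.items():
    print(name, {key: round(value, 1) for key, value in stats.items()})
```

The cosine deficits of 1.1% and 2.5% correspond to angles of roughly 8.5 and 12.8 degrees, so the drift is less negligible in angular terms than the cosine values suggest.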
Reading pt, or normalizing to it?
The pre-registered alternative: maybe the NLA treats a pt vector as a noisy it vector and describes that, acting as an it-geometry normalizer rather than a pt-reader. The check is which gold the AR’s reconstruction lands closer to, and it has a trap: in FVE terms the pt explanations score higher against it gold (0.653) than against pt gold (0.627), but only because the it set has more variance in the denominator. In raw direction-MSE, the reconstruction is closer to the pt gold (0.0175 vs 0.0200). The explanations track the vector they were actually given.
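The denominator trap can be reproduced from the reported numbers. This sketch assumes FVE = 1 − MSE / baseline, with the predict-the-mean baseline of the gold set in the denominator; the constant names are illustrative, and the inputs are the rounded values quoted in this report.

```python
def fve(mse, baseline_mse):
    """Fraction of variance explained, assuming FVE = 1 - MSE / baseline."""
    return 1.0 - mse / baseline_mse


# Raw direction-MSE of the pt explanations' reconstructions (from the report).
MSE_VS_PT_GOLD = 0.0175
MSE_VS_IT_GOLD = 0.0200

# Predict-the-mean baselines of each gold set (from the report).
BASELINE_PT = 0.0469
BASELINE_IT = 0.0575

fve_vs_pt = fve(MSE_VS_PT_GOLD, BASELINE_PT)
fve_vs_it = fve(MSE_VS_IT_GOLD, BASELINE_IT)

# Raw MSE says the reconstruction is closer to pt gold...
closer_to_pt_in_mse = MSE_VS_PT_GOLD < MSE_VS_IT_GOLD
# ...while FVE ranks it gold higher, purely via the larger it baseline.
higher_vs_it_in_fve = fve_vs_it > fve_vs_pt

print(round(fve_vs_pt, 3), round(fve_vs_it, 3))
```

With these rounded inputs the formula gives 0.627 against pt gold and about 0.652 against it gold; the small gap to the reported 0.653 is presumably rounding in the inputs.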
The qualitative diff says the same thing. Same site, same format skeleton, same topic — but the confabulated span payloads mutate, the signature of decoding a genuinely different vector rather than re-describing the it one:
Educational medical encyclopedia format with structured lists, covering arthritis types and treatments in a formal informational style. The sentence defining NSAIDs ("anti-inflammatory drugs relieve pain and pain symptoms without") establishes a contrast about NSAIDs’ mechanism…
Educational medical textbook format, structured with systematic sections covering pain relief and arthritis treatments. The sentence explaining NSAIDs contrasts with anti-inflammatory drugs ("analgesics relieve pain symptoms without"), establishing a key pharmacological distinction…
Public health/wellness article format: informational content about arthritis, establishing CDC-style facts about arthritis and its prevalence. The statistic "arthritis is the leading cause of pain in the United States. According to CDC, 1 out of" signals a standard epidemiological…
Educational health article format: structured informational content about arthritis, establishing a factual epidemiological/medical overview of joint conditions. The statistic "arthritis is a common condition in America. According to CDC data, in 2009, 1 out of" signals a well-known…
Per the report-01 result, neither span is a quote — both are generative reconstructions — so the mutations can’t be read as “what pt knows vs what it knows” without a claim-level audit (idea #7). What they do show is that the decode is conditioned on the injected vector’s fine structure, not just its neighborhood.
Pre-registered bracket, scored
- ✗ cos(pt, it) at L41 lands 0.80–0.90: observed 0.989; instruction tuning rotates this layer far less than guessed
- ✗ headline FVE 0.25–0.45: observed 0.627, above the bracket; fourth consecutive underestimate of this system
- ✗ additive geometry predicts FVE < 0: it predicts 0.541; the premise of the 'decisive radius test' billing, gone at stage one
- ✓ CJK rate < 5% (injection stays in-distribution): 0 of 1000, and zero missing tags
- ~ pt expl scores worse vs it gold than vs own gold: true in raw MSE (0.0175 < 0.0200); flips in FVE via the larger it denominator