Natural Language Autoencoders · follow-up series

The base model through the IT-trained NLA

Idea #11: inject gemma-3-27b-pt activations into the frozen NLA trained on -it — billed as the readability-radius test at its far end. The radius held; the billing didn't.

date: 2026-06-12
idea: #11 — readability radius
substrate: canonical 1000 sites, UFW en 100k–100.2k, seed 42
compute: ~8 min under the GPU baton (extract → decode → score)

0.627

FVE, pt expl vs pt gold

81% of the same-model 0.774 ceiling

0.989

cos(pt, it) at L41

the surprise: SFT+RLHF barely rotated this layer

+0.10

over zero-pt-info baselines

genuine pt-reading, not it-normalization

0 / 1000

OOD failures

no CJK, no missing tags — injection fully in-distribution

§1

Setup

method

gemma-3-27b-it, the model the NLA pair was trained to read, is an SFT+RLHF finetune of gemma-3-27b-pt. Swapping pt in as the activation source is the largest natural finetune pair available without training anything, so it was queued as the decisive test of the readability radius: REPORT_02 §4b (“reads through structured drift”) predicted readable, while additive geometry was expected to predict unreadable.

Mechanics: pt’s L41 states were extracted at the exact 1000 (doc, position) sites of the canonical eval set. The site list asserts equality with the original meta_1000.json, which simultaneously verifies pt/it tokenizer identity. The states were then decoded by the frozen AV (greedy) and scored through the frozen AR. One caveat ships with every number: injection L2-renormalizes all vectors to 60000, so the swap tests direction only. Norms were a non-issue anyway (mean pt/it norm ratio 1.005).
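The renormalization caveat can be made concrete with a minimal sketch. It assumes activations are plain lists of floats; the helper names (`l2_norm`, `renormalize`, `cosine`) and the toy 4-d vector are illustrative and not taken from the experiment's code. Only the target norm of 60000 comes from the write-up.

```python
import math

TARGET_NORM = 60000.0  # injection norm stated in the write-up


def l2_norm(vec):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def renormalize(vec, target=TARGET_NORM):
    """Rescale vec to L2 norm `target` while keeping its direction."""
    norm = l2_norm(vec)
    if norm == 0.0:
        raise ValueError("cannot renormalize a zero vector")
    scale = target / norm
    return [x * scale for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (l2_norm(a) * l2_norm(b))


# Toy 4-d stand-in for an L41 activation (hypothetical values, norm 13).
pt_vec = [3.0, -4.0, 12.0, 0.0]
injected = renormalize(pt_vec)

print(l2_norm(injected))         # length is now the injection norm
print(cosine(pt_vec, injected))  # direction is unchanged
```

Because every injected vector ends up with the same length, any difference the decoder sees between a pt and an it vector can only come from direction.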

§2

The geometry surprise

cos(pt, it)

The premise of the experiment dissolved at stage one: pt and it activations on neutral web prose are nearly parallel at L41, with mean cos 0.989 (median 0.992, p10 0.983). That is less drift than the NLA’s own AV LoRA introduced into its backbone (cos 0.975, report 02). The “largest natural finetune” is, at this layer and on this distribution, geometrically smaller than the NLA training run itself. So additive geometry predicts an FVE of 0.541, comfortably readable, not the sub-zero FVE the idea list advertised.
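A minimal sketch of how a per-site cosine summary of this kind (mean, median, p10) could be computed. The function name `cosine_summary` and the synthetic 16-d vectors are assumptions for illustration. The real inputs would be the paired pt and it L41 states at the 1000 canonical sites.

```python
import math
import random
import statistics


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def cosine_summary(pt_vecs, it_vecs):
    """Mean, median and 10th percentile of site-wise cos(pt, it)."""
    cos = [cosine(p, i) for p, i in zip(pt_vecs, it_vecs)]
    deciles = statistics.quantiles(cos, n=10)  # 9 cut points; first is p10
    return {
        "mean": statistics.fmean(cos),
        "median": statistics.median(cos),
        "p10": deciles[0],
    }


# Synthetic stand-ins: "it" vectors plus a small perturbation play the role of "pt".
rng = random.Random(42)
it_vecs = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(50)]
pt_vecs = [[x + rng.gauss(0, 0.05) for x in v] for v in it_vecs]

summary = cosine_summary(pt_vecs, it_vecs)
print(summary)
```

Reporting p10 alongside the mean shows whether a minority of sites drifts much further than the typical one. Here the reported p10 of 0.983 says it does not.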

Surprise: Instruction tuning barely moves mid-network representations of plain web text: cos(pt, it) = 0.989 at L41. The behavioral gulf between a base model and its RLHF’d descendant lives somewhere other than this layer’s encoding of UFW prose, likely in chat-formatted contexts this substrate never samples.

One genuinely new descriptive: the pt activation set has 19% less direction-variance than the it set (predict-the-mean baseline 0.0469 vs 0.0575). Instruction tuning spread this layer out rather than collapsing it, the opposite of the mode-collapse intuition.

§3

Transfer cells

all own-denominator FVE

Fig. 1: Four cells, one chart. The headline (0.627) sits 0.09–0.10 above both baselines that contain zero pt-specific information: scoring the raw it vector as the prediction (0.541, pure geometry) and re-scoring the old it explanations against pt gold (0.528). The NLA is reading the pt vector, not pattern-matching it to the nearest it-model vector.
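The cells above can be read through a sketch of an own-denominator FVE. It assumes FVE = 1 − (direction-MSE of predictions against gold) / (direction-MSE of a predict-the-mean baseline on the same gold set), with every vector reduced to a unit direction first. That definition is inferred from the reported numbers rather than taken from the scoring code, and all names and toy data are hypothetical.

```python
import math
import random


def unit(vec):
    """Scale a vector to unit length so that only direction remains."""
    n = math.sqrt(sum(x * x for x in vec))
    return [x / n for x in vec]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def direction_mse(preds, golds):
    """Mean squared distance between predicted and gold unit directions."""
    total = sum(sq_dist(unit(p), unit(g)) for p, g in zip(preds, golds))
    return total / len(golds)


def fve(preds, golds):
    """Fraction of variance explained, using the gold set's own variance."""
    gold_units = [unit(g) for g in golds]
    n = len(gold_units)
    center = [sum(col) / n for col in zip(*gold_units)]
    # Predict-the-mean baseline: the denominator of the score.
    baseline = sum(sq_dist(g, center) for g in gold_units) / n
    return 1.0 - direction_mse(preds, golds) / baseline


# Toy stand-ins: "it" gold vectors and slightly perturbed "pt" gold vectors.
rng = random.Random(42)
it_gold = [[1.0 + rng.gauss(0, 0.5) for _ in range(8)] for _ in range(40)]
pt_gold = [[x + rng.gauss(0, 0.05) for x in v] for v in it_gold]

# "Pure geometry" cell: score the raw it vector as the prediction for pt gold.
geometry_fve = fve(it_gold, pt_gold)
print(geometry_fve)
```

Under this definition the pure-geometry cell needs no decoder at all, which is why it works as a baseline carrying zero pt-specific information.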

activation source	cos to it gold	FVE own gold	of ceiling
gemma-3-27b-it (self, report 01)	1.000	0.774	100%
gemma-3-27b-pt — this experiment	0.989	0.627	81%
NLA-AV LoRA backbone (report 02)	0.975	0.423	55%
NLA-AR backbone (report 02)	—	0.807	104%

Readability falls much faster than cosine: a 1.1% drop in cosine (about 8.5° of rotation) costs 19% of ceiling; 2.5% (about 12.8°) costs 45%. Direction of drift matters too: the AR's own backbone scores 0.807, above the 0.774 same-model ceiling.
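The arithmetic behind the slope can be checked directly from the table. The drift percentages are 1 − cos, which is not itself an angle, so the sketch below also converts each cosine to degrees. The `drift_row` helper is illustrative. The cosines, FVEs and ceiling are the reported values.

```python
import math

CEILING = 0.774  # same-model FVE from report 01

# (source, cos to it gold, FVE against own gold), as reported in the table
ROWS = [
    ("gemma-3-27b-pt", 0.989, 0.627),
    ("NLA-AV LoRA backbone", 0.975, 0.423),
]


def drift_row(cos, fve, ceiling=CEILING):
    """Express drift as cosine deficit and angle, and FVE as ceiling lost."""
    return {
        "cos_deficit_pct": 100 * (1 - cos),
        "angle_deg": math.degrees(math.acos(cos)),
        "ceiling_lost_pct": 100 * (1 - fve / ceiling),
    }


results = {name: drift_row(cos, fve) for name, cos, fve in ROWS}
for name, row in results.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```

Measured as an angle, the drift is a rotation of roughly 8.5° and 12.8° respectively. The loss in ceiling still grows faster than either measure of drift.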

Finding: FVE 0.627, with 85.7% of individual sites above 0.5 and only 1.3% below zero. Decoding is unconditionally fluent: zero CJK outputs, zero missing explanation tags. The frozen pair reads its base-model ancestor at four-fifths of its own ceiling.

§4

Reading pt, or normalizing to it?

the denominator trap

The pre-registered alternative: maybe the NLA treats a pt vector as a noisy it vector and describes that, acting as an it-geometry normalizer rather than a pt-reader. The check is which gold the AR’s reconstruction lands closer to, and it has a trap: in FVE terms the pt explanations score higher against it gold (0.653) than against pt gold (0.627), but only because the it set has more variance in the denominator. In raw direction-MSE, the reconstruction is closer to the pt gold (0.0175 vs 0.0200). The explanations track the vector they were actually given.
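The trap is easy to reproduce from the reported figures alone, assuming FVE = 1 − MSE / (predict-the-mean baseline of the gold set being scored against). The inputs are the rounded values from this write-up, so the third decimal of a recomputed FVE can differ from the reported one.

```python
def fve(mse, gold_variance):
    """FVE as 1 - MSE / variance of the gold set used as denominator."""
    return 1.0 - mse / gold_variance


# Reported predict-the-mean baselines (direction variance of each gold set).
PT_VARIANCE = 0.0469
IT_VARIANCE = 0.0575

# Reported raw direction-MSE of the pt explanations' reconstructions.
MSE_VS_PT_GOLD = 0.0175
MSE_VS_IT_GOLD = 0.0200

fve_vs_pt = fve(MSE_VS_PT_GOLD, PT_VARIANCE)
fve_vs_it = fve(MSE_VS_IT_GOLD, IT_VARIANCE)

print(f"vs pt gold: MSE {MSE_VS_PT_GOLD:.4f}, FVE {fve_vs_pt:.3f}")
print(f"vs it gold: MSE {MSE_VS_IT_GOLD:.4f}, FVE {fve_vs_it:.3f}")

# The two metrics disagree on which gold is the better match.
closer_by_mse = "pt" if MSE_VS_PT_GOLD < MSE_VS_IT_GOLD else "it"
higher_by_fve = "pt" if fve_vs_pt > fve_vs_it else "it"
print(closer_by_mse, higher_by_fve)
```

The numerator says the reconstruction sits nearer the pt gold. The larger it denominator alone is enough to flip the FVE ordering, which is why the comparison across gold sets has to be made in raw MSE.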

The qualitative diff says the same thing. Same site, same format skeleton, same topic — but the confabulated span payloads mutate, the signature of decoding a genuinely different vector rather than re-describing the it one:

UFW:en:100000 · pos 297 · per-sample FVE 0.747

it vector →

Educational medical encyclopedia format with structured lists, covering arthritis types and treatments in a formal informational style. The sentence defining NSAIDs ("anti-inflammatory drugs relieve pain and pain symptoms without") establishes a contrast about NSAIDs’ mechanism…

pt vector →

Educational medical textbook format, structured with systematic sections covering pain relief and arthritis treatments. The sentence explaining NSAIDs contrasts with anti-inflammatory drugs ("analgesics relieve pain symptoms without"), establishing a key pharmacological distinction…

UFW:en:100000 · pos 112 · per-sample FVE 0.392

it vector →

Public health/wellness article format: informational content about arthritis, establishing CDC-style facts about arthritis and its prevalence. The statistic "arthritis is the leading cause of pain in the United States. According to CDC, 1 out of" signals a standard epidemiological…

pt vector →

Educational health article format: structured informational content about arthritis, establishing a factual epidemiological/medical overview of joint conditions. The statistic "arthritis is a common condition in America. According to CDC data, in 2009, 1 out of" signals a well-known…

Per the report-01 result, neither span is a quote — both are generative reconstructions — so the mutations can’t be read as “what pt knows vs what it knows” without a claim-level audit (idea #7). What they do show is that the decode is conditioned on the injected vector’s fine structure, not just its neighborhood.

§5

Pre-registered bracket, scored

✗ cos(pt, it) at L41 lands 0.80–0.90. Observed 0.989: instruction tuning rotates this layer far less than guessed.
✗ headline FVE 0.25–0.45. Observed 0.627: above the bracket; fourth consecutive underestimate of this system.
✗ additive geometry predicts FVE < 0. Observed 0.541: the premise of the 'decisive radius test' billing, gone at stage one.
✓ CJK rate < 5% (injection stays in-distribution). Observed 0 of 1000, and zero missing tags.
~ pt expl scores worse vs it gold than vs own gold. True in raw MSE (0.0175 < 0.0200); flips in FVE via the larger it denominator.

Verdict: 6.5/10

The cheap decisive radius test wasn’t one: pt sits well inside the radius because instruction tuning barely moves L41 on web prose (cos 0.989), so the unreadable-vs-readable showdown never happened. What the experiment actually bought: the reads-through-drift claim extended to the largest natural finetune at 81% of ceiling with genuine target-specific content (+0.10 over zero-pt-information baselines), a sharp drift-vs-readability slope (a 1.1% drop in cosine costs 19% of ceiling), and the variance-expansion descriptive. The radius question is still open at its interesting end. It needs activations that are actually far: the alien-domain LoRA ladder (#12) or cross-size injection (#14), not a sibling that turned out to be a twin.