Natural Language Autoencoders · follow-up series
03

Corroboration series — the verdict edition

The scored assessment of Anthropic's NLA release: what reproduced, what the system actually is, and the one number that summarizes it.

date: 2026-06-10/11
hardware: 2×H200 141GB
subject: nla-gemma3-27b-L41-{av,ar} · qwen2.5-7b-L20
eval: 1000 acts · UFW en held-out · seed 42 · FVE
Gemma FVE (ours): 0.775 (released 0.763, reproduced)
Qwen FVE (ours): 0.752 (released 0.752, exact match)
best-of-10 ceiling: 0.799 (greedy +0.024, converged)
cross-family FVE: 0.592 (beats 20k-pair ridge at 0.422)
Verdict: 8.5/10
The release is solid: both published FVE numbers reproduce on held-out data to within noise (one exactly), the eval protocol is internally consistent, and the policy is converged (no cherry-picking headroom — best-of-10 adds 0.024). The most interesting things we found are not in the paper’s headline: the explanations are confabulated reconstructions wearing quotation syntax; the AR reads them as a semi-formal code (delimiters as type tags, prose as relational structure) rather than as free natural language; the input side is robust (adjacent layers transfer at geometric cost only) while the text side is brittle; and the text bottleneck transfers across model families well enough to beat a 20k-pair linear adapter — the strongest evidence in this series that what the NLA extracts is real, model-general semantic content.

Everything below is the evidence file for that paragraph, in compressed form. Full protocol: 1000 activations from Ultra-FineWeb en docs 100000–102000 (held out; training used 0–100k), repo stage-0 sampling, seed 42, extraction at decoder block 41 = HF hidden_states[42]. FVE = 1 − NMSE/Var. One framing caveat governs every table: Gemma L41 is a low-variance layer (Var 0.0579), so cosine compresses into 0.95–0.99 for everything from perfect to garbage — read FVE only.
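The metric can be sketched in a few lines. The definitions here are an assumption inferred from the reported numbers, not taken from the repo: rows are unit-normalized, NMSE is the mean squared L2 distance between prediction and target, and Var is the mean squared distance of the targets from their mean. Under that reading NMSE = 2(1 − cos) per pair, which agrees with the reported cos 0.9935 and NMSE 0.0130, and a predict-the-mean baseline at Var 0.0579 already sits near cos 0.97, which is why cosine says so little at this layer. The data below is synthetic.

```python
import numpy as np

def unit(x):
    """L2-normalize each row."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def fve(pred, target):
    """Return (FVE, NMSE, Var) on unit-normalized rows (assumed definitions)."""
    p, t = unit(pred), unit(target)
    nmse = np.mean(np.sum((p - t) ** 2, axis=1))
    var = np.mean(np.sum((t - t.mean(axis=0)) ** 2, axis=1))
    return 1.0 - nmse / var, nmse, var

# Toy stand-in for activations: a shared direction plus per-sample noise
rng = np.random.default_rng(42)
target = rng.normal(size=(1000, 64)) + 4.0
pred = target + 0.3 * rng.normal(size=target.shape)

score, nmse, var = fve(pred, target)
mean_cos = np.mean(np.sum(unit(pred) * unit(target), axis=1))

# Plugging the reported Gemma numbers into the same formula
reported_fve = 1.0 - 0.0130 / 0.0579
```

With the reported NMSE and released variance, `reported_fve` rounds to the headline 0.775.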

§1

Corroboration

| metric | ours (UFW held-out) | released |
|---|---|---|
| Gemma-27B-L41 fve_nrm | 0.775 (NMSE 0.0130, cos 0.9935) | 0.763 (train, 50/50 WildChat+UFW) |
| Qwen2.5-7B-L20 fve_nrm | 0.752 | 0.752 (exact) |
| Var(v_nrm), Gemma L41 | 0.0575 | 0.0579 |

Deterministic: re-running the greedy pipeline reproduced 0.775 bit-exact. Slightly-above for Gemma is expected — pure UFW prose at position ≥50 is the clean end of their train mix.

Finding: Both released numbers reproduce on held-out data; the variance baselines agree to within 0.0004 (0.0575 vs 0.0579), itself a pipeline sanity check.
§2

Quote ablation at generation time

| condition | NMSE | cos | FVE |
|---|---|---|---|
| unconstrained | 0.013 | 0.9935 | 0.775 |
| ban "-family (975 tokens) | 0.0141 | 0.993 | 0.757 |
| ban "+' families (1651 tokens) | 0.0222 | 0.9889 | 0.616 |
| └ hatch-free subset (n=789) | 0.023 | 0.9885 | 0.603 |
| └ hatch-using subset (n=209) | 0.0195 | 0.9903 | 0.664 |
Finding: The policy is strikingly persistent about quoting. With double quotes banned, 98% of decodes switch to single quotes and FVE barely moves (0.775 → 0.757). With both families banned, 21% reach for escape hatches; true quote-free paraphrase (the hatch-free subset) costs ~0.17 FVE (0.775 → 0.603), real but minority damage. Fluency held (2/1000 tag failures, 0 banned-char leaks).
Fig. 1 The escape hatches: delimiter substitutes among the 1000 both-families-banned decodes. Counts: corner brackets 「」 77 · italic *spans* 92 · paren-verbatim 43 · em-dashes 6. The AV smuggles its quoting format past the logit ban through whatever typography survives.
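A generation-time ban of this kind can be sketched as a vocabulary scan plus a logit mask. This is an illustration under assumptions, not the pipeline used here: the toy vocabulary and the exact character sets for each "family" are hypothetical, and the real counts (975 and 1651 tokens) depend on the Gemma tokenizer.

```python
import numpy as np

def banned_ids(vocab, banned_chars):
    """IDs of every token whose text contains any banned character."""
    return sorted(i for tok, i in vocab.items() if any(c in tok for c in banned_chars))

def mask_logits(logits, ids):
    """Copy of logits with the banned token IDs set to -inf."""
    out = np.array(logits, dtype=float)
    out[ids] = -np.inf
    return out

# Hypothetical toy vocabulary (token string -> id)
vocab = {'the': 0, ' "': 1, '."': 2, " '": 3, "it's": 4, "「": 5, "*": 6}

DOUBLE = '"\u201c\u201d'            # straight and curly double quotes (assumed family)
SINGLE = "'\u2018\u2019"            # straight and curly single quotes (assumed family)

double_family = banned_ids(vocab, DOUBLE)          # [1, 2]
both_families = banned_ids(vocab, DOUBLE + SINGLE)  # [1, 2, 3, 4]
masked = mask_logits(np.zeros(len(vocab)), both_families)
```

Note that the corner bracket and asterisk tokens stay unmasked in this toy, which is the kind of surviving typography the escape hatches rely on.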
§3

The quotes were never verbatim

| decode set | median run | p90 | ≥6-word run | null ≥6 |
|---|---|---|---|---|
| quoting freely | 0 | 4 | 2% | 0% |
| quote-banned, hatch-free | 0 | 4 | 1% | 0% |

Longest contiguous word run shared between explanation and the actual source context, vs a shuffled-context null.
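The run statistic can be sketched as a longest-common-substring search over word tokens, with a shuffled pairing as the null. The tokenization (lowercase, whitespace split, punctuation left attached) and the function names are assumptions for illustration; the original analysis may normalize text differently.

```python
import random

def longest_shared_run(explanation, context):
    """Length in words of the longest contiguous run common to both texts."""
    a, b = explanation.lower().split(), context.lower().split()
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the run ending at (i-1, j-1)
                best = max(best, cur[j])
        prev = cur
    return best

def run_stats(explanations, contexts, min_run=6, seed=42):
    """Share of pairs with a run >= min_run, for true and shuffled pairings."""
    true = [longest_shared_run(e, c) for e, c in zip(explanations, contexts)]
    shuffled = list(contexts)
    random.Random(seed).shuffle(shuffled)
    null = [longest_shared_run(e, c) for e, c in zip(explanations, shuffled)]

    def frac(runs):
        return sum(r >= min_run for r in runs) / len(runs)

    return frac(true), frac(null)

demo = longest_shared_run("he said the cat sat on the mat today", "the cat sat on a mat")
```

In the demo the shared run is "the cat sat on", so `demo` is 4.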

Finding: The AV cannot copy: it never sees the text, only one 5376-dim vector. Every quoted span is a confabulated reconstruction; the quote marks are a stylistic container. The §2 FVE drop is loss of the trained format, not loss of a copy channel.
§4

Post-hoc redaction: where the signal lives

Same unconstrained decodes; the explanation text is edited before the AR sees it. Redaction stats: 9.4 spans/decode, 23% of characters.

Fig. 2 FVE after post-hoc surgery. Prose alone is ≈ predict-the-mean, and junkier replacement tokens actively misdirect the AR (monotone −0.09 → −0.76). Quote marks alone are worth 0.31 FVE (0.506 vs 0.196 on identical text). All edited rows are AR-OOD, hence lower bounds.
Finding: The halves are entangled: spans alone 0.506, prose alone ≈ 0, together 0.775. The prose is relational glue: it tells the AR what role each span plays. When quoting is blocked at generation time, the AV relocates payload into prose (0.616); when spans are deleted post-hoc, the prose that was written alongside quotes can’t compensate. Using delimiters as type markers is remarkably brittle for something billed as natural language.
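The post-hoc edits can be sketched with a single regular expression. This is a minimal illustration, assuming spans are delimited by straight or curly double quotes only; the delimiter set, the filler strings, and the function names are hypothetical and not taken from the original pipeline.

```python
import re

# Assumed delimiter set: straight and curly double quotes
SPAN = re.compile('["“]([^"“”]*)["”]')

def prose_only(text, filler=""):
    """Replace each quoted span (marks included) with a filler string."""
    return SPAN.sub(filler, text)

def spans_only(text, keep_marks=True):
    """Keep only the quoted spans, with or without their quote marks."""
    spans = SPAN.findall(text)
    return " ".join(f'"{s}"' if keep_marks else s for s in spans)

def redaction_stats(text):
    """Number of quoted spans and the share of characters they cover."""
    matches = list(SPAN.finditer(text))
    covered = sum(m.end() - m.start() for m in matches)
    return len(matches), covered / max(len(text), 1)

example = 'A recipe intro, likely "preheat the oven" followed by "350 degrees".'
with_marks = spans_only(example)
without_marks = spans_only(example, keep_marks=False)
n_spans, char_share = redaction_stats(example)
```

The `keep_marks` switch mirrors the comparison of identical span text with and without its quote marks, and `filler` is where a placeholder or junk token would go.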
§5

Best-of-N under the AR

| selection | NMSE | FVE |
|---|---|---|
| greedy | 0.013 | 0.775 |
| 1 sample (temp 1.0) | 0.0132 | 0.772 |
| best of 2 / 3 / 5 | 0.0125 / 0.0122 / 0.0120 | 0.784 / 0.789 / 0.794 |
| best of 10 | 0.0116 | 0.799 |

10 temp=1.0 samples per activation; 0 tag failures in 10k samples. Gains are log-flat, ~+0.01 per doubling.

Finding: The GRPO policy has converged: mean single sample ≈ greedy, and ten-way selection buys +0.024 over greedy (+0.027 over a single sample). The residual ~0.20 FVE is structural, not stochastic: the released numbers sit near the system’s ceiling, and more AV test-time compute is the wrong axis.
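Best-of-N selection of this kind can be sketched as a per-activation minimum over sampled errors. The array layout and function names are assumptions; note that selecting by reconstruction error requires the gold activation, so this is an oracle ceiling rather than a deployable decoding rule. The second half plugs the reported NMSE values into FVE = 1 − NMSE/Var with the released Var of 0.0579.

```python
import numpy as np

def best_of_n_nmse(sq_err, n):
    """Mean error when, per activation, the best of the first n samples is kept.

    sq_err has shape (num_activations, num_samples); each entry is the squared
    reconstruction error of one sampled explanation after the AR decodes it.
    """
    return float(np.mean(np.min(sq_err[:, :n], axis=1)))

def fve_from_nmse(nmse, var):
    return 1.0 - nmse / var

# Tiny hand-checkable example: 2 activations, 3 samples each
toy = np.array([[3.0, 1.0, 2.0],
                [5.0, 6.0, 4.0]])
toy_curve = [best_of_n_nmse(toy, n) for n in (1, 2, 3)]  # [4.0, 3.0, 2.5]

# Reported NMSE per N, recomputed into FVE
reported_nmse = {1: 0.0132, 2: 0.0125, 3: 0.0122, 5: 0.0120, 10: 0.0116}
recomputed = {n: fve_from_nmse(v, 0.0579) for n, v in reported_nmse.items()}
```

The recomputed values agree with the table to about ±0.001, the slack expected from NMSE being reported to four decimals.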
§6

Adjacent-layer transfer

| input | cos to L41 | NMSE | FVE (own denom) |
|---|---|---|---|
| L41 (trained) | | 0.013 | 0.774 |
| L40 | 0.997 | 0.0167 | 0.695 |
| L42 | 0.9971 | 0.0187 | 0.703 |

Adjacent layers sit at cos 0.997; mutual NMSE ~0.006 — smaller than the NLA's own reconstruction error.

Finding: Observed cross-layer NMSE lands at or below the additive bound (0.013 intrinsic + 0.006 offset): the entire FVE drop is the geometry between layers, with no evidence the AV misreads neighboring-layer inputs at all. The brittleness lives in the text format (§4), not the vector input. One trained pair plausibly covers a band of nearby layers at ~0.07 FVE per 0.003 cos of divergence.
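The arithmetic behind the additive bound can be written out as follows. This is a sketch under two assumptions not stated in the source: activations are unit-normalized, and the reconstruction error is on average uncorrelated with the layer-to-layer offset, so the cross term drops out. Here $x$ is the L41 activation, $x'$ the neighboring-layer activation, and $r$ the reconstruction.

```latex
\begin{align*}
\|a-b\|^2 &= 2\,\bigl(1-\cos(a,b)\bigr) && \text{for unit vectors } a, b \\
\text{offset NMSE} &\approx 2\,(1-0.997) = 0.006 \\
\|r-x'\|^2 &= \|r-x\|^2 + \|x-x'\|^2 + 2\,\langle r-x,\; x-x'\rangle \\
\text{additive estimate} &= 0.013 + 0.006 = 0.019 && \text{(cross term assumed zero on average)} \\
\text{observed} &:\; 0.0167\ (\text{L40}),\quad 0.0187\ (\text{L42}) \;\le\; 0.019
\end{align*}
```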
§7

Cross-family transfer

Both models’ activations extracted at the same document positions (tokenizer alignment by character offset, mean error 0.3 chars). Each cell scored against the AR-side family’s own gold; shuffled-pair nulls shown per cell.

| | → AR qwen | → AR gemma |
|---|---|---|
| AV gemma | 0.592 (cross · null -0.89) | 0.774 (within · null -0.76) |
| AV qwen | 0.752 (within · null -0.91) | 0.528 (cross · null -0.66) |
Fig. 3 Transfer matrix, FVE with shuffled-pair nulls. A Gemma activation, verbalized by Gemma's AV, reconstructs Qwen's representation of the same context at 79% of Qwen's within-family ceiling (reverse: 68%). Deeply negative nulls prove the signal is position-specific, not corpus/style prior. The two layers' normalized variances differ 12× (Qwen 0.712 vs Gemma 0.0575) — FVE does real normalization work here; raw cos is incomparable across columns.
Finding: Direct quantitative support for convergent (“platonic”) representations, measured through an actual language bottleneck rather than CKA or probing. Asymmetry: the 27B AV’s explanations survive a foreign AR better than the 7B’s (79% vs 68% of the within-family ceiling), which is consistent with explanation quality scaling with AV capacity (one model pair, so suggestive only). Cross-family transfer lands in the same range as moving one layer in-model (0.59 vs 0.70).
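The shuffled-pair null can be sketched as the same FVE computed after permuting which reconstruction is scored against which gold activation. The definitions (unit-normalized rows, variance about the mean target) are assumptions inferred from the reported numbers, and the data is synthetic. The null goes well below zero because a mismatched reconstruction carries roughly the sum of two variances in error, which is worse than predicting the mean.

```python
import numpy as np

def unit(x):
    """L2-normalize each row."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def fve(pred, target):
    """FVE = 1 - NMSE / Var on unit-normalized rows (assumed definitions)."""
    p, t = unit(pred), unit(target)
    nmse = np.mean(np.sum((p - t) ** 2, axis=1))
    var = np.mean(np.sum((t - t.mean(axis=0)) ** 2, axis=1))
    return 1.0 - nmse / var

def shuffled_null(pred, target, seed=42):
    """FVE after breaking the pairing with a random permutation of pred rows."""
    perm = np.random.default_rng(seed).permutation(len(target))
    return fve(pred[perm], target)

# Synthetic stand-ins for AR reconstructions and gold activations
rng = np.random.default_rng(0)
gold = rng.normal(size=(1000, 64)) + 4.0
recon = gold + 0.5 * rng.normal(size=gold.shape)

paired = fve(recon, gold)
null = shuffled_null(recon, gold)
```

A paired score well above zero next to a strongly negative null is the pattern that separates position-specific signal from a corpus or style prior.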
§7b

Ridge baseline: text beats a linear adapter

| n_train | ridge gemma→qwen | ridge qwen→gemma |
|---|---|---|
| 1000 | 0.095 | 0.085 |
| 5000 | 0.300 | 0.172 |
| 19995 | 0.422 | 0.240 |
| text route (zero-shot) | 0.592 | 0.528 |

Kernel/ridge regression on unit-normalized activations, 5-fold CV / held-out eval, same FVE denominators.

Finding: The NL bottleneck, with no cross-model training, beats a ridge map fit on 20k paired examples, and the ridge curve is flattening. Language is a surprisingly competitive interlingua for residual streams: the AV/AR pair transfers structure a linear map can’t capture at this data scale.
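A linear adapter baseline of this kind can be sketched with closed-form ridge regression. This is a minimal illustration, not the configuration used here: the regularization strength `lam`, the bias column, and the synthetic data (linear by construction, so it only exercises the plumbing) are all assumptions.

```python
import numpy as np

def fit_ridge(X, Y, lam):
    """Closed-form ridge map from source to target activations."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    A = Xb.T @ Xb + lam * np.eye(Xb.shape[1])  # penalty also applied to the bias
    return np.linalg.solve(A, Xb.T @ Y)

def apply_ridge(W, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ W

def fve(pred, target):
    """1 - residual / variance of the target about its own mean."""
    resid = np.mean(np.sum((pred - target) ** 2, axis=1))
    var = np.mean(np.sum((target - target.mean(axis=0)) ** 2, axis=1))
    return 1.0 - resid / var

# Synthetic paired activations: unit-normalized source, noisy linear target
rng = np.random.default_rng(42)
src = rng.normal(size=(2000, 32)) + 2.0
src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt = src @ rng.normal(size=(32, 16)) + 0.05 * rng.normal(size=(2000, 16))

W = fit_ridge(src[:1500], tgt[:1500], lam=1e-2)
held_out_fve = fve(apply_ridge(W, src[1500:]), tgt[1500:])
```

Sweeping the training slice (for example 1000, 5000, all pairs) and re-scoring on the same held-out rows gives a data-scaling curve comparable in shape to the table above.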
§7c

Self-drift: the AV is no longer the base model

A check the paper doesn’t report: how far did NLA training move each fine-tuned model’s own residual stream off the base model’s, on identical inputs at L41?

Fig. 4 NMSE between the base model's L41 stream and each fine-tuned model's, same inputs. AV: cos 0.975, NMSE 0.050; AR: cos 0.992, NMSE 0.015. The AV drifted ~3× more than the AR — and ~4× the NLA's whole reconstruction error (0.013, dashed).
Finding: Consistent with the training split: the AV receives the full RL gradient pressure while the AR’s supervised MSE objective keeps it anchored. Worth remembering when treating the AV as “the same model” as the base: at L41 it measurably isn’t, and its drift (0.050) is ~4× the NLA’s reconstruction error (0.013).
§8

Caveats

  1. Sampling. n=1000 from 200 docs, so activations cluster within documents and naive standard errors understate the uncertainty; FVE CIs ≈ ±0.01–0.015. Single corpus (UFW prose, positions ≥50); chat-distribution behavior untested here.
  2. Format confound. All post-hoc edits (§4) put the AR out of distribution; those numbers are lower bounds on information content, unresolvable without retraining ARs per format.
  3. What FVE measures. Round-trip FVE validates the system, not explanation faithfulness: the metric is dominated by the quoted reconstructions + format conventions. The prose claims humans read contribute relational glue (§4), but their semantic accuracy is not what FVE measures.
  4. Low-variance layer. Cosine is non-discriminative at Gemma L41; always read FVE.
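The clustering caveat can be handled with a document-level bootstrap, sketched below. This is an illustration under assumptions, not the procedure behind the quoted CIs: whole documents are resampled with replacement, the mean target is held fixed across resamples for simplicity, and the toy data and function name are hypothetical.

```python
import numpy as np

def cluster_bootstrap_fve(sq_err, sq_dev, doc_ids, n_boot=2000, seed=42):
    """Percentile 95% CI for FVE, resampling documents instead of activations.

    sq_err:  per-activation squared reconstruction error
    sq_dev:  per-activation squared distance of the target from the mean target
    doc_ids: document index of each activation
    """
    rng = np.random.default_rng(seed)
    docs = np.unique(doc_ids)
    # Per-document sums let each resample be scored without re-indexing rows
    err = np.array([sq_err[doc_ids == d].sum() for d in docs])
    dev = np.array([sq_dev[doc_ids == d].sum() for d in docs])
    draws = rng.integers(0, len(docs), size=(n_boot, len(docs)))
    fves = 1.0 - err[draws].sum(axis=1) / dev[draws].sum(axis=1)
    return np.percentile(fves, [2.5, 97.5])

# Toy data: 200 documents, 5 activations each, with a shared per-document effect
rng = np.random.default_rng(0)
doc_ids = np.repeat(np.arange(200), 5)
sq_dev = rng.gamma(2.0, 0.03, size=1000)
doc_effect = rng.gamma(4.0, 0.25, size=200)[doc_ids]
sq_err = 0.225 * sq_dev * doc_effect

point = 1.0 - sq_err.sum() / sq_dev.sum()
lo, hi = cluster_bootstrap_fve(sq_err, sq_dev, doc_ids)
```

Resampling at the document level keeps correlated activations together, so the interval reflects the effective sample size rather than the raw count of 1000.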