Natural Language Autoencoders · follow-up series

Corroboration series — the verdict edition

The scored assessment of Anthropic's NLA release: what reproduced, what the system actually is, and the one number that summarizes it.

date: 2026-06-10/11
hardware: 2×H200 141GB
subject: nla-gemma3-27b-L41-{av,ar} · qwen2.5-7b-L20
eval: 1000 acts · UFW en held-out · seed 42 · FVE

0.775

Gemma FVE (ours)

released 0.763 — reproduced

0.752

Qwen FVE (ours)

released 0.752 — exact match

0.799

best-of-10 ceiling

+0.024 over greedy: converged

0.592

cross-family FVE

beats 20k-pair ridge (0.422)

Verdict: 8.5/10

The release is solid: both published FVE numbers reproduce on held-out data to within noise (one exactly), the eval protocol is internally consistent, and the policy is converged (no cherry-picking headroom — best-of-10 adds 0.024). The most interesting things we found are not in the paper’s headline: the explanations are confabulated reconstructions wearing quotation syntax; the AR reads them as a semi-formal code (delimiters as type tags, prose as relational structure) rather than as free natural language; the input side is robust (adjacent layers transfer at geometric cost only) while the text side is brittle; and the text bottleneck transfers across model families well enough to beat a 20k-pair linear adapter — the strongest evidence in this series that what the NLA extracts is real, model-general semantic content.

Everything below is the evidence file for that paragraph, in compressed form. Full protocol: 1000 activations from Ultra-FineWeb en docs 100000–102000 (held out; training used 0–100k), repo stage-0 sampling, seed 42, extraction at decoder block 41 = HF hidden_states[42]. FVE = 1 − NMSE/Var. One framing caveat governs every table: Gemma L41 is a low-variance layer (Var 0.0579), so cosine compresses into 0.95–0.99 for everything from perfect to garbage — read FVE only.
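The FVE definition and the cosine caveat can be made concrete with a small NumPy sketch. It assumes activations are unit-normalized row-wise and that NMSE is the mean squared Euclidean distance between prediction and gold; that reading is an assumption, though it is consistent with the reported pairs (for unit vectors, 2 − 2·0.9935 = 0.013). The synthetic data and the helper names `unit` and `scores` are illustrative, not the release's code.

```python
import numpy as np

def unit(x):
    """Scale each row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def scores(pred, gold):
    """Return (nmse, mean cosine, fve); gold rows are assumed unit-normalized."""
    nmse = np.mean(np.sum((pred - gold) ** 2, axis=1))
    var = np.mean(np.sum((gold - gold.mean(axis=0)) ** 2, axis=1))
    cos = np.mean(np.sum(unit(pred) * gold, axis=1))
    return nmse, cos, 1.0 - nmse / var

rng = np.random.default_rng(42)
n, d = 1000, 64
mu = np.zeros(d)
mu[0] = 1.0
# Synthetic low-variance layer: every vector sits close to a shared mean direction.
gold = unit(mu + rng.normal(scale=np.sqrt(0.06 / d), size=(n, d)))

# "Garbage" predictor: the dataset mean, repeated for every input.
mean_pred = np.tile(gold.mean(axis=0), (n, 1))
nmse_mean, cos_mean, fve_mean = scores(mean_pred, gold)

# Reported Gemma numbers plugged into the same formula.
fve_reported = 1.0 - 0.0130 / 0.0579
```

On this toy data the mean predictor scores an FVE of zero while its cosine stays above 0.95, which is the compression the caveat describes.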

§1

Corroboration

reproduced ✓

metric	ours (UFW held-out)	released
Gemma-27B-L41 fve_nrm	0.775 (NMSE 0.0130, cos 0.9935)	0.763 (train, 50/50 WildChat+UFW)
Qwen2.5-7B-L20 fve_nrm	0.752	0.752 — exact
Var(v_nrm), Gemma L41	0.0575	0.0579

Deterministic: re-running the greedy pipeline reproduced 0.775 bit-exact. A slightly higher score for Gemma is expected: pure UFW prose at position ≥50 is the clean end of the released training mix.

Finding: Both released numbers reproduce on held-out data. The variance baselines agree to the third decimal (0.0575 vs 0.0579), which is itself a pipeline sanity check.

§2

Quote ablation at generation time

logit_bias −1e9

condition	NMSE	cos	FVE
unconstrained	0.013	0.9935	0.775
ban "-family (975 tokens)	0.0141	0.993	0.757
ban "+' families (1651 tokens)	0.0222	0.9889	0.616
└ hatch-free subset (n=789)	0.023	0.9885	0.603
└ hatch-using subset (n=209)	0.0195	0.9903	0.664

Finding: The policy is strikingly persistent about quoting. Ban double quotes and 98% of decodes switch to single quotes, with FVE barely moving (0.775 → 0.757). Ban both families and 21% reach for escape hatches; true quote-free paraphrase costs ~0.17 FVE (0.775 → 0.603), a real loss but a minority of the total. Fluency held (2/1000 tag failures, 0 banned-char leaks).

Escape hatches observed: corner brackets 「」 · italic *spans* · paren-verbatim · em-dashes

Fig. 1 — The escape hatches: delimiter substitutes among the 1000 both-families-banned decodes. The AV smuggles its quoting format past the logit ban through whatever typography survives.
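A minimal sketch of the generation-time ban, assuming it works by adding a large negative bias to every token whose surface string contains a character from the banned family. The exact membership of each family (975 and 1651 tokens in the run above) is not specified here, so the character sets and the toy vocabulary below are assumptions for illustration.

```python
import numpy as np

# Assumed family membership; the real run's character lists may differ.
DOUBLE_FAMILY = set('"“”„«»')
SINGLE_FAMILY = set("'‘’`")

def banned_ids(vocab, banned_chars):
    """Ids of every token whose surface string contains a banned character."""
    return [i for i, tok in enumerate(vocab) if banned_chars & set(tok)]

def apply_ban(logits, ids, bias=-1e9):
    """Add a large negative bias so banned tokens are never selected."""
    out = logits.copy()
    out[ids] += bias
    return out

# Toy vocabulary (hypothetical); a real run would iterate the tokenizer's vocab.
vocab = ["the", ' "', "said", "'s", "「", "*", '."', "cat"]
ids_double = banned_ids(vocab, DOUBLE_FAMILY)
ids_both = banned_ids(vocab, DOUBLE_FAMILY | SINGLE_FAMILY)

logits = np.zeros(len(vocab))
next_token = vocab[int(np.argmax(apply_ban(logits, ids_both)))]
survivors = [tok for i, tok in enumerate(vocab) if i not in ids_both]
```

Note that `「` and `*` remain in `survivors`: a character-level ban on quote families leaves other delimiters available, which is how the escape hatches get through.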

§3

The quotes were never verbatim

decode set	median run	p90	≥6-word run	null ≥6
quoting freely	0	4	2%	0%
quote-banned, hatch-free	0	4	1%	0%

Longest contiguous word run shared between explanation and the actual source context, vs a shuffled-context null.

Finding: The AV cannot copy, because it never sees the text, only one 5376-dim vector. Every quoted span is a confabulated reconstruction; the quote marks are a stylistic container. The §2 FVE drop is loss of the trained format, not loss of a copy channel.
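The overlap statistic in the table can be sketched as a longest-common-contiguous-run computation over words. The tokenization (lowercased `\w+` matches) and the null construction (pairing each explanation with a neighboring item's context, a simple stand-in for a shuffle) are assumptions, as are the example strings.

```python
import re

def words(text):
    """Lowercased word tokens with punctuation stripped."""
    return re.findall(r"\w+", text.lower())

def longest_shared_run(explanation, context):
    """Length, in words, of the longest contiguous word run present in both texts."""
    a, b = words(explanation), words(context)
    best = 0
    prev = [0] * (len(b) + 1)
    for wa in a:
        cur = [0] * (len(b) + 1)
        for j, wb in enumerate(b, start=1):
            if wa == wb:
                # Extend the run that ended at the previous word in both texts.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def mismatched_null(explanations, contexts, shift=1):
    """Score each explanation against the context of a different item."""
    n = len(contexts)
    return [longest_shared_run(e, contexts[(i + shift) % n])
            for i, e in enumerate(explanations)]

real = longest_shared_run('The text says "the cat sat on a mat"',
                          "yesterday the cat sat on the mat again")
```

In this toy pair the shared run is "the cat sat on", four words, which would fall below the ≥6-word threshold used in the table.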

§4

Post-hoc redaction: where the signal lives

AR input surgery

Same unconstrained decodes; the explanation text is edited before the AR sees it. Redaction stats: 9.4 spans/decode, 23% of characters.

Fig. 2 — FVE after post-hoc surgery. Prose alone is ≈ predict-the-mean, and junkier replacement tokens actively misdirect the AR (monotone −0.09 → −0.76). Quote marks alone are worth 0.31 FVE (0.506 vs 0.196 on identical text). All edited rows are AR-OOD, hence lower bounds.

Finding: The halves are entangled: spans alone 0.506, prose alone ≈ 0, together 0.775. The prose is relational glue that tells the AR what role each span plays. Blocked at generation time, the AV relocates payload into prose (0.616); deleted post-hoc, the prose written alongside quotes can’t compensate. Using delimiters as type markers is remarkably brittle for something billed as natural language.
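The surgery conditions can be sketched as simple text edits applied before the AR reads the explanation. This is an illustration only: it assumes spans are delimited by straight double quotes, and the filler token, function names, and example explanation are hypothetical rather than the edits used in the experiment.

```python
import re

# Assumed span syntax: straight double quotes around each reconstructed span.
SPAN = re.compile(r'"([^"]*)"')

def spans_only(text, keep_marks=True):
    """Keep just the quoted spans, with or without their quote marks."""
    spans = SPAN.findall(text)
    return " ".join(f'"{s}"' if keep_marks else s for s in spans)

def prose_only(text, filler="[...]"):
    """Replace every quoted span with a filler token, leaving the prose."""
    return SPAN.sub(filler, text)

def redaction_stats(text):
    """Return (number of spans, fraction of characters inside spans)."""
    matches = list(SPAN.finditer(text))
    chars = sum(m.end() - m.start() for m in matches)
    return len(matches), chars / max(len(text), 1)

# Hypothetical explanation text, not taken from the decodes.
example = 'A recipe intro mentioning "two cups" before listing "flour".'
with_marks = spans_only(example)
without_marks = spans_only(example, keep_marks=False)
prose = prose_only(example)
n_spans, frac_chars = redaction_stats(example)
```

Comparing AR reconstructions on `with_marks` versus `without_marks` corresponds to the "quote marks alone" contrast, since the two strings differ only in their delimiters.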

§5

Best-of-N under the AR

test-time selection

selection	NMSE	FVE
greedy	0.013	0.775
1 sample (temp 1.0)	0.0132	0.772
best of 2 / 3 / 5	0.0125 / 0.0122 / 0.0120	0.784 / 0.789 / 0.794
best of 10	0.0116	0.799

10 temp=1.0 samples per activation; 0 tag failures in 10k samples. Gains are at most logarithmic in N: +0.012 for the first doubling (1 → 2 samples) and +0.005 for the last (5 → 10).

Finding: The GRPO policy has converged: the mean single sample ≈ greedy, and ten-way selection buys only +0.024 over greedy (+0.027 over a single sample). The residual ~0.20 FVE is structural, not stochastic; the released numbers sit near the system’s ceiling, and more AV test-time compute is the wrong axis.
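Best-of-N selection under the AR amounts to keeping, per activation, the sample with the lowest reconstruction error. The sketch below assumes a precomputed matrix of per-sample NMSE values (activations × samples); the toy matrix is hypothetical, while the FVE figures are the ones reported in the table.

```python
import numpy as np

def best_of_n_nmse(sample_nmse, n):
    """Mean NMSE when the best of the first n samples is kept per activation."""
    return float(np.mean(np.min(sample_nmse[:, :n], axis=1)))

# Toy per-sample scores (hypothetical): 3 activations x 4 samples.
toy = np.array([[0.020, 0.010, 0.030, 0.015],
                [0.012, 0.014, 0.011, 0.013],
                [0.009, 0.009, 0.008, 0.010]])
toy_curve = {n: best_of_n_nmse(toy, n) for n in (1, 2, 4)}

# Reported FVE values from the table, used to check the headline deltas.
reported_fve = {"greedy": 0.775, 1: 0.772, 2: 0.784, 3: 0.789, 5: 0.794, 10: 0.799}
gain_vs_greedy = round(reported_fve[10] - reported_fve["greedy"], 3)
gain_vs_single = round(reported_fve[10] - reported_fve[1], 3)
first_doubling = round(reported_fve[2] - reported_fve[1], 3)
last_doubling = round(reported_fve[10] - reported_fve[5], 3)
```

Selection can only lower the mean NMSE as N grows, so the interesting quantity is how quickly the curve flattens, which the per-doubling differences make explicit.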

§6

Adjacent-layer transfer

L40 / L42 → L41 pair

input	cos to L41	NMSE	FVE (own denom)
L41 (trained)	—	0.013	0.774
L40	0.997	0.0167	0.695
L42	0.9971	0.0187	0.703

Adjacent layers sit at cos 0.997; mutual NMSE ~0.006 — smaller than the NLA's own reconstruction error.

Finding: Observed cross-layer NMSE (0.0167 and 0.0187) lands at or below the additive bound (0.013 intrinsic + 0.006 offset = 0.019), so the entire FVE drop is attributable to the geometry between layers, with no evidence that the AV misreads neighboring-layer inputs at all. The brittleness lives in the text format (§4), not the vector input. One trained pair plausibly covers a band of nearby layers at ~0.07 FVE per 0.003 cos of divergence.
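The additive-bound arithmetic can be checked directly. The sketch assumes unit-normalized vectors, for which the squared distance is 2 − 2·cos, and treats the NLA's intrinsic error and the layer offset as uncorrelated so that their squared errors add; both are assumptions about how the bound was formed.

```python
def unit_nmse_from_cos(cos):
    """Squared distance between two unit vectors with the given cosine."""
    return 2.0 - 2.0 * cos

intrinsic = 0.013                     # NLA reconstruction NMSE on L41 (reported)
offset = unit_nmse_from_cos(0.997)    # L40/L42 vs L41 geometry, about 0.006
bound = intrinsic + offset            # additive bound, about 0.019

observed = {"L40": 0.0167, "L42": 0.0187}
within_bound = {layer: nmse <= bound for layer, nmse in observed.items()}
```

Both observed values fall at or under the bound, which is the basis for attributing the FVE drop to inter-layer geometry alone.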

§7

Cross-family transfer

Gemma-27B-L41 × Qwen-7B-L20

Both models’ activations extracted at the same document positions (tokenizer alignment by character offset, mean error 0.3 chars). Each cell scored against the AR-side family’s own gold; shuffled-pair nulls shown per cell.
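One way to align positions across two tokenizers by character offset is to match each token boundary in one tokenization to the nearest boundary in the other. The sketch below is an assumption about the general approach, not the procedure used in this run; the offsets are made up for illustration.

```python
from bisect import bisect_left

def align_by_char_offset(ends_a, ends_b):
    """For each token end offset in A, return (index in B, offset error) of the nearest B token end."""
    pairs = []
    for end in ends_a:
        j = bisect_left(ends_b, end)
        # Candidates: the B boundary at or after `end`, and the one just before it.
        cands = [k for k in (j - 1, j) if 0 <= k < len(ends_b)]
        best = min(cands, key=lambda k: abs(ends_b[k] - end))
        pairs.append((best, abs(ends_b[best] - end)))
    return pairs

# Hypothetical end offsets (in characters) of the same text under two tokenizers.
ends_a = [3, 9, 14, 20]
ends_b = [3, 6, 9, 15, 20]  # must be sorted for bisect
pairs = align_by_char_offset(ends_a, ends_b)
mean_error = sum(err for _, err in pairs) / len(pairs)
```

The mean of the per-position errors plays the role of the "mean error" figure quoted above, here in characters on toy data.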

FVE	→ AR qwen	→ AR gemma
AV gemma	0.592 (cross · null -0.89)	0.774 (within · null -0.76)
AV qwen	0.752 (within · null -0.91)	0.528 (cross · null -0.66)

Fig. 3 — Transfer matrix, FVE with shuffled-pair nulls. A Gemma activation, verbalized by Gemma's AV, reconstructs Qwen's representation of the same context at 79% of Qwen's within-family ceiling (reverse: 68%). Deeply negative nulls prove the signal is position-specific, not corpus/style prior. The two layers' normalized variances differ 12× (Qwen 0.712 vs Gemma 0.0575) — FVE does real normalization work here; raw cos is incomparable across columns.

Finding: Direct quantitative support for convergent (“platonic”) representations, measured through an actual language bottleneck rather than CKA or probing. The transfer is asymmetric: the 27B AV’s explanations survive a foreign AR better than the 7B’s (0.592 vs 0.528), consistent with explanation quality scaling with AV capacity. Cross-family transfer lands in the same range as moving one layer in-model (FVE 0.59 vs 0.70).
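The shuffled-pair null and the ceiling ratios can be illustrated as follows. The synthetic data stands in for real activations, and the FVE helper assumes the same definition as the protocol paragraph (mean squared distance over total variance of the gold vectors); only the two ratios use reported numbers.

```python
import numpy as np

def unit(x):
    """Scale each row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def fve(pred, gold):
    nmse = np.mean(np.sum((pred - gold) ** 2, axis=1))
    var = np.mean(np.sum((gold - gold.mean(axis=0)) ** 2, axis=1))
    return 1.0 - nmse / var

def shuffled_null_fve(pred, gold, seed=42):
    """FVE after breaking the pairing between predictions and gold targets."""
    perm = np.random.default_rng(seed).permutation(len(gold))
    return fve(pred[perm], gold)

rng = np.random.default_rng(0)
n, d = 1000, 64
mu = np.zeros(d)
mu[0] = 1.0
gold = unit(mu + rng.normal(scale=0.1, size=(n, d)))
# Predictions that carry position-specific signal: gold plus small noise.
pred = unit(gold + rng.normal(scale=0.03, size=(n, d)))

paired, null = fve(pred, gold), shuffled_null_fve(pred, gold)

# Share of the within-family ceiling retained across families (reported FVEs).
gemma_to_qwen = 0.592 / 0.752
qwen_to_gemma = 0.528 / 0.774
```

When predictions are as dispersed as the targets, a mismatched pairing roughly doubles the error relative to the variance, so a null near −1 is what position-specific signal should produce; a predictor that only captured a corpus prior would sit near zero instead.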

§7b

Ridge baseline: text beats a linear adapter

up to ~20k pairs

n_train	ridge gemma→qwen	ridge qwen→gemma
1000	0.095	0.085
5000	0.300	0.172
19995	0.422	0.240
text route (zero-shot)	0.592	0.528

Kernel/ridge regression on unit-normalized activations, 5-fold CV / held-out eval, same FVE denominators.

Finding: The NL bottleneck, with no cross-model training, beats a ridge map fit on 20k paired examples, and the ridge curve is flattening. Language is a surprisingly competitive interlingua for residual streams: the AV/AR pair transfers structure a linear map can’t capture at this data scale.
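For reference, a linear adapter baseline of this kind can be sketched with closed-form ridge regression. This is a simplified stand-in: it uses synthetic data, a single train/test split instead of 5-fold CV, no unit normalization, and an arbitrary regularization strength `lam`, all of which are assumptions rather than the settings used above.

```python
import numpy as np

def fit_ridge(X, Y, lam=1.0):
    """Closed-form ridge map with intercept: minimizes ||XW + b - Y||^2 + lam * ||W||^2."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    W = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ Yc)
    return W, y_mean - x_mean @ W

def fve(pred, gold):
    nmse = np.mean(np.sum((pred - gold) ** 2, axis=1))
    var = np.mean(np.sum((gold - gold.mean(axis=0)) ** 2, axis=1))
    return 1.0 - nmse / var

rng = np.random.default_rng(42)
n, d_src, d_tgt = 2000, 32, 16
X = rng.normal(size=(n, d_src))
# Synthetic target: a linear map plus noise, so ridge should do well here.
true_map = rng.normal(size=(d_src, d_tgt)) / np.sqrt(d_src)
Y = X @ true_map + 0.1 * rng.normal(size=(n, d_tgt))

train, test = slice(0, 1500), slice(1500, None)
W, b = fit_ridge(X[train], Y[train])
heldout_fve = fve(X[test] @ W + b, Y[test])
```

On data that really is linear the adapter approaches FVE 1, which is the contrast that makes the 0.422 plateau on real activation pairs informative.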

§7c

Self-drift: the AV is no longer the base model

fine-tuned vs base, L41

A check the paper doesn’t report: how far did NLA training move each fine-tuned model’s own residual stream off the base model’s, on identical inputs at L41?

Fig. 4 — NMSE between the base model's L41 stream and each fine-tuned model's, same inputs. AV: cos 0.975, NMSE 0.050; AR: cos 0.992, NMSE 0.015. The AV drifted ~3× more than the AR — and ~4× the NLA's whole reconstruction error (0.013, dashed).

Finding: This is consistent with the training split: the AV receives the full RL gradient pressure while the AR’s supervised MSE objective keeps it anchored. Worth remembering when treating the AV as “the same model” as the base: at L41 it measurably isn’t, and its drift (0.050) is ~4× the NLA’s reconstruction error (0.013).

§8

Caveats

cross-cutting

1. Sampling. n=1000 from 200 docs; because activations cluster within documents, naive (i.i.d.) standard errors understate the uncertainty, and FVE CIs are ≈ ±0.01–0.015. Single corpus (UFW prose, positions ≥50); chat-distribution behavior untested here.
2. Format confound. All post-hoc edits (§4) put the AR out of distribution; those numbers are lower bounds on information content, unresolvable without retraining ARs per format.
3. What FVE measures. Round-trip FVE validates the system, not explanation faithfulness: the metric is dominated by the quoted reconstructions and format conventions. The prose claims humans read contribute relational glue (§4), but their semantic accuracy is not what FVE measures.
4. Low-variance layer. Cosine is non-discriminative at Gemma L41; always read FVE.
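The sampling caveat can be made concrete with a cluster bootstrap that resamples whole documents instead of individual activations. The sketch uses synthetic per-activation NMSE values with a shared per-document effect and holds the variance denominator fixed at 0.0579; the data, effect sizes, and function name are assumptions for illustration.

```python
import numpy as np

def cluster_bootstrap_ci(nmse, doc_ids, var, n_boot=2000, seed=42):
    """95% CI for FVE, resampling whole documents rather than single activations."""
    rng = np.random.default_rng(seed)
    docs = np.unique(doc_ids)
    # Per-document sums and counts let each resample be scored cheaply.
    sums = np.array([nmse[doc_ids == d].sum() for d in docs])
    counts = np.array([(doc_ids == d).sum() for d in docs])
    draws = rng.integers(0, len(docs), size=(n_boot, len(docs)))
    fves = 1.0 - (sums[draws].sum(axis=1) / counts[draws].sum(axis=1)) / var
    return tuple(np.percentile(fves, [2.5, 97.5]))

# Synthetic stand-in: 200 documents x 5 positions with a shared per-document effect.
rng = np.random.default_rng(0)
doc_ids = np.repeat(np.arange(200), 5)
doc_effect = rng.normal(scale=0.004, size=200)[doc_ids]
nmse = np.clip(0.013 + doc_effect + rng.normal(scale=0.003, size=1000), 0, None)

point = 1.0 - nmse.mean() / 0.0579
lo, hi = cluster_bootstrap_ci(nmse, doc_ids, var=0.0579)
```

Resampling at the document level keeps within-document correlation intact in every replicate, so the resulting interval is wider than one computed as if all 1000 activations were independent.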