The scored assessment of Anthropic's NLA release: what reproduced, what the system actually is, and the one number that summarizes it.
date: 2026-06-10/11
hardware: 2×H200 141GB
subject: nla-gemma3-27b-L41-{av,ar} · qwen2.5-7b-L20
eval: 1000 acts · UFW en held-out · seed 42 · FVE
Gemma FVE (ours): 0.775 (released 0.763; reproduced)
Qwen FVE (ours): 0.752 (released 0.752; exact match)
best-of-10 ceiling: 0.799 (+0.024 over greedy; converged)
cross-family FVE: 0.592 (beats 20k-pair ridge, 0.422)
Verdict: 8.5/10
The release is solid: both published FVE numbers reproduce on held-out data to within noise (one exactly), the eval protocol is internally consistent, and the policy is converged (no cherry-picking headroom — best-of-10 adds 0.024). The most interesting things we found are not in the paper’s headline: the explanations are confabulated reconstructions wearing quotation syntax; the AR reads them as a semi-formal code (delimiters as type tags, prose as relational structure) rather than as free natural language; the input side is robust (adjacent layers transfer at geometric cost only) while the text side is brittle; and the text bottleneck transfers across model families well enough to beat a 20k-pair linear adapter — the strongest evidence in this series that what the NLA extracts is real, model-general semantic content.
Everything below is the evidence file for that paragraph, in compressed form. Full protocol: 1000 activations from Ultra-FineWeb en docs 100000–102000 (held out; training used 0–100k), repo stage-0 sampling, seed 42, extraction at decoder block 41 = HF hidden_states[42]. FVE = 1 − NMSE/Var. One framing caveat governs every table: Gemma L41 is a low-variance layer (Var 0.0579), so cosine compresses into 0.95–0.99 for everything from perfect to garbage — read FVE only.
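To make the metric and the cosine caveat concrete, here is a minimal sketch of FVE = 1 − NMSE/Var. It assumes NMSE is the mean squared distance between unit-normalized gold vectors and their reconstructions, and Var is the mean squared distance of the gold vectors from their own mean; that reading is an assumption, though it is consistent with the reported pair (cos 0.9935, NMSE 0.0130), since for unit vectors the squared distance is 2(1 − cos). The function names are illustrative and are not taken from the release.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fve(gold, recon):
    """FVE = 1 - NMSE / Var over paired lists of vectors.

    Assumption: NMSE is the mean squared distance between each gold vector
    and its reconstruction; Var is the mean squared distance of the gold
    vectors from their own mean (the predict-the-mean error).
    """
    n, d = len(gold), len(gold[0])
    mean = [sum(v[j] for v in gold) / n for j in range(d)]
    nmse = sum(sq_dist(g, r) for g, r in zip(gold, recon)) / n
    var = sum(sq_dist(g, mean) for g in gold) / n
    return 1.0 - nmse / var

def fve_from_summary(nmse, var):
    """Same formula, applied to already-aggregated numbers."""
    return 1.0 - nmse / var

def nmse_from_cos(cos):
    """For unit vectors, ||a - b||^2 = 2 * (1 - cos(a, b))."""
    return 2.0 * (1.0 - cos)

print(round(fve_from_summary(0.0130, 0.0579), 3))  # 0.775
print(round(nmse_from_cos(0.9935), 4))              # 0.013
# Cosine at which a unit-norm reconstruction has NMSE == Var, i.e. FVE == 0:
print(round(1.0 - 0.0579 / 2.0, 3))                 # 0.971
```

Under these assumptions a reconstruction with zero FVE still sits near cos 0.97, which is why cosine cannot separate good from useless reconstructions at this layer.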
§1 Corroboration · reproduced ✓

| metric | ours (UFW held-out) | released |
|---|---|---|
| Gemma-27B-L41 fve_nrm | 0.775 (NMSE 0.0130, cos 0.9935) | 0.763 (train, 50/50 WildChat+UFW) |
| Qwen2.5-7B-L20 fve_nrm | 0.752 | 0.752 (exact) |
| Var(v_nrm), Gemma L41 | 0.0575 | 0.0579 |
Deterministic: re-running the greedy pipeline reproduced 0.775 bit-exact. A slightly higher score for Gemma is expected: pure UFW prose at position ≥50 is the clean end of their train mix.
Finding: Both released numbers reproduce on held-out data; the variance baselines agree to within 0.0004 (0.0575 vs 0.0579), itself a pipeline sanity check.
§2 Quote ablation at generation time · logit_bias −1e9

| condition | NMSE | cos | FVE |
|---|---|---|---|
| unconstrained | 0.013 | 0.9935 | 0.775 |
| ban "-family (975 tokens) | 0.0141 | 0.993 | 0.757 |
| ban "+' families (1651 tokens) | 0.0222 | 0.9889 | 0.616 |
| └ hatch-free subset (n=789) | 0.023 | 0.9885 | 0.603 |
| └ hatch-using subset (n=209) | 0.0195 | 0.9903 | 0.664 |
Finding: The policy is strikingly persistent about quoting. Ban double quotes and 98% of decodes switch to single quotes, with FVE barely moving (0.775 → 0.757). Ban both families and 21% reach for escape hatches; the true quote-free paraphrases (hatch-free subset, 0.603) cost ~0.17 FVE, which is real damage but leaves most of the signal intact. Fluency held (2/1000 tag failures, 0 banned-character leaks).
corner brackets 「」: 77
italic *spans*: 92
paren-verbatim: 43
em-dashes: 6
Fig. 1 — The escape hatches: delimiter substitutes among the 1000 both-families-banned decodes. The AV smuggles its quoting format past the logit ban through whatever typography survives.
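The generation-time ban can be sketched as follows: collect every vocabulary entry containing a banned character and give it a logit bias of −1e9, as named in the section header. This is an illustration under stated assumptions, not the pipeline used here: the character sets, the toy vocabulary, and the helper names are all hypothetical, and a real run would iterate over the tokenizer's full vocabulary.

```python
# Assumed character families; the exact sets used in the ablation are not given.
DOUBLE_QUOTES = {'"', "\u201c", "\u201d"}
SINGLE_QUOTES = {"'", "\u2018", "\u2019"}

def build_logit_bias(vocab, banned_chars, bias=-1e9):
    """Map every token id whose string contains a banned character to `bias`."""
    return {tok_id: bias
            for tok, tok_id in vocab.items()
            if any(ch in banned_chars for ch in tok)}

def apply_bias(logits, logit_bias):
    """Return a copy of `logits` with the per-token biases added."""
    out = list(logits)
    for tok_id, b in logit_bias.items():
        out[tok_id] += b
    return out

def greedy(logits):
    """Index of the highest logit."""
    return max(range(len(logits)), key=logits.__getitem__)

# Hypothetical six-token vocabulary and logits for one decoding step.
vocab = {"the": 0, ' "': 1, '."': 2, "'s": 3, "\u300c": 4, "*": 5}
logits = [0.1, 5.0, 0.2, 3.0, 2.0, 1.0]

ban_double = build_logit_bias(vocab, DOUBLE_QUOTES)
ban_both = build_logit_bias(vocab, DOUBLE_QUOTES | SINGLE_QUOTES)

print(greedy(apply_bias(logits, {})))          # double-quote token wins
print(greedy(apply_bias(logits, ban_double)))  # falls back to a single-quote token
print(greedy(apply_bias(logits, ban_both)))    # falls back to a corner bracket
```

In the toy example the decoder walks down the same ladder the table shows: double quotes, then single quotes, then whatever delimiter the ban list does not cover.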
§3 The quotes were never verbatim

| decode set | median run | p90 | ≥6-word run | null ≥6 |
|---|---|---|---|---|
| quoting freely | 0 | 4 | 2% | 0% |
| quote-banned, hatch-free | 0 | 4 | 1% | 0% |
Longest contiguous word run shared between explanation and the actual source context, vs a shuffled-context null.
Finding: The AV cannot copy: it never sees the text, only one 5376-dim vector. Every quoted span is a confabulated reconstruction; the quote marks are a stylistic container. The §2 FVE drop is loss of the trained format, not loss of a copy channel.
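The overlap statistic in the table above can be sketched as a longest-common-substring computation over word tokens. The tokenization (lowercased `\w+` runs) and the null construction (pairing each explanation with the next example's context, a deterministic stand-in for a shuffle) are assumptions; the report only states that the null uses shuffled contexts.

```python
import re

def words(text):
    """Lowercased word tokens; punctuation and quote marks are dropped."""
    return re.findall(r"\w+", text.lower())

def longest_shared_run(explanation, context):
    """Length in words of the longest contiguous run present in both texts."""
    a, b = words(explanation), words(context)
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1   # extend the run ending at (i-1, j-1)
                best = max(best, cur[j])
        prev = cur
    return best

def run_lengths(explanations, contexts, shift=0):
    """shift=0 scores true pairs; shift=1 pairs each explanation with another
    example's context (assumed null)."""
    n = len(contexts)
    return [longest_shared_run(e, contexts[(i + shift) % n])
            for i, e in enumerate(explanations)]

# Hypothetical examples, not taken from the eval set.
explanations = ['He said "the cat sat on a mat"', 'Mentions "stock prices fell"']
contexts = ["Yesterday the cat sat quietly on the mat",
            "Bond yields rose sharply today"]
print(run_lengths(explanations, contexts))           # true pairs
print(run_lengths(explanations, contexts, shift=1))  # mismatched pairs
```

A median of 0 with a p90 of 4 on true pairs means most explanations share not even one word in sequence with the source, so the quoted spans read as paraphrase-level reconstructions.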
§4 Post-hoc redaction: where the signal lives · AR input surgery
Same unconstrained decodes; the explanation text is edited before the AR sees it. Redaction stats: 9.4 spans/decode, 23% of characters.
Fig. 2 — FVE after post-hoc surgery. Prose alone is ≈ predict-the-mean, and junkier replacement tokens actively misdirect the AR (monotone −0.09 → −0.76). Quote marks alone are worth 0.31 FVE (0.506 vs 0.196 on identical text). All edited rows are AR-OOD, hence lower bounds.
Finding: The halves are entangled: spans alone score 0.506, prose alone ≈ 0, together 0.775. The prose is relational glue: it tells the AR what role each span plays. When quotes are blocked at generation time, the AV relocates payload into prose (0.616); when spans are deleted post-hoc, the prose written alongside quotes cannot compensate. Using delimiters as type markers is remarkably brittle for something billed as natural language.
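The kinds of edits described in this section can be sketched with a single regular expression over quoted spans. This is a hedged illustration only: it assumes straight double quotes as the delimiter, and the variant names (`spans_only`, `prose_only`, `strip_marks`) are hypothetical labels, not the conditions plotted in Fig. 2.

```python
import re

# Assumption: spans are delimited by straight double quotes with no nesting.
QUOTED = re.compile(r'"([^"]*)"')

def spans_only(text):
    """Keep only the quoted spans, still wrapped in their quote marks."""
    return " ".join(f'"{s}"' for s in QUOTED.findall(text))

def prose_only(text, filler=""):
    """Replace each quoted span with `filler` and tidy the whitespace."""
    return re.sub(r"\s+", " ", QUOTED.sub(filler, text)).strip()

def strip_marks(text):
    """Keep the span text but remove the quote marks around it."""
    return QUOTED.sub(r"\1", text)

def redaction_stats(text):
    """Number of spans and the fraction of characters they cover (marks included)."""
    spans = QUOTED.findall(text)
    covered = sum(len(s) + 2 for s in spans)
    return len(spans), covered / max(len(text), 1)

# Hypothetical explanation text.
text = 'Ends with "the river" after a list like "a, b".'
print(spans_only(text))
print(prose_only(text))
print(prose_only(text, filler="[X]"))
print(strip_marks(text))
print(redaction_stats(text))
```

Each variant would then be passed to the AR in place of the original explanation; because the AR never saw such inputs in training, the resulting FVE is a lower bound on what the retained text carries.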
§5 Best-of-N under the AR · test-time selection

| selection | NMSE | FVE |
|---|---|---|
| greedy | 0.013 | 0.775 |
| 1 sample (temp 1.0) | 0.0132 | 0.772 |
| best of 2 | 0.0125 | 0.784 |
| best of 3 | 0.0122 | 0.789 |
| best of 5 | 0.0120 | 0.794 |
| best of 10 | 0.0116 | 0.799 |
10 temp=1.0 samples per activation; 0 tag failures in 10k samples. Gains shrink as N grows: +0.012 from 1 to 2 samples, +0.005 from 5 to 10, so about +0.01 or less per doubling.
Finding: The GRPO policy has converged: the mean single sample ≈ greedy, and ten-way selection buys only +0.024 over greedy (+0.027 over a single sample). The residual ~0.20 FVE is structural, not stochastic; the released numbers sit near the system’s ceiling, and more AV test-time compute is the wrong axis.
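A minimal sketch of the selection rule, assuming best-of-k means keeping, for each activation, the sample whose AR reconstruction has the lowest squared error against the gold vector (an oracle choice, which is why the result is a ceiling). The error matrix below is made up for illustration; only the summary FVE values in the last lines come from the table.

```python
import math

def best_of_k_fve(errors, k, var):
    """FVE when, per activation, the lowest-error sample among the first k is kept.

    errors[i][j] is the squared reconstruction error of sample j for activation i.
    """
    mean_best = sum(min(row[:k]) for row in errors) / len(errors)
    return 1.0 - mean_best / var

def gain_per_doubling(fve_1, fve_n, n):
    """Average FVE gain per doubling of the sample budget between 1 and n."""
    return (fve_n - fve_1) / math.log2(n)

# Hypothetical per-sample errors for two activations, three samples each.
errors = [[0.02, 0.01, 0.03],
          [0.01, 0.02, 0.005]]
for k in (1, 2, 3):
    print(k, round(best_of_k_fve(errors, k, var=0.05), 3))

# Reported numbers: single sample 0.772, greedy 0.775, best of 10 0.799.
print(round(0.799 - 0.775, 3))                        # gain over greedy
print(round(gain_per_doubling(0.772, 0.799, 10), 3))  # average gain per doubling
```

Because the minimum over more samples can only go down, best-of-k FVE is non-decreasing in k; the point of the table is how little it rises.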
§6 Adjacent-layer transfer · L40 / L42 → L41 pair

| input | cos to L41 | NMSE | FVE (own denom) |
|---|---|---|---|
| L41 (trained) | n/a | 0.013 | 0.774 |
| L40 | 0.997 | 0.0167 | 0.695 |
| L42 | 0.9971 | 0.0187 | 0.703 |
Adjacent layers sit at cos 0.997; mutual NMSE ~0.006 — smaller than the NLA's own reconstruction error.
Finding: Observed cross-layer NMSE lands at or below the additive bound (0.013 intrinsic + 0.006 offset = 0.019), so the entire FVE drop is accounted for by the geometry between layers; there is no evidence the AV misreads neighboring-layer inputs at all. The brittleness lives in the text format (§4), not the vector input. One trained pair plausibly covers a band of nearby layers at ~0.07 FVE per 0.003 cos of divergence.
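The additive bound is simple arithmetic, sketched below under two assumptions: activations are unit-normalized, so a cosine of c between layers corresponds to a squared offset of 2(1 − c), and the NLA's intrinsic error is uncorrelated with that offset, so the two squared errors add. The helper name is illustrative.

```python
def additive_bound(intrinsic_nmse, cos_to_trained):
    """Intrinsic reconstruction error plus the squared offset between layers.

    Assumes unit vectors (offset = 2 * (1 - cos)) and uncorrelated error terms.
    """
    return intrinsic_nmse + 2.0 * (1.0 - cos_to_trained)

bound = additive_bound(0.013, 0.997)
observed = {"L40": 0.0167, "L42": 0.0187}  # NMSE values from the table above

print(round(bound, 3))
for layer, nmse in observed.items():
    print(layer, nmse, "within bound" if nmse <= bound else "exceeds bound")
```

Both neighboring layers come in under the bound, which is what licenses reading the FVE drop as geometry rather than as the AV misreading its input.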
§7 Cross-family transfer · Gemma-27B-L41 × Qwen-7B-L20
Both models’ activations extracted at the same document positions (tokenizer alignment by character offset, mean error 0.3 chars). Each cell scored against the AR-side family’s own gold; shuffled-pair nulls shown per cell.
| | → AR qwen | → AR gemma |
|---|---|---|
| AV gemma | 0.592 (cross · null -0.89) | 0.774 (within · null -0.76) |
| AV qwen | 0.752 (within · null -0.91) | 0.528 (cross · null -0.66) |
Fig. 3 — Transfer matrix, FVE with shuffled-pair nulls. A Gemma activation, verbalized by Gemma's AV, reconstructs Qwen's representation of the same context at 79% of Qwen's within-family ceiling (reverse: 68%). Deeply negative nulls prove the signal is position-specific, not corpus/style prior. The two layers' normalized variances differ 12× (Qwen 0.712 vs Gemma 0.0575) — FVE does real normalization work here; raw cos is incomparable across columns.
Finding: Direct quantitative support for convergent (“platonic”) representations, measured through an actual language bottleneck rather than CKA/probing. Asymmetry: the 27B AV’s explanations survive a foreign AR better than the 7B’s (79% vs 68% of the within-family ceiling), consistent with explanation quality scaling with AV capacity. Cross-family transfer is of the same order of cost as moving one layer in-model (0.59 vs 0.70).
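Why a deeply negative null is informative can be seen in a small sketch. It assumes the shuffled-pair null scores each reconstruction against a different example's gold vector (here a rotation by one position stands in for the shuffle) and uses the same FVE definition as the rest of the report; the toy vectors are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fve(gold, recon):
    """1 - (mean squared error) / (mean squared distance of gold from its mean)."""
    n, d = len(gold), len(gold[0])
    mean = [sum(v[j] for v in gold) / n for j in range(d)]
    nmse = sum(sq_dist(g, r) for g, r in zip(gold, recon)) / n
    var = sum(sq_dist(g, mean) for g in gold) / n
    return 1.0 - nmse / var

def shuffled_null_fve(gold, recon, shift=1):
    """Score each gold vector against another example's reconstruction."""
    n = len(recon)
    mismatched = [recon[(i + shift) % n] for i in range(n)]
    return fve(gold, mismatched)

gold = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
specific = [list(v) for v in gold]      # position-specific reconstructions
prior_only = [[0.0, 0.0] for _ in gold]  # same "corpus mean" guess everywhere

print(fve(gold, specific), shuffled_null_fve(gold, specific))      # 1.0 -1.0
print(fve(gold, prior_only), shuffled_null_fve(gold, prior_only))  # 0.0 0.0

# Cross-family FVE as a share of the receiving family's within-family score.
print(round(0.592 / 0.752, 2), round(0.528 / 0.774, 2))            # 0.79 0.68
```

A reconstruction that only encodes a corpus-level prior scores about the same whether or not the pairs are shuffled, whereas a position-specific one is pushed toward −1 when mismatched; nulls between −0.66 and −0.91 are on the position-specific side of that contrast.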
§7b Ridge baseline: text beats a linear adapter · up to ~20k pairs

| n_train | ridge gemma→qwen | ridge qwen→gemma |
|---|---|---|
| 1000 | 0.095 | 0.085 |
| 5000 | 0.300 | 0.172 |
| 19995 | 0.422 | 0.240 |
| text route (zero-shot) | 0.592 | 0.528 |
Kernel/ridge regression on unit-normalized activations, 5-fold CV / held-out eval, same FVE denominators.
Finding: The NL bottleneck, with no cross-model training, beats a ridge map fit on 20k paired examples, and the ridge curve is flattening. Language is a surprisingly competitive interlingua for residual streams: the AV/AR pair transfers structure a linear map can’t capture at this data scale.
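For readers unfamiliar with the baseline, a linear adapter of this kind can be sketched as closed-form ridge regression, W = (XᵀX + λI)⁻¹XᵀY, mapping one model's activations to the other's. The version below is a small pure-Python illustration, not the code used for the table: it omits an intercept, the regularization strength is a placeholder, and at real dimensionality a linear-algebra library would be used instead of hand-written elimination.

```python
def ridge_fit(X, Y, lam):
    """Solve (X^T X + lam * I) W = X^T Y by Gauss-Jordan elimination.

    X: n rows of d source features; Y: n rows of k target features.
    Returns W as d rows of k weights. No intercept term (assumption).
    """
    d, k = len(X[0]), len(Y[0])
    A = [[sum(x[i] * x[j] for x in X) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    B = [[sum(x[i] * y[j] for x, y in zip(X, Y)) for j in range(k)]
         for i in range(d)]
    M = [A[i] + B[i] for i in range(d)]          # augmented matrix [A | B]
    for c in range(d):
        p = max(range(c, d), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(d):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[d:] for row in M]

def predict(X, W):
    """Apply the fitted map: each row x becomes x @ W."""
    return [[sum(xi * W[i][j] for i, xi in enumerate(x))
             for j in range(len(W[0]))] for x in X]

# Hypothetical paired data generated by a known linear map [[2, 0], [0, 3]].
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [[2.0, 0.0], [0.0, 3.0], [2.0, 3.0]]
W = ridge_fit(X, Y, lam=1e-9)
print([[round(w, 6) for w in row] for row in W])
print(predict([[2.0, 1.0]], W))
```

The fitted map would be scored on held-out pairs with the same FVE denominators as the text route, which is what makes the two rows of the table comparable.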
§7c Self-drift: the AV is no longer the base model · fine-tuned vs base, L41
A check the paper doesn’t report: how far did NLA training move each fine-tuned model’s own residual stream off the base model’s, on identical inputs at L41?
Fig. 4 — NMSE between the base model's L41 stream and each fine-tuned model's, same inputs. AV: cos 0.975, NMSE 0.050; AR: cos 0.992, NMSE 0.015. The AV drifted ~3× more than the AR — and ~4× the NLA's whole reconstruction error (0.013, dashed).
Finding: Consistent with the training split: the AV receives the full RL gradient pressure while the AR’s supervised MSE objective keeps it anchored. Worth remembering when treating the AV as “the same model” as the base: at L41 it measurably isn’t, and its drift (0.050) is ~4× the NLA’s reconstruction error (0.013).
§8 Caveats · cross-cutting
1. Sampling. n=1000 from 200 docs; within-document clustering makes naive standard errors too optimistic, and FVE CIs are ≈ ±0.01–0.015. Single corpus (UFW prose, positions ≥50); chat-distribution behavior untested here.
2. Format confound. All post-hoc edits (§4) put the AR out of distribution; those numbers are lower bounds on information content, unresolvable without retraining ARs per format.
3. What FVE measures. Round-trip FVE validates the system, not explanation faithfulness: the metric is dominated by the quoted reconstructions + format conventions. The prose claims humans read contribute relational glue (§4), but their semantic accuracy is not what FVE measures.
4. Low-variance layer. Cosine is non-discriminative at Gemma L41; always read FVE.
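The clustering point in caveat 1 can be made concrete with a document-level (cluster) bootstrap: resample whole documents rather than individual activations, then recompute FVE. The sketch below is one way to do this, not the procedure behind the quoted intervals; the function name, the percentile interval, and the choice to hold each activation's squared deviation from the full-sample mean fixed are all assumptions.

```python
import random

def cluster_bootstrap_fve(doc_ids, sq_err, sq_dev, n_boot=1000, seed=0, alpha=0.05):
    """Percentile CI for FVE, resampling documents with replacement.

    doc_ids[i]: document of activation i
    sq_err[i]:  squared reconstruction error of activation i
    sq_dev[i]:  squared distance of activation i from the gold mean
    """
    by_doc = {}
    for d, e, v in zip(doc_ids, sq_err, sq_dev):
        by_doc.setdefault(d, []).append((e, v))
    docs = list(by_doc)
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        num = den = 0.0
        for d in rng.choices(docs, k=len(docs)):   # whole documents at a time
            for e, v in by_doc[d]:
                num += e
                den += v
        stats.append(1.0 - num / den)
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi

# Hypothetical data: three documents, two activations each.
doc_ids = [0, 0, 1, 1, 2, 2]
sq_err = [0.010, 0.012, 0.020, 0.018, 0.011, 0.013]
sq_dev = [0.05, 0.06, 0.05, 0.07, 0.06, 0.05]
print(cluster_bootstrap_fve(doc_ids, sq_err, sq_dev))
```

Resampling at the document level keeps correlated activations together, so the interval widens to reflect the effective sample size of 200 documents rather than 1000 independent points.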