Anthropic's released NLA checkpoints, reproduced on held-out data and taken apart: where the reconstruction signal actually lives.
date
2026-06-10/11
hardware
2×H200 141GB
subject
nla-gemma3-27b-L41-{av,ar} · nla-qwen2.5-7b-L20
eval
1000 acts · UFW en 100k–100.2k · seed 42 · FVE
0.775
Gemma FVE (ours)
released 0.763
0.752
Qwen FVE (ours)
released 0.752 — exact
0.506
spans-only FVE
prose-only: −0.09
0.592
cross-family FVE
79% of within ceiling
§1
Corroboration of the released numbers
reproduced
An NLA is a pair of fine-tuned LMs: the AV (activation verbalizer) injects a residual-stream vector as a single token embedding and writes a natural-language explanation; the AR (activation reconstructor) maps the text back to a vector. Round-trip fidelity: FVE = 1 − NMSE/Var, where 0 is predict-the-mean and 1 is perfect.
metric
ours (UFW held-out)
released
Gemma-27B fve_nrm
0.775
0.763 (train, 50/50 WildChat+UFW)
Qwen-7B fve_nrm
0.752
0.752 — exact
Var(v_nrm), Gemma L41
0.0575
0.0579 (train)
95.8% of samples score FVE > 0.5; zero missing-tag decodes in 1000. Variance baselines matching to the third decimal is itself a pipeline sanity check.
Finding —Corroborated. Gemma slightly above the released figure is expected — clean mid-document UFW prose is the easy end of their training mix. Note: L41 is a low-variance layer, so cosine compresses to 0.95–0.99 for everything from perfect to garbage; FVE is the only discriminative metric here.
§2
Quote ablation at generation time
logit_bias −1e9
AV explanations lean heavily on quotation. Banning quote tokens at generation time (every vocab token containing a banned character) tests how much of the fidelity is the quoting strategy itself.
Fig. 1 — FVE by generation-time ban condition. Banning double quotes alone doesn't bind — 982/1000 decodes switch to single quotes. Banning both families forces genuine paraphrase at a ~0.17 FVE cost; 21% of decodes still smuggle delimiters via corner brackets, italics, or parentheses.
Finding —The policy is strikingly persistent about quoting: ban a delimiter and it substitutes another. Forced to pure paraphrase, the system retains ~78% of its FVE — quoting is a high-reward habit, not the load-bearing mechanism.
§3
The quotes are confabulated, not copied
Are the “quotes” verbatim reproductions of the source context? Measured: longest contiguous word n-gram shared between each explanation and its true source, against a shuffled-context null.
decode set
median run
p90
≥6-word runs
null ≥6
quoting freely
0
4
2%
0%
quote-banned, hatch-free
0
4
1%
0%
Finding —Even when quoting freely, quoted spans almost never match the real context beyond ~4 consecutive words. Necessarily so — the AV never sees text, only one 5376-dim vector. Every “quote” is a reconstruction generated from the activation, dressed in quotation marks. There is no copy channel; the §2 drop is format loss.
§4
Post-hoc redaction: where the signal lives
AR input surgery
The AV generates normally (fully in-distribution); the explanation is edited before the AR sees it. ~9.4 quote spans per decode, ~23% of characters.
Fig. 2 — FVE after post-hoc surgery on the explanation. Spans alone keep 0.506; identical span text without the quote marks drops to 0.196 — the delimiters alone are worth 0.31 FVE. Prose alone is at or below predict-the-mean, and junkier replacement tokens actively misdirect the AR. All edited rows are AR-OOD, hence lower bounds.
Finding —The encoding is entangled and format-keyed: spans ≈ payload, marks ≈ type annotations, prose ≈ relational glue. Blocked at generation time the AV re-packs payload into prose (0.616); yanked post-hoc, the remainder is useless. The FVE metric predominantly validates the quoted-span channel — the human-readable descriptive prose contributes only jointly.
§5
Best-of-N: the policy has converged
test-time selection
Ten temp-1.0 samples per activation, best-of-prefix-N selected by the same AR that defines the metric — the training reward used as a test-time verifier. Zero tag failures in 10,000 sampled decodes.
Fig. 3 — FVE vs N (log-ish x, selection over prefix-N samples). A single temp-1 sample ≈ greedy; ten-way selection buys +0.027 total at ~+0.01 per doubling, decaying. The residual ~0.20 FVE is structural, not stochastic.
Finding —GRPO (8 samples/prompt in training) already squeezed out the selection slack. More test-time sampling cannot push past ~0.80 — the released numbers sit close to this system’s ceiling, and the self-dealing concern is bounded by the same result.
§6
Adjacent-layer transfer
L40 / L42 → L41 NLA
input
cos to L41
NMSE
FVE (own denom)
L41 (trained)
—
0.013
0.774
L40
0.997
0.0167
0.695
L42
0.9971
0.0187
0.703
Finding —Observed error sits at or below the additive bound (intrinsic 0.013 + inter-layer offset 0.006): the entire FVE drop is explained by geometry between layers. The NLA’s input side is robust to 0.997-cos perturbations — its brittleness (§4) is in the text format, not the vector interface.
§7
Cross-family transfer
Gemma-27B ⇄ Qwen-7B
Both models’ activations extracted at the same document positions (cross-tokenizer alignment by character offset, mean error 0.3 chars). Each cell: row-AV’s explanation → column-AR, scored against the AR-side family’s own gold.
→ AR qwen
→ AR gemma
AV gemma
0.592within · null -0.89
0.774cross · null -0.76
AV qwen
0.752cross · null -0.91
0.528within · null -0.66
Fig. 4 — Transfer matrix, FVE per cell with shuffled-pair nulls. Cross-family transfer reaches 79% (gemma→qwen) and 68% (qwen→gemma) of the within-family ceilings; nulls at −0.66 to −0.91 confirm the signal is position-specific, not corpus style.
Finding —The NL-expressible content of residual streams is substantially model-agnostic across architecture, tokenizer, and a 4× scale gap. Asymmetry: the 27B AV’s explanations survive a foreign decoder better than the 7B’s — explanation quality scales with AV capacity. Cross-family via text costs about as much as moving one layer within the same model.
§8
Linear baseline: ridge + n-scaling
20k aligned pairs
Kernel ridge on unit-normalized vectors, both directions, trained on fresh disjoint docs (UFW 102k–106k), evaluated on the same 1000-pair set. The text route uses zero paired examples.
Fig. 5 — Ridge FVE vs training pairs (log x), against the zero-shot NLA text route (dashed). Per-doubling ridge gains decelerate; log-linear extrapolation puts the crossover at ~10⁵ pairs for gemma→qwen and ~10⁷ — effectively never — for qwen→gemma.
Finding —The zero-paired-data text route beats a 20k-pair supervised linear map in both directions. At equal or 20×-favorable data budgets, language beats linear alignment as a cross-model representation transporter.
§9
Synthesis
#
claim
1
The release is solid: both families' headline numbers reproduce on held-out data; zero silent failures.
2
An NLA explanation = confabulated context-reconstructions (payload) in format-keyed scaffolding the AR co-evolved to parse.
3
The system is at its ceiling (~0.80): residual error is structural, not sampling noise.
4
Robust where it matters (vector side), brittle where it doesn't (delimiter-keyed text parsing).
5
The text bottleneck is a strong cross-model interlingua: 68–79% of within-family fidelity, worth ≳10⁵ supervised pairs.
Honest scorecard of pre-registered guesses:
✓
Quote ban → 0.55–0.68— 0.616, but only after closing the single-quote escape hatch
✗
Spans are a copy-channel— no copy channel exists (§3)
✗
Spans-only ≈ 0.7+— 0.506 — prose is not free-riding
Cross-family 0.35–0.55— low — representations more convergent than guessed
✗
Ridge closes much of the gap— decisively behind at 20k pairs
Verdict
The release is solid; the most interesting findings are not in the paper’s headline. The explanations are confabulated reconstructions wearing quotation syntax, parsed by the AR as a semi-formal code — and the text bottleneck transfers across model families well enough to beat a 20k-pair linear adapter, the strongest single result of the series.