Natural Language Autoencoders · follow-up series

Corroboration & ablation series

Anthropic's released NLA checkpoints, reproduced on held-out data and taken apart: where the reconstruction signal actually lives.

date: 2026-06-10/11
hardware: 2×H200 141GB
subject: nla-gemma3-27b-L41-{av,ar} · nla-qwen2.5-7b-L20
eval: 1000 acts · UFW en 100k–100.2k · seed 42 · FVE

headline metric	value	note
Gemma FVE (ours)	0.775	released 0.763
Qwen FVE (ours)	0.752	released 0.752 (exact)
spans-only FVE	0.506	prose-only: −0.09
cross-family FVE	0.592	79% of within-family ceiling

§1

Corroboration of the released numbers

reproduced

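An NLA is a pair of fine-tuned LMs: the AV (activation verbalizer) injects a residual-stream vector as a single token embedding and writes a natural-language explanation; the AR (activation reconstructor) maps the text back to a vector. Round-trip fidelity is reported as the fraction of variance explained, FVE = 1 − NMSE/Var, where NMSE is the mean squared reconstruction error on normalized vectors and Var is the variance of those vectors about their mean (the error of predict-the-mean). FVE = 0 is predict-the-mean and 1 is perfect.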

metric	ours (UFW held-out)	released
Gemma-27B fve_nrm	0.775	0.763 (train, 50/50 WildChat+UFW)
Qwen-7B fve_nrm	0.752	0.752 — exact
Var(v_nrm), Gemma L41	0.0575	0.0579 (train)

95.8% of samples score FVE > 0.5; zero missing-tag decodes in 1000. Variance baselines matching to the third decimal is itself a pipeline sanity check.

Finding: Corroborated. Gemma landing slightly above the released figure is expected, since clean mid-document UFW prose is the easy end of the released training mix. Note that L41 is a low-variance layer, so cosine similarity compresses to 0.95–0.99 for everything from perfect to garbage; FVE is the only discriminative metric here.
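The FVE definition and the cosine-compression note in §1 can be checked on synthetic data. The sketch below assumes vectors are unit-normalized and that squared error and variance are summed over dimensions; that reading is consistent with the reported Var of about 0.0575 but is not confirmed by the release. The names `unit` and `fve` and the toy data are illustrative, not the pipeline's code.

```python
import numpy as np

def unit(x):
    """Scale each row (or a single vector) to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def fve(gold, pred):
    """FVE = 1 - NMSE / Var.

    Assumption: squared error is summed over dimensions and averaged
    over samples; Var is the same quantity for the predict-the-mean baseline.
    """
    nmse = np.mean(np.sum((gold - pred) ** 2, axis=-1))
    var = np.mean(np.sum((gold - gold.mean(axis=0)) ** 2, axis=-1))
    return 1.0 - nmse / var

# Toy low-variance "layer": one shared direction plus small per-sample noise.
rng = np.random.default_rng(0)
base = unit(rng.normal(size=64))
gold = unit(base + 0.03 * rng.normal(size=(500, 64)))

mean_pred = np.broadcast_to(gold.mean(axis=0), gold.shape)
fve_perfect = fve(gold, gold)       # perfect reconstruction
fve_mean = fve(gold, mean_pred)     # predict-the-mean baseline
# Mean cosine between each gold vector and the (normalized) mean prediction.
cos_mean = float(np.mean(np.sum(gold * unit(mean_pred), axis=-1)))

print(f"FVE perfect={fve_perfect:.3f}  FVE mean={fve_mean:.3f}  cos(mean)={cos_mean:.3f}")
```

On this toy data the predict-the-mean baseline scores FVE 0 while its cosine to the gold vectors stays high, which is the same effect that makes cosine uninformative on a low-variance layer.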

§2

Quote ablation at generation time

logit_bias −1e9

AV explanations lean heavily on quotation. Banning quote tokens at generation time (every vocab token containing a banned character) tests how much of the fidelity is the quoting strategy itself.

Fig. 1 — FVE by generation-time ban condition. Banning double quotes alone doesn't bind — 982/1000 decodes switch to single quotes. Banning both families forces genuine paraphrase at a ~0.17 FVE cost; 21% of decodes still smuggle delimiters via corner brackets, italics, or parentheses.

Finding: The policy is strikingly persistent about quoting. Ban one delimiter and it substitutes another. Forced to pure paraphrase, the system retains ~78% of its FVE; quoting is a high-reward habit, not the load-bearing mechanism.
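The ban rule described in this section (every vocab token containing a banned character) can be sketched as a logit-bias map. The toy vocabulary, the `build_quote_ban` helper, and the character sets are illustrative assumptions. A real tokenizer stores byte-level or prefixed pieces, so tokens would need decoding to text before the character test.

```python
# Character families to ban (assumed sets; the source does not list them).
DOUBLE_QUOTES = set('"\u201c\u201d')
SINGLE_QUOTES = set("'\u2018\u2019")

def build_quote_ban(vocab, banned_chars, bias=-1e9):
    """Map token id -> bias for every token whose text contains a banned character."""
    return {
        tid: bias
        for tok, tid in vocab.items()
        if any(ch in banned_chars for ch in tok)
    }

# Hypothetical toy vocabulary (token text -> id).
toy_vocab = {"the": 0, ' "': 1, "'s": 2, "\u201cquote": 3, "cat": 4, "\u300c": 5}

ban_double = build_quote_ban(toy_vocab, DOUBLE_QUOTES)
ban_both = build_quote_ban(toy_vocab, DOUBLE_QUOTES | SINGLE_QUOTES)
print(sorted(ban_double), sorted(ban_both))  # → [1, 3] [1, 2, 3]
```

Token 5 (a corner bracket) survives both bans. It is the kind of substitute delimiter that Fig. 1 reports in 21% of decodes.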

§3

The quotes are confabulated, not copied

Are the “quotes” verbatim reproductions of the source context? Measured: longest contiguous word n-gram shared between each explanation and its true source, against a shuffled-context null.

decode set	median longest run (words)	p90 longest run (words)	decodes with ≥6-word run	shuffled null with ≥6-word run
quoting freely	0	4	2%	0%
quote-banned, hatch-free	0	4	1%	0%

Finding: Even when quoting freely, quoted spans almost never match the real context beyond ~4 consecutive words. Necessarily so: the AV never sees text, only one 5376-dim vector. Every “quote” is a reconstruction generated from the activation, dressed in quotation marks. There is no copy channel; the §2 drop is format loss.
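The overlap measurement in this section can be sketched as a longest-common-substring search over word lists, with a mismatched pairing standing in for the shuffled-context null. Whitespace tokenization, lowercasing, and the names `longest_shared_run` and `run_stats` are assumptions; the source does not state how words were split or how contexts were shuffled.

```python
import statistics

def longest_shared_run(a_words, b_words):
    """Length of the longest contiguous word sequence present in both lists."""
    best = 0
    prev = [0] * (len(b_words) + 1)
    for wa in a_words:
        cur = [0] * (len(b_words) + 1)
        for j, wb in enumerate(b_words, start=1):
            if wa == wb:
                # Extend the run that ended at the previous word of each list.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def run_stats(explanations, contexts, shift=0):
    """Return (median longest run, share of pairs with a run of >= 6 words).

    shift=0 scores true pairs; shift>0 pairs each explanation with another
    document's context as a simple null.
    """
    n = len(explanations)
    runs = [
        longest_shared_run(
            explanations[i].lower().split(),
            contexts[(i + shift) % n].lower().split(),
        )
        for i in range(n)
    ]
    return statistics.median(runs), sum(r >= 6 for r in runs) / n

# Toy data, not taken from the experiments.
explanations = ["mentions the cat sat on the mat", "about rising interest rates in europe"]
contexts = ["yesterday the cat sat on a rug", "central banks and interest rates"]

true_stats = run_stats(explanations, contexts)
null_stats = run_stats(explanations, contexts, shift=1)
print(true_stats, null_stats)
```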

§4

Post-hoc redaction: where the signal lives

AR input surgery

The AV generates normally (fully in-distribution); the explanation is edited before the AR sees it. Explanations average ~9.4 quote spans per decode, covering ~23% of characters.

Fig. 2 — FVE after post-hoc surgery on the explanation. Spans alone keep 0.506; identical span text without the quote marks drops to 0.196 — the delimiters alone are worth 0.31 FVE. Prose alone is at or below predict-the-mean, and junkier replacement tokens actively misdirect the AR. All edited rows are AR-OOD, hence lower bounds.

Finding: The encoding is entangled and format-keyed: spans ≈ payload, marks ≈ type annotations, prose ≈ relational glue. Blocked at generation time, the AV re-packs payload into prose (0.616); yanked post-hoc, the remainder is useless. The FVE metric predominantly validates the quoted-span channel; the human-readable descriptive prose contributes only jointly with the spans.
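The three redaction conditions discussed in this section (spans only, spans without their marks, prose only) can be sketched with a regular expression. This is a minimal illustration that assumes straight double quotes as the only delimiter and a caller-chosen filler for removed spans. The actual delimiter set and replacement tokens used in the experiments are not given, and the function names are hypothetical.

```python
import re

# Assumption: a quote span is a run of text between two straight double quotes.
QUOTE_SPAN = re.compile(r'"[^"\n]*"')

def spans_only(text):
    """Keep the quoted spans, marks included, and drop the surrounding prose."""
    return " ".join(QUOTE_SPAN.findall(text))

def spans_unmarked(text):
    """Keep the span text but strip the quote marks."""
    return " ".join(span[1:-1] for span in QUOTE_SPAN.findall(text))

def prose_only(text, filler=""):
    """Replace every quoted span with a filler and tidy the whitespace."""
    return re.sub(r"\s+", " ", QUOTE_SPAN.sub(filler, text)).strip()

example = 'The text mentions "red apples" and then "green pears" in a list.'
print(spans_only(example))      # → "red apples" "green pears"
print(spans_unmarked(example))  # → red apples green pears
print(prose_only(example))      # → The text mentions and then in a list.
```

Each edited string would then be passed to the AR in place of the original explanation and scored with the same FVE metric.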

§5

Best-of-N: the policy has converged

test-time selection

Ten temp-1.0 samples per activation, best-of-prefix-N selected by the same AR that defines the metric — the training reward used as a test-time verifier. Zero tag failures in 10,000 sampled decodes.

Fig. 3 — FVE vs N (log-ish x, selection over prefix-N samples). A single temp-1 sample ≈ greedy; ten-way selection buys +0.027 total at ~+0.01 per doubling, decaying. The residual ~0.20 FVE is structural, not stochastic.

Finding: GRPO (8 samples/prompt in training) already squeezed out the selection slack. More test-time sampling cannot push past ~0.80, so the released numbers sit close to this system’s ceiling. The self-dealing concern (the AR both defines the metric and picks the winner) is bounded by the same result: even with that advantage, selection gains only +0.027.
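Best-of-prefix-N selection reduces to a running maximum over each activation's samples. The sketch below assumes a score matrix with one row per activation and one column per sample, already scored by the AR. The synthetic scores are placeholders and will not reproduce the reported curve; `best_of_prefix_curve` is an illustrative name.

```python
import numpy as np

def best_of_prefix_curve(scores):
    """scores[i, k] is the FVE of sample k for activation i.

    Returns the mean best-of-first-N score for N = 1..K.
    """
    running_best = np.maximum.accumulate(scores, axis=1)
    return running_best.mean(axis=0)

# Synthetic stand-in for 1000 activations x 10 temp-1.0 samples.
rng = np.random.default_rng(42)
toy_scores = 0.77 + 0.05 * rng.standard_normal((1000, 10))

curve = best_of_prefix_curve(toy_scores)
for n, value in enumerate(curve, start=1):
    print(f"N={n:2d}  mean best-of-N FVE={value:.3f}")
```

Because the selector and the metric are the same model, the curve is an upper envelope on what this verifier can buy. Independent Gaussian scores as used here overstate the gain relative to a converged policy.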

§6

Adjacent-layer transfer

L40 / L42 → L41 NLA

input layer	cosine to L41	NMSE	FVE (own-layer variance as denominator)
L41 (trained)	—	0.013	0.774
L40	0.997	0.0167	0.695
L42	0.9971	0.0187	0.703

Finding: Observed error sits at or below the additive bound of 0.019 (intrinsic 0.013 + inter-layer offset 0.006), so the entire FVE drop is explained by geometry between layers. The NLA’s input side is robust to 0.997-cos perturbations; its brittleness (§4) is in the text format, not the vector interface.
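The additive bound can be reproduced from the table. The sketch assumes unit-normalized vectors, for which the squared distance between two vectors with cosine c is 2(1 − c); that reading reproduces the quoted 0.006 offset but is an inference, not something the source states. Adding the two terms also assumes the intrinsic error and the layer offset are uncorrelated.

```python
def offset_from_cos(cos):
    """Squared distance between two unit vectors with the given cosine."""
    return 2.0 * (1.0 - cos)

intrinsic = 0.013  # NMSE of the L41 NLA on its own layer (table above)

# (cosine to L41, observed NMSE) for the adjacent layers, from the table.
rows = {"L40": (0.997, 0.0167), "L42": (0.9971, 0.0187)}

bounds = {name: intrinsic + offset_from_cos(cos) for name, (cos, _) in rows.items()}
for name, (cos, observed) in rows.items():
    print(
        f"{name}: offset={offset_from_cos(cos):.4f} "
        f"bound={bounds[name]:.4f} observed={observed:.4f}"
    )
```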

§7

Cross-family transfer

Gemma-27B ⇄ Qwen-7B

Both models’ activations extracted at the same document positions (cross-tokenizer alignment by character offset, mean error 0.3 chars). Each cell: row-AV’s explanation → column-AR, scored against the AR-side family’s own gold.
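Cross-tokenizer alignment by character offset can be sketched as a nearest-neighbour match between token end offsets. The offset lists and the `align_by_char_offset` helper are illustrative assumptions; the source only states that alignment used character offsets with a mean error of 0.3 chars.

```python
from bisect import bisect_left

def align_by_char_offset(ends_a, ends_b):
    """For each token end offset in A, find the nearest token end in B.

    Both lists must be sorted. Returns a list of (index in B, abs char error).
    """
    aligned = []
    for end in ends_a:
        j = bisect_left(ends_b, end)
        # The nearest offset is either just before or at the insertion point.
        candidates = [k for k in (j - 1, j) if 0 <= k < len(ends_b)]
        best = min(candidates, key=lambda k: abs(ends_b[k] - end))
        aligned.append((best, abs(ends_b[best] - end)))
    return aligned

# Toy character end offsets for the same document under two tokenizers.
ends_a = [3, 7, 12]
ends_b = [3, 8, 12, 15]

pairs = align_by_char_offset(ends_a, ends_b)
mean_error = sum(err for _, err in pairs) / len(pairs)
print(pairs, round(mean_error, 2))
```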

	→ AR qwen	→ AR gemma
AV gemma	0.592 (cross · null −0.89)	0.774 (within · null −0.76)
AV qwen	0.752 (within · null −0.91)	0.528 (cross · null −0.66)

Fig. 4 — Transfer matrix, FVE per cell with shuffled-pair nulls. Cross-family transfer reaches 79% (gemma→qwen) and 68% (qwen→gemma) of the within-family ceilings; nulls at −0.66 to −0.91 confirm the signal is position-specific, not corpus style.

Finding: The NL-expressible content of residual streams is substantially model-agnostic across architecture, tokenizer, and a 4× scale gap. The transfer is asymmetric: the 27B AV’s explanations survive a foreign decoder better than the 7B’s, which suggests explanation quality scales with AV capacity. Cross-family transfer via text costs 0.16–0.25 FVE against the within-family cells, roughly two to three times the 0.07–0.08 cost of moving one layer within the same model (§6).
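The percent-of-ceiling figures and the behaviour of a shuffled-pair null can be checked with a few lines. The cell values are taken from the transfer matrix; the null is demonstrated on synthetic vectors only, and the `fve` helper assumes squared error summed over dimensions, which the source does not confirm.

```python
import numpy as np

def fve(gold, pred):
    """FVE = 1 - NMSE / Var (assumed: error summed over dims, averaged over samples)."""
    nmse = np.mean(np.sum((gold - pred) ** 2, axis=-1))
    var = np.mean(np.sum((gold - gold.mean(axis=0)) ** 2, axis=-1))
    return 1.0 - nmse / var

# Within-family ceilings keyed by AR family, and the two cross-family cells.
within = {"qwen": 0.752, "gemma": 0.774}
cross = {"gemma->qwen": 0.592, "qwen->gemma": 0.528}
ratio = {
    "gemma->qwen": cross["gemma->qwen"] / within["qwen"],
    "qwen->gemma": cross["qwen->gemma"] / within["gemma"],
}
print({k: f"{100 * v:.0f}%" for k, v in ratio.items()})

# Shuffled-pair null on synthetic data: score each gold vector against the
# prediction that belongs to a different sample (here, its neighbour).
rng = np.random.default_rng(0)
gold = rng.normal(size=(1000, 32))
null_fve = fve(gold, np.roll(gold, 1, axis=0))
print(f"null FVE on independent vectors: {null_fve:.2f}")
```

For independent vectors the mismatched error is about twice the variance, so a pure mismatch null lands near −1. The reported nulls of −0.66 to −0.91 are in that neighbourhood.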

§8

Linear baseline: ridge + n-scaling

20k aligned pairs

Kernel ridge on unit-normalized vectors, both directions, trained on fresh disjoint docs (UFW 102k–106k), evaluated on the same 1000-pair set. The text route uses zero paired examples.

Fig. 5 — Ridge FVE vs training pairs (log x), against the zero-shot NLA text route (dashed). Per-doubling ridge gains decelerate; log-linear extrapolation puts the crossover at ~10⁵ pairs for gemma→qwen and ~10⁷ — effectively never — for qwen→gemma.

Finding: The zero-paired-data text route beats a 20k-pair supervised linear map in both directions. At equal or 20×-favorable data budgets, language beats linear alignment as a cross-model representation transporter.
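The ridge baseline and its n-scaling can be sketched on synthetic pairs. This example uses plain closed-form linear ridge, which gives the same predictions as kernel ridge with a linear kernel; the kernel, regularization strength, and preprocessing used in the experiments are not stated, and the toy vectors are not unit-normalized. `crossover_n` illustrates a log-linear extrapolation under the assumption that FVE grows linearly in log10(n). All names are hypothetical.

```python
import numpy as np

def fit_ridge(X, Y, lam=1.0):
    """Closed-form ridge: W = (X^T X + lam * I)^-1 X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def fve(gold, pred):
    """FVE = 1 - NMSE / Var (assumed: error summed over dims, averaged over samples)."""
    nmse = np.mean(np.sum((gold - pred) ** 2, axis=-1))
    var = np.mean(np.sum((gold - gold.mean(axis=0)) ** 2, axis=-1))
    return 1.0 - nmse / var

def crossover_n(ns, fves, target):
    """Fit FVE = a * log10(n) + b and solve for the n that reaches target."""
    a, b = np.polyfit(np.log10(ns), fves, deg=1)
    return 10 ** ((target - b) / a)

# Synthetic source/target spaces linked by a noisy linear map.
rng = np.random.default_rng(0)
d_src, d_dst = 32, 24
W_true = rng.normal(size=(d_src, d_dst)) / np.sqrt(d_src)

def make_pairs(n):
    X = rng.normal(size=(n, d_src))
    return X, X @ W_true + 0.5 * rng.normal(size=(n, d_dst))

X_test, Y_test = make_pairs(500)
ns = [50, 200, 800]
fves = []
for n in ns:
    X_train, Y_train = make_pairs(n)  # fresh pairs, disjoint from the test set
    fves.append(fve(Y_test, X_test @ fit_ridge(X_train, Y_train)))

print(dict(zip(ns, np.round(fves, 3))))
```

On real activations the fitted curve would be compared against the fixed FVE of the text route, with `crossover_n` giving the number of pairs at which the extrapolated ridge line reaches it.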

§9

Synthesis

#	claim
1	The release is solid: both families' headline numbers reproduce on held-out data; zero silent failures.
2	An NLA explanation = confabulated context-reconstructions (payload) in format-keyed scaffolding the AR co-evolved to parse.
3	The system is at its ceiling (~0.80): residual error is structural, not sampling noise.
4	Robust where it matters (vector side), brittle where it doesn't (delimiter-keyed text parsing).
5	The text bottleneck is a strong cross-model interlingua: 68–79% of within-family fidelity, worth ≳10⁵ supervised pairs.

Honest scorecard of pre-registered guesses:

result	pre-registered guess	outcome
✓	Quote ban → 0.55–0.68	0.616, but only after closing the single-quote escape hatch
✗	Spans are a copy-channel	no copy channel exists (§3)
✗	Spans-only ≈ 0.7+	0.506; prose is not free-riding
✗	Best-of-10 ~0.83–0.87	0.799; underestimated policy convergence
~	Cross-family 0.35–0.55	guess was low (observed 0.528 and 0.592); representations more convergent than guessed
✗	Ridge closes much of the gap	decisively behind at 20k pairs

Verdict

The release is solid; the most interesting findings are not in the paper’s headline. The explanations are confabulated reconstructions wearing quotation syntax, parsed by the AR as a semi-formal code — and the text bottleneck transfers across model families well enough to beat a 20k-pair linear adapter, the strongest single result of the series.