Natural Language Autoencoders · follow-up series
01

Corroboration & ablation series

Anthropic's released NLA checkpoints, reproduced on held-out data and taken apart: where the reconstruction signal actually lives.

date: 2026-06-10/11
hardware: 2×H200 141GB
subject: nla-gemma3-27b-L41-{av,ar} · nla-qwen2.5-7b-L20
eval: 1000 acts · UFW en 100k–100.2k · seed 42 · FVE

Gemma FVE (ours): 0.775 (released 0.763)
Qwen FVE (ours): 0.752 (released 0.752, exact)
spans-only FVE: 0.506 (prose-only: −0.09)
cross-family FVE: 0.592 (79% of within ceiling)

§1

Corroboration of the released numbers

An NLA is a pair of fine-tuned LMs: the AV (activation verbalizer) injects a residual-stream vector as a single token embedding and writes a natural-language explanation; the AR (activation reconstructor) maps the text back to a vector. Round-trip fidelity is scored as FVE (fraction of variance explained) = 1 − NMSE/Var, where 0 matches a predict-the-mean baseline and 1 is perfect reconstruction.
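The FVE definition above can be sketched in a few lines. This is a minimal illustration, not the series' evaluation code: it assumes NMSE is the mean squared distance between already-normalized gold and reconstructed vectors, and Var is the mean squared distance of the gold vectors from their own mean (the helper names `sq_dist` and `fve` are hypothetical).

```python
from typing import Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fve(gold: Sequence[Vector], recon: Sequence[Vector]) -> float:
    """FVE = 1 - NMSE / Var.

    Assumption: both inputs are already in the normalized space, so NMSE is
    the mean squared error there and Var is the total variance of `gold`.
    """
    n = len(gold)
    dim = len(gold[0])
    mean = [sum(v[j] for v in gold) / n for j in range(dim)]
    nmse = sum(sq_dist(g, r) for g, r in zip(gold, recon)) / n
    var = sum(sq_dist(g, mean) for g in gold) / n
    return 1.0 - nmse / var
```

Under this reading, returning the gold vectors gives 1 and returning the dataset mean for every input gives 0, which matches the two anchor points in the definition. Whether errors are summed or averaged over dimensions does not matter, since the factor cancels in the ratio.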

metric | ours (UFW held-out) | released
Gemma-27B fve_nrm | 0.775 | 0.763 (train, 50/50 WildChat+UFW)
Qwen-7B fve_nrm | 0.752 | 0.752 (exact)
Var(v_nrm), Gemma L41 | 0.0575 | 0.0579 (train)

95.8% of samples score FVE > 0.5; zero missing-tag decodes in 1000. Variance baselines matching to the third decimal is itself a pipeline sanity check.

Finding: Corroborated. Gemma slightly above the released figure is expected: clean mid-document UFW prose is the easy end of their training mix. Note: L41 is a low-variance layer, so cosine compresses to 0.95–0.99 for everything from perfect to garbage; FVE is the only discriminative metric here.
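The cosine compression can be checked with a short calculation. The sketch below assumes the vectors are unit-normalized and that Var(v_nrm) is their total variance around the mean; under those assumptions a predict-the-mean baseline (FVE = 0) already has average cosine sqrt(1 − Var), and a unit-norm reconstruction at a given FVE has cosine 1 − NMSE/2. The function names are illustrative.

```python
import math

VAR_NRM = 0.0575  # Var(v_nrm) for Gemma L41, from the table above


def mean_baseline_cosine(var: float) -> float:
    """Average cosine between unit vectors and their mean direction.

    For unit vectors with mean mu: Var = E||v - mu||^2 = 1 - ||mu||^2,
    and E[cos(v, mu)] = ||mu|| = sqrt(1 - Var).
    """
    return math.sqrt(1.0 - var)


def cosine_at_fve(fve: float, var: float) -> float:
    """Average cosine of a unit-norm reconstruction with the given FVE.

    Uses ||a - b||^2 = 2 * (1 - cos) for unit vectors a and b.
    """
    nmse = (1.0 - fve) * var
    return 1.0 - nmse / 2.0


print(round(mean_baseline_cosine(VAR_NRM), 3))  # 0.971 (FVE = 0 baseline)
print(round(cosine_at_fve(0.775, VAR_NRM), 3))  # 0.994 (reported Gemma FVE)
```

The whole range from an uninformative baseline to the reported fidelity spans roughly 0.02 of cosine, which is why FVE is the more discriminative metric at this layer.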
§2

Quote ablation at generation time

AV explanations lean heavily on quotation. Banning quote tokens at generation time (every vocab token containing a banned character) tests how much of the fidelity is the quoting strategy itself.
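A minimal sketch of the ban construction described above: collect every vocabulary token whose string contains a banned character, then mask those ids during decoding. The toy vocabulary, the inclusion of curly quotes in each family, and the helper names are assumptions for illustration; the source does not list the exact characters used.

```python
NEG_INF = float("-inf")

# Assumed character families; the original runs may have used different sets.
DOUBLE_QUOTES = set('"\u201c\u201d')
SINGLE_QUOTES = set("'\u2018\u2019")


def banned_token_ids(vocab: dict[str, int], banned_chars: set[str]) -> list[int]:
    """Ids of every vocab token whose string contains a banned character."""
    return sorted(i for tok, i in vocab.items() if banned_chars & set(tok))


def mask_logits(logits: list[float], banned_ids: list[int]) -> list[float]:
    """Set banned tokens' logits to -inf so they can never be sampled."""
    banned = set(banned_ids)
    return [NEG_INF if i in banned else x for i, x in enumerate(logits)]


# Toy vocabulary (hypothetical); a real run would iterate the tokenizer's vocab.
toy_vocab = {"the": 0, '"': 1, ' "He': 2, "'s": 3, "said": 4, "\u201d.": 5}
ban_double = banned_token_ids(toy_vocab, DOUBLE_QUOTES)
ban_both = banned_token_ids(toy_vocab, DOUBLE_QUOTES | SINGLE_QUOTES)
masked = mask_logits([0.1, 2.0, 1.5, 0.3, 0.7, 0.2], ban_both)
```

Matching on characters rather than whole tokens matters because multi-character tokens such as ` "He` would otherwise slip through a ban on the bare quote token.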

Fig. 1. FVE by generation-time ban condition. Banning double quotes alone doesn't bind: 982/1000 decodes switch to single quotes. Banning both families forces genuine paraphrase at a ~0.17 FVE cost; 21% of decodes still smuggle delimiters via corner brackets, italics, or parentheses.
Finding: The policy is strikingly persistent about quoting: ban a delimiter and it substitutes another. Forced to pure paraphrase, the system retains ~78% of its FVE, so quoting is a high-reward habit, not the load-bearing mechanism.
§3

The quotes are confabulated, not copied

Are the “quotes” verbatim reproductions of the source context? Measured: longest contiguous word n-gram shared between each explanation and its true source, against a shuffled-context null.

decode set | median run | p90 | ≥6-word runs | null ≥6
quoting freely | 0 | 4 | 2% | 0%
quote-banned, hatch-free | 0 | 4 | 1% | 0%
Finding: Even when quoting freely, quoted spans almost never match the real context beyond ~4 consecutive words. This is necessarily so: the AV never sees text, only one 5376-dim vector. Every “quote” is a reconstruction generated from the activation, dressed in quotation marks. There is no copy channel; the §2 drop is format loss.
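The overlap measurement in this section (longest contiguous word n-gram shared with the true source, against a shuffled-context null) can be sketched as follows. Word splitting by a simple regex, lowercasing, and the cyclic shuffle used for the null are assumptions; the source does not specify these details, and the function names are illustrative.

```python
import random
import re
from statistics import median


def words(text: str) -> list[str]:
    """Lowercased word tokens; punctuation and quote marks are dropped."""
    return re.findall(r"\w+", text.lower())


def longest_shared_run(a: str, b: str) -> int:
    """Length in words of the longest contiguous word sequence found in both texts."""
    x, y = words(a), words(b)
    best = 0
    prev = [0] * (len(y) + 1)
    for xi in x:
        cur = [0] * (len(y) + 1)
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                cur[j] = prev[j - 1] + 1  # extend the run ending at the previous words
                best = max(best, cur[j])
        prev = cur
    return best


def overlap_summary(explanations: list[str], contexts: list[str], seed: int = 42) -> dict:
    """Run statistics for true pairs and for a shuffled-context null (needs n >= 2)."""
    n = len(explanations)
    order = list(range(n))
    random.Random(seed).shuffle(order)
    # Null: each explanation gets the context of the next item in a random cycle,
    # so no explanation is ever paired with its own source.
    null_ctx = {order[k]: contexts[order[(k + 1) % n]] for k in range(n)}
    true_runs = [longest_shared_run(e, c) for e, c in zip(explanations, contexts)]
    null_runs = [longest_shared_run(e, null_ctx[i]) for i, e in enumerate(explanations)]
    return {
        "median_run": median(true_runs),
        "frac_ge6": sum(r >= 6 for r in true_runs) / n,
        "null_frac_ge6": sum(r >= 6 for r in null_runs) / n,
    }
```

The dynamic program is the standard longest-common-substring recurrence applied to word lists, which costs O(len(a) × len(b)) per pair and is cheap at 1000 pairs.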
§4

Post-hoc redaction: where the signal lives

The AV generates normally (fully in-distribution); the explanation is edited before the AR sees it. ~9.4 quote spans per decode, ~23% of characters.

Fig. 2. FVE after post-hoc surgery on the explanation. Spans alone keep 0.506; identical span text without the quote marks drops to 0.196, so the delimiters alone are worth 0.31 FVE. Prose alone is at or below predict-the-mean, and junkier replacement tokens actively misdirect the AR. All edited rows are AR-OOD, hence lower bounds.
Finding: The encoding is entangled and format-keyed: spans ≈ payload, marks ≈ type annotations, prose ≈ relational glue. When quoting is blocked at generation time (§2), the AV re-packs the payload into prose (0.616); when the spans are removed post-hoc, the remaining prose is useless. The FVE metric predominantly validates the quoted-span channel; the human-readable descriptive prose contributes only jointly.
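The redaction conditions above can be illustrated with a small text-surgery sketch. It assumes a quote span is text between straight or curly double quotes and that spans are simply joined by spaces after editing; the actual span detection and edit rules are not given in the source, and all function names here are hypothetical.

```python
import re

# Assumption: a quote span is delimited by straight or curly double quotes.
SPAN = re.compile(r'"[^"]*"|“[^”]*”')


def spans_only(text: str) -> str:
    """Keep the quoted spans, delimiters included; drop all prose."""
    return " ".join(SPAN.findall(text))


def spans_unmarked(text: str) -> str:
    """Keep the same span text but strip the surrounding quote marks."""
    return " ".join(s[1:-1] for s in SPAN.findall(text))


def prose_only(text: str, filler: str = "") -> str:
    """Replace each quoted span with a filler string, leaving only the prose."""
    return re.sub(r"\s+", " ", SPAN.sub(filler, text)).strip()


def quoted_char_fraction(text: str) -> float:
    """Share of characters that sit inside quote spans (delimiters included)."""
    return sum(len(s) for s in SPAN.findall(text)) / len(text)


example = 'The text discusses "solar panels" and then "grid storage" costs.'
print(spans_only(example))      # "solar panels" "grid storage"
print(spans_unmarked(example))  # solar panels grid storage
print(prose_only(example))      # The text discusses and then costs.
```

The `filler` argument corresponds to the replacement-token variants mentioned in the caption: an empty string removes the span, while any other string stands in for it.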
§5

Best-of-N: the policy has converged

Ten temp-1.0 samples per activation, best-of-prefix-N selected by the same AR that defines the metric — the training reward used as a test-time verifier. Zero tag failures in 10,000 sampled decodes.

Fig. 3. FVE vs N (log-ish x, selection over prefix-N samples). A single temp-1 sample ≈ greedy; ten-way selection buys +0.027 total at ~+0.01 per doubling, decaying. The residual ~0.20 FVE is structural, not stochastic.
Finding: GRPO (8 samples/prompt in training) already squeezed out the selection slack. More test-time sampling cannot push past ~0.80; the released numbers sit close to this system’s ceiling, and the self-dealing concern (the AR both selects the sample and defines the metric) is bounded by the same result.
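Best-of-prefix-N selection can be sketched as below. The sketch assumes each sampled explanation has already been scored by the AR as a squared reconstruction error, that selection keeps the lowest-error sample among the first N, and that FVE is then computed over the selected samples; `best_of_prefix_fve` and its inputs are illustrative names, not the series' code.

```python
from statistics import mean


def best_of_prefix_fve(sq_errors: list[list[float]], var: float) -> dict[int, float]:
    """FVE when, for each activation, the best of the first N samples is kept.

    sq_errors[i][k] is the squared reconstruction error of sample k for
    activation i; var is the variance baseline used in FVE = 1 - NMSE / Var.
    """
    n_samples = len(sq_errors[0])
    return {
        n: 1.0 - mean(min(row[:n]) for row in sq_errors) / var
        for n in range(1, n_samples + 1)
    }


# Toy numbers (hypothetical, not from the experiments).
curve = best_of_prefix_fve([[0.02, 0.01, 0.03], [0.01, 0.02, 0.005]], var=0.05)
```

Using prefixes of one fixed sample set makes the curve non-decreasing in N by construction, so any flattening reflects the sample distribution rather than resampling noise.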
§6

Adjacent-layer transfer

input | cos to L41 | NMSE | FVE (own denom)
L41 (trained) | n/a | 0.013 | 0.774
L40 | 0.997 | 0.0167 | 0.695
L42 | 0.9971 | 0.0187 | 0.703
Finding: Observed error sits at or below the additive bound (intrinsic 0.013 + inter-layer offset 0.006): the entire FVE drop is explained by geometry between layers. The NLA’s input side is robust to 0.997-cos perturbations; its brittleness (§4) is in the text format, not the vector interface.
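The additive bound can be reproduced from the table. The sketch assumes unit-normalized vectors, so the squared offset between adjacent layers is 2(1 − cos), and treats the intrinsic reconstruction error and the layer offset as uncorrelated so that their squared errors add. It also recovers the variance implied by the L41 row, which should agree with Var(v_nrm) from §1. Function names are illustrative.

```python
INTRINSIC_NMSE = 0.013  # NMSE at the trained layer L41, from the table above


def unit_offset_sq(cos: float) -> float:
    """Squared distance between two unit vectors: ||a - b||^2 = 2 * (1 - cos)."""
    return 2.0 * (1.0 - cos)


def additive_bound(cos: float) -> float:
    """Intrinsic error plus inter-layer offset (assumes the two are uncorrelated)."""
    return INTRINSIC_NMSE + unit_offset_sq(cos)


def implied_var(nmse: float, fve: float) -> float:
    """Variance baseline implied by FVE = 1 - NMSE / Var."""
    return nmse / (1.0 - fve)


for layer, cos, observed in [("L40", 0.997, 0.0167), ("L42", 0.9971, 0.0187)]:
    print(layer, round(additive_bound(cos), 4), observed <= additive_bound(cos))

print(round(implied_var(0.013, 0.774), 4))  # compare with Var(v_nrm) = 0.0575
```

Both adjacent layers land at or under the bound, and the implied L41 variance matches the §1 baseline to four decimals, which supports reading NMSE as squared error on normalized vectors.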
§7

Cross-family transfer

Both models’ activations extracted at the same document positions (cross-tokenizer alignment by character offset, mean error 0.3 chars). Each cell: row-AV’s explanation → column-AR, scored against the AR-side family’s own gold.
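One way to align positions across tokenizers by character offset is sketched below. It assumes each tokenizer provides the end character offset of every token in the shared document and maps a position in one tokenization to the token in the other whose end offset is nearest; the source only states that alignment was by character offset, so the nearest-end rule and the names `nearest_token` and `align` are assumptions.

```python
from bisect import bisect_left


def nearest_token(ends: list[int], target: int) -> int:
    """Index of the token whose end offset is closest to target (ends sorted ascending)."""
    i = bisect_left(ends, target)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(ends)]
    return min(candidates, key=lambda j: abs(ends[j] - target))


def align(ends_a: list[int], ends_b: list[int], positions_a: list[int]) -> list[tuple[int, int, int]]:
    """Map token indices in tokenization A to tokenization B.

    Returns (index_a, index_b, char_error) triples, where char_error is the
    absolute difference between the two tokens' end offsets.
    """
    out = []
    for ia in positions_a:
        ib = nearest_token(ends_b, ends_a[ia])
        out.append((ia, ib, abs(ends_b[ib] - ends_a[ia])))
    return out


# Hypothetical end offsets for the same 20-character string under two tokenizers.
pairs = align(ends_a=[2, 12, 20], ends_b=[3, 7, 12, 20], positions_a=[0, 1, 2])
mean_error = sum(err for _, _, err in pairs) / len(pairs)
```

Averaging `char_error` over all sampled positions gives a figure comparable in kind to the mean alignment error quoted above.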

row AV, column AR | → AR qwen | → AR gemma
AV gemma | 0.592 (cross · null −0.89) | 0.774 (within · null −0.76)
AV qwen | 0.752 (within · null −0.91) | 0.528 (cross · null −0.66)
Fig. 4. Transfer matrix, FVE per cell with shuffled-pair nulls. Cross-family transfer reaches 79% (gemma→qwen) and 68% (qwen→gemma) of the within-family ceilings; nulls at −0.66 to −0.91 confirm the signal is position-specific, not corpus style.
Finding: The NL-expressible content of residual streams is substantially model-agnostic across architecture, tokenizer, and a 4× scale gap. Asymmetry: the 27B AV’s explanations survive a foreign decoder better than the 7B’s, which suggests explanation quality scales with AV capacity. Cross-family via text costs about as much as moving one layer within the same model.
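The percentages quoted for cross-family transfer follow from dividing each cross cell by the within-family score of the AR-side family. A short check, with the cell values taken from the transfer matrix and the dictionary layout chosen only for illustration:

```python
# Within-family FVE: AV and AR from the same family.
WITHIN = {"gemma": 0.774, "qwen": 0.752}
# Cross-family FVE, keyed by (AV family, AR family).
CROSS = {("gemma", "qwen"): 0.592, ("qwen", "gemma"): 0.528}


def fraction_of_ceiling(av: str, ar: str) -> float:
    """Cross-family FVE as a share of the AR-side family's within-family FVE."""
    return CROSS[(av, ar)] / WITHIN[ar]


for av, ar in CROSS:
    print(f"{av}->{ar}: {fraction_of_ceiling(av, ar):.0%}")
# gemma->qwen: 79%
# qwen->gemma: 68%
```

The ceiling is taken on the AR side because each cell is scored against the AR-side family's own gold vectors.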
§8

Linear baseline: ridge + n-scaling

Kernel ridge on unit-normalized vectors, both directions, trained on fresh disjoint docs (UFW 102k–106k), evaluated on the same 1000-pair set. The text route uses zero paired examples.
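A minimal sketch of such a baseline, assuming the kernel is linear (consistent with the section title) and using NumPy. With a linear kernel, kernel ridge reduces to ordinary ridge solved in dual form, which is convenient when the number of pairs is comparable to the vector dimension. The regularization strength, the synthetic data, and the function names are assumptions; none of this reproduces the experiment's settings.

```python
import numpy as np


def normalize(V: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm."""
    return V / np.linalg.norm(V, axis=1, keepdims=True)


def fit_ridge_dual(X: np.ndarray, Y: np.ndarray, lam: float) -> np.ndarray:
    """Kernel ridge with a linear kernel; returns W such that X @ W approximates Y."""
    K = X @ X.T                                            # (n, n) Gram matrix
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), Y)   # dual coefficients
    return X.T @ alpha                                     # (d_src, d_tgt)


def fve(gold: np.ndarray, pred: np.ndarray) -> float:
    """FVE = 1 - NMSE / Var, with Var the total variance of the gold vectors."""
    nmse = np.mean(np.sum((gold - pred) ** 2, axis=1))
    var = np.mean(np.sum((gold - gold.mean(axis=0)) ** 2, axis=1))
    return float(1.0 - nmse / var)


# Synthetic stand-in data (not the experiment's activations).
rng = np.random.default_rng(0)
X = normalize(rng.normal(size=(200, 16)))   # source-model vectors
Y = X @ rng.normal(size=(16, 12))           # target vectors, exactly linear in X
W = fit_ridge_dual(X[:150], Y[:150], lam=1e-6)
demo_fve = fve(Y[150:], X[150:] @ W)        # held-out fit on the last 50 rows
```

On this toy data the map is recoverable almost exactly; on real activations the held-out FVE as a function of training pairs is the quantity of interest.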

Fig. 5. Ridge FVE vs training pairs (log x), against the zero-shot NLA text route (dashed). Per-doubling ridge gains decelerate; log-linear extrapolation puts the crossover at ~10⁵ pairs for gemma→qwen and ~10⁷ (effectively never) for qwen→gemma.
Finding: The zero-paired-data text route beats a 20k-pair supervised linear map in both directions. At equal or 20×-favorable data budgets, language beats linear alignment as a cross-model representation transporter.
§9

Synthesis

# | claim
1 | The release is solid: both families' headline numbers reproduce on held-out data; zero silent failures.
2 | An NLA explanation = confabulated context-reconstructions (payload) in format-keyed scaffolding the AR co-evolved to parse.
3 | The system is at its ceiling (~0.80): residual error is structural, not sampling noise.
4 | Robust where it matters (vector side), brittle where it doesn't (delimiter-keyed text parsing).
5 | The text bottleneck is a strong cross-model interlingua: 68–79% of within-family fidelity, worth ≳10⁵ supervised pairs.

Honest scorecard of pre-registered guesses:

Verdict
The release is solid; the most interesting findings are not in the paper’s headline. The explanations are confabulated reconstructions wearing quotation syntax, parsed by the AR as a semi-formal code — and the text bottleneck transfers across model families well enough to beat a 20k-pair linear adapter, the strongest single result of the series.