← NLA experimentsNatural Language Autoencoders · follow-up series
02

Evaluation & ablation report

The full report of record: seven headline results, the drift experiment that falsified the geometric ruler, and a look inside the autoencoder itself.

date
2026-06-10/11
hardware
2×H200 141GB · 2TB RAM
subject
nla-gemma3-27b-L41-{av,ar} · nla-qwen2.5-7b-L20
artifacts
fve_eval/ results · fve_*.py scripts
TL;DR — seven results
  1. 1Released FVE numbers corroborate on held-out Ultra-FineWeb: Gemma-27B 0.775 (released 0.763), Qwen-7B 0.752 (released 0.752, exact).
  2. 2The AV’s ubiquitous “quotes” are confabulated reconstructions, not copies (~zero verbatim overlap with true context) — and they carry essentially all of the AR-usable signal; descriptive prose alone reconstructs at predict-the-mean level.
  3. 3The code is format-keyed: quote marks alone are worth 0.31 FVE; banning quote tokens at generation forces the AV to re-pack content into prose at a 0.16 FVE cost.
  4. 4Best-of-10 buys only +0.03 FVE — the GRPO policy has converged (single temp-1 sample ≈ greedy); the ~0.78 score is a structural ceiling, not sampling noise. Explanation length is likewise RL-pinned at 148±2 tokens.
  5. 5Cross-family transfer through text is strong: Gemma explanations reconstruct Qwen’s activations at 79% of Qwen’s own ceiling (and vice versa at 68%) — far above a ridge map fit on 20k aligned pairs.
  6. 6Frozen NLAs survive finetuning drift far better than geometry predicts (FVE 0.42 on a cos-0.975-drifted model where additive arithmetic predicts < 0); drift is structured, not content-destroying.
  7. 7The AR’s value head is a near-no-op (pre-head state already at cos 0.993 to the target): the trained backbone terminates at the answer.
0.775
Gemma FVE, held-out
released 0.763 · Qwen exact
0.423
FVE on drifted AV model
geometry predicted −0.10
0.807
AR-backbone decode
beats base gold 0.774
148±2
RL-pinned length
tokens, max 151
§0

Setup & the RL-pinned budget

Eval substrate throughout: 1000 activations from Ultra-FineWeb en docs 100,000–100,200 (held out; training used docs 0–100,000 of the same split), repo stage-0 position sampling, seed 42, bit-reproducible. Extraction is the repo hook convention — output of decoder block 41 = HF hidden_states[42]. Greedy decode via patched SGLang 0.5.9. Metric: FVE = 1 − NMSE/Var against the gold set’s own predict-the-mean variance (Gemma L41: 0.0575; Qwen L20: 0.712). Cosine is nearly useless at Gemma L41 — everything lands 0.95–0.99 — so FVE is the discriminative metric.

One methodological find worth its own flag: the authors’ RL excludes cap-hitting generations from reward rather than truncating them, to neutralize a measured length-hacking gradient (reward–length correlation 0.099). The trained policy responded by becoming de facto constant-N.

148±2
greedy decode length
tokens
151
maximum observed
over 1000 decodes
150
RL generation cap
200 at the client
0.099
reward–length corr
the gradient being policed
ObservationThe policy surfs the internalized RL cap and spends the whole budget every time — hence the trailing “or X. or Y.” padding in decodes. Explanation length is not adaptive to content; it is a fixed purchase of 148±2 tokens.
§1

Corroboration

NMSEcosFVEreleased
Gemma-27B-L41, 1000 held-out UFW0.0130.99350.7750.763
Qwen-7B-L20, same positions (char-aligned)0.17630.91180.7520.752 — exact

Eval-set variance baseline also reproduces (0.0575 vs released 0.0579). 0 tag failures in 1000 greedy decodes.

FindingThe extraction/tokenization/scoring pipeline matches the authors’. Slightly-above-released for Gemma is consistent with UFW prose being the clean half of their 50/50 WildChat+UFW train mix.
§2

Quote ablations: where the signal lives

Generation-time bans (logit_bias −1e9): banning the "-family alone doesn’t bind — 98% of decodes switch to '...'. Banning both families (1651 tokens) lands 0.616 FVE, with escape hatches (「」, *span*, paren-verbatim) recovering 0.664 on the subset that uses them vs 0.603 hatch-free. And the quotes were never verbatim: median longest shared word-run with the true context is 0, p90 is 4, whether quoting freely or banned — every “quote” is a confabulated reconstruction of context the AV never saw.

Fig. 1 Post-hoc surgery on the explanation text (AV unconstrained, AR input edited). Spans with marks keep 0.506; the same spans bare drop to 0.196 — the quote marks alone are worth 0.31 FVE. Prose alone sits at predict-the-mean, and noisier replacement tokens actively misdirect the AR. All edited rows are AR-OOD lower bounds.
FindingCombined picture: spans = payload, marks = type tags, prose = relational glue — an entangled, format-keyed code, with the payload relocatable into prose (at 0.16 FVE) when generation is constrained.
§3

Best-of-N: converged policy

selectionNMSEFVE
greedy reference0.0130.775
1 sample0.01320.772
best of 2 / 3 / 50.0125 / 0.0122 / 0.01200.784 / 0.789 / 0.794
best of 100.01160.799
mean over all 100.01320.772

Log-flat, ~+0.01 FVE per doubling. 0/10,000 tag failures.

FindingSingle sample ≈ greedy: GRPO (8 samples/prompt in training) already squeezed out the selection slack — almost no harvestable variance, correspondingly little room to game the verifier. The residual ~0.2 FVE is structural (model capacity, the 148-token budget, what’s verbalizable), not stochastic.
§4

Finetuned-model drift: the ruler breaks

Adjacent layers first, as the control: L40/L42 activations through the L41-trained NLA lose exactly what geometry says they should — observed NMSE ≤ intrinsic (0.013) + layer offset (2(1−0.997) = 0.006), FVE 0.695/0.703 vs 0.774. The drop is entirely the geometry between layers. That additive “geometric ruler” is the natural prediction tool for any drifted input. Then we fed the frozen NLA activations from finetuned Gemmas — conveniently, the AV and AR checkpoints themselves are drifted Gemmas.

pair (same text/positions)NMSEcosshared-variance fraction
base vs AR backbone0.0150.99250.74 — MSE-trained backbone barely moved
base vs AV model0.05010.9750.13 — SFT+GRPO rebuilt L41 prose encoding
Fig. 2 Geometric prediction vs measured FVE for frozen-NLA decoding of drifted activations (scored against each model's own gold). The additive ruler says the AV-model's cos-0.975 drift should land below predict-the-mean (−0.10); measured: 0.423. The AR backbone lands 0.807 — above the base model's own 0.774.
inputgeometric predictionmeasured FVE (own gold)vs base gold
AV-model L41−0.100.4230.52
AR-backbone L410.510.8070.671
HeadlineThe additive ruler is falsified for weight drift: finetuning moves representations via structured, shared transforms (mean/norm/anisotropy) that the injection-normalized AV ignores, while position-specific content survives. NLA readability radius extends well beyond cos 0.975. Bonus anomaly: AR-backbone activations decode better than base(0.807 > 0.774) — a lower-variance, “canonicalized” stream plus possible scoring-critic self-affinity. The 0.78 ceiling is input-distribution-relative.
§5

Cross-family transfer & the ridge baseline

→ AR qwen
→ AR gemma
AV gemma
0.592within · null -0.89
0.774cross · null -0.76
AV qwen
0.752cross · null -0.91
0.528within · null -0.66
Fig. 3 Gemma-3-27B-L41 × Qwen2.5-7B-L20, same document positions (char-offset alignment, mean error 0.3 chars). Cross transfer = 79% (gemma→qwen) and 68% (qwen→gemma) of within-family ceilings; shuffled-pair nulls run −0.66 to −0.91, so the signal is position-specific, not corpus style.
n train pairsgemma→qwenqwen→gemma
ridge, 1,0000.0950.085
ridge, 5,0000.30.172
ridge, 20,0000.4220.24
NLA text route (0 paired examples)0.5920.528

Ridge: 5-fold/held-out CV, unit-normalized, trained on disjoint UFW docs 102k–106k. The W ∈ ℝ^{5376×3584} map is data-starved even at 20k pairs (~4·d_in), and the curve is decelerating.

FindingThe zero-paired-data text bottleneck beats a 20k-pair linear map by 0.17–0.29 FVE. Asymmetry (the 27B AV transfers better) says explanation quality scales with AV capacity — exposed more sharply by a foreign AR than by the co-adapted home one. Cross cells are AR-OOD lower bounds.
§6

Inside the autoencoder

Notation: a = injected base vector; b= the AV’s own layer-41 residual at the injection position during its forward pass; c= the AR’s final-block last-token state, before the value head.

comparisonNMSEcosFVE vs gold var
a vs b0.1890.906−2.28
b vs c0.210.895−0.72
a vs c0.01340.99330.767
a vs value_head(c) — full round trip0.0130.99350.774
Fig. 4 NMSE between internal states (lower = closer). Encoder-internal (b) and decoder-internal (c) states never meet each other — both sit far from everything in variance terms — yet the AR's pre-head state c lands almost exactly on the original vector a. The value head adds +0.007 FVE, mostly norm correction (c norms run 33% hot).
ArchitectureThe value head is a near-no-op: the AR backbone’s residual stream already terminates at the reconstruction (cos 0.993 pre-head). “Truncated LM + linear head” trains into “truncated LM that ends in the answer” — one could skip value_head.safetensors at a cost of 0.007 FVE. Meanwhile the AV’s stream at the injection slot departs hard from a (−2.28 in variance units) — consistent with the vector being parsedrather than carried, though one position × one layer can’t distinguish “consumed” from “dispersed across positions”. Encoder- and decoder-internal states do not meet; the decoder converges to a itself. Text is the only interface.
§7

Limitations & ranked follow-ups

Limitations.Every post-hoc text edit (§2) and cross-AR pairing (§5) is AR-OOD — those FVEs are lower bounds with an unquantified format-mismatch component. Best-of-N selection reuses the scoring AR — mooted by tiny gains, but not clean. §6’s a-vs-b is one position × one borrowed layer index. Samples cluster 5-per-doc, so effective CIs ≈ ±0.01–0.015 FVE. The AR-backbone self-affinity in §4 is unresolved.

  1. 1
    Probe curve through the AV stacklinear recoverability of the injected vector at every layer×position; settles consumed-vs-dispersed and is the corrected version of §6’s weakest cell. Decoder twin: when does the AR’s stream snap to a?
  2. 2
    Prose-paraphrase rescoringparaphrase explanations with a third model (preserve meaning, destroy format), re-score: separates semantic content from format-conformity in the 0.775.
  3. 3
    gemma-3-27b-ptinstruction-tuning as the largest natural “finetune”; §4b now predicts readable, the additive ruler predicts not; cheap decisive test of the readability radius.
  4. 4
    LoRA ladder(UFW control → alien domain → behavioral persona) with explanation-content diffing — the “does the NLA verbalize what the finetune did” question the paper’s auditing pitch implies; checkpointed FVE-vs-steps curves.
  5. 5
    Distant layers (L31/L51)the layer-transfer experiment where cross-layer cos actually diverges.
  6. 6
    Context-length isolationsame position, varied prefix length: how much of the vector is local vs global context.
NextTop of the queue: the AV-stack probe curve (fixes §6’s weakest measurement), paraphrase rescoring (splits semantics from format in the 0.775), and gemma-3-27b-pt as the largest natural finetune — the cheap, decisive test of the readability radius §4 opened up.