← NLA experimentsNatural Language Autoencoders · follow-up series

Evaluation & ablation report

The full report of record: seven headline results, the drift experiment that falsified the geometric ruler, and a look inside the autoencoder itself.

date: 2026-06-10/11
hardware: 2×H200 141GB · 2TB RAM
subject: nla-gemma3-27b-L41-{av,ar} · nla-qwen2.5-7b-L20
artifacts: fve_eval/ results · fve_*.py scripts

TL;DR — seven results

1Released FVE numbers corroborate on held-out Ultra-FineWeb: Gemma-27B 0.775 (released 0.763), Qwen-7B 0.752 (released 0.752, exact).
2The AV’s ubiquitous “quotes” are confabulated reconstructions, not copies (~zero verbatim overlap with true context) — and they carry essentially all of the AR-usable signal; descriptive prose alone reconstructs at predict-the-mean level.
3The code is format-keyed: quote marks alone are worth 0.31 FVE; banning quote tokens at generation forces the AV to re-pack content into prose at a 0.16 FVE cost.
4Best-of-10 buys only +0.03 FVE — the GRPO policy has converged (single temp-1 sample ≈ greedy); the ~0.78 score is a structural ceiling, not sampling noise. Explanation length is likewise RL-pinned at 148±2 tokens.
5Cross-family transfer through text is strong: Gemma explanations reconstruct Qwen’s activations at 79% of Qwen’s own ceiling (and vice versa at 68%) — far above a ridge map fit on 20k aligned pairs.
6Frozen NLAs survive finetuning drift far better than geometry predicts (FVE 0.42 on a cos-0.975-drifted model where additive arithmetic predicts < 0); drift is structured, not content-destroying.
7The AR’s value head is a near-no-op (pre-head state already at cos 0.993 to the target): the trained backbone terminates at the answer.

0.775

Gemma FVE, held-out

released 0.763 · Qwen exact

0.423

FVE on drifted AV model

geometry predicted −0.10

0.807

AR-backbone decode

beats base gold 0.774

148±2

RL-pinned length

tokens, max 151

§0

Setup & the RL-pinned budget

methodology

Eval substrate throughout: 1000 activations from Ultra-FineWeb en docs 100,000–100,200 (held out; training used docs 0–100,000 of the same split), repo stage-0 position sampling, seed 42, bit-reproducible. Extraction is the repo hook convention — output of decoder block 41 = HF hidden_states[42]. Greedy decode via patched SGLang 0.5.9. Metric: FVE = 1 − NMSE/Var against the gold set’s own predict-the-mean variance (Gemma L41: 0.0575; Qwen L20: 0.712). Cosine is nearly useless at Gemma L41 — everything lands 0.95–0.99 — so FVE is the discriminative metric.

One methodological find worth its own flag: the authors’ RL excludes cap-hitting generations from reward rather than truncating them, to neutralize a measured length-hacking gradient (reward–length correlation 0.099). The trained policy responded by becoming de facto constant-N.

148±2

greedy decode length

tokens

151

maximum observed

over 1000 decodes

150

RL generation cap

200 at the client

0.099

reward–length corr

the gradient being policed

Observation —The policy surfs the internalized RL cap and spends the whole budget every time — hence the trailing “or X. or Y.” padding in decodes. Explanation length is not adaptive to content; it is a fixed purchase of 148±2 tokens.

§1

Corroboration

reproduced

	NMSE	cos	FVE	released
Gemma-27B-L41, 1000 held-out UFW	0.013	0.9935	0.775	0.763
Qwen-7B-L20, same positions (char-aligned)	0.1763	0.9118	0.752	0.752 — exact

Eval-set variance baseline also reproduces (0.0575 vs released 0.0579). 0 tag failures in 1000 greedy decodes.

Finding —The extraction/tokenization/scoring pipeline matches the authors’. Slightly-above-released for Gemma is consistent with UFW prose being the clean half of their 50/50 WildChat+UFW train mix.

§2

Quote ablations: where the signal lives

compact recap

Generation-time bans (logit_bias −1e9): banning the "-family alone doesn’t bind — 98% of decodes switch to '...'. Banning both families (1651 tokens) lands 0.616 FVE, with escape hatches (「」, *span*, paren-verbatim) recovering 0.664 on the subset that uses them vs 0.603 hatch-free. And the quotes were never verbatim: median longest shared word-run with the true context is 0, p90 is 4, whether quoting freely or banned — every “quote” is a confabulated reconstruction of context the AV never saw.

Fig. 1 — Post-hoc surgery on the explanation text (AV unconstrained, AR input edited). Spans with marks keep 0.506; the same spans bare drop to 0.196 — the quote marks alone are worth 0.31 FVE. Prose alone sits at predict-the-mean, and noisier replacement tokens actively misdirect the AR. All edited rows are AR-OOD lower bounds.

Finding —Combined picture: spans = payload, marks = type tags, prose = relational glue — an entangled, format-keyed code, with the payload relocatable into prose (at 0.16 FVE) when generation is constrained.

§3

Best-of-N: converged policy

temp 1.0 · AR-selected

selection	NMSE	FVE
greedy reference	0.013	0.775
1 sample	0.0132	0.772
best of 2 / 3 / 5	0.0125 / 0.0122 / 0.0120	0.784 / 0.789 / 0.794
best of 10	0.0116	0.799
mean over all 10	0.0132	0.772

Log-flat, ~+0.01 FVE per doubling. 0/10,000 tag failures.

Finding —Single sample ≈ greedy: GRPO (8 samples/prompt in training) already squeezed out the selection slack — almost no harvestable variance, correspondingly little room to game the verifier. The residual ~0.2 FVE is structural (model capacity, the 148-token budget, what’s verbalizable), not stochastic.

§4

Finetuned-model drift: the ruler breaks

frozen NLA · drifted inputs

Adjacent layers first, as the control: L40/L42 activations through the L41-trained NLA lose exactly what geometry says they should — observed NMSE ≤ intrinsic (0.013) + layer offset (2(1−0.997) = 0.006), FVE 0.695/0.703 vs 0.774. The drop is entirely the geometry between layers. That additive “geometric ruler” is the natural prediction tool for any drifted input. Then we fed the frozen NLA activations from finetuned Gemmas — conveniently, the AV and AR checkpoints themselves are drifted Gemmas.

pair (same text/positions)	NMSE	cos	shared-variance fraction
base vs AR backbone	0.015	0.9925	0.74 — MSE-trained backbone barely moved
base vs AV model	0.0501	0.975	0.13 — SFT+GRPO rebuilt L41 prose encoding

Fig. 2 — Geometric prediction vs measured FVE for frozen-NLA decoding of drifted activations (scored against each model's own gold). The additive ruler says the AV-model's cos-0.975 drift should land below predict-the-mean (−0.10); measured: 0.423. The AR backbone lands 0.807 — above the base model's own 0.774.

input	geometric prediction	measured FVE (own gold)	vs base gold
AV-model L41	−0.10	0.423	0.52
AR-backbone L41	0.51	0.807	0.671

Headline —The additive ruler is falsified for weight drift: finetuning moves representations via structured, shared transforms (mean/norm/anisotropy) that the injection-normalized AV ignores, while position-specific content survives. NLA readability radius extends well beyond cos 0.975. Bonus anomaly: AR-backbone activations decode better than base(0.807 > 0.774) — a lower-variance, “canonicalized” stream plus possible scoring-critic self-affinity. The 0.78 ceiling is input-distribution-relative.

§5

Cross-family transfer & the ridge baseline

compact recap

→ AR qwen

→ AR gemma

AV gemma

0.592within · null -0.89

0.774cross · null -0.76

AV qwen

0.752cross · null -0.91

0.528within · null -0.66

Fig. 3 — Gemma-3-27B-L41 × Qwen2.5-7B-L20, same document positions (char-offset alignment, mean error 0.3 chars). Cross transfer = 79% (gemma→qwen) and 68% (qwen→gemma) of within-family ceilings; shuffled-pair nulls run −0.66 to −0.91, so the signal is position-specific, not corpus style.

n train pairs	gemma→qwen	qwen→gemma
ridge, 1,000	0.095	0.085
ridge, 5,000	0.3	0.172
ridge, 20,000	0.422	0.24
NLA text route (0 paired examples)	0.592	0.528

Ridge: 5-fold/held-out CV, unit-normalized, trained on disjoint UFW docs 102k–106k. The W ∈ ℝ^{5376×3584} map is data-starved even at 20k pairs (~4·d_in), and the curve is decelerating.

Finding —The zero-paired-data text bottleneck beats a 20k-pair linear map by 0.17–0.29 FVE. Asymmetry (the 27B AV transfers better) says explanation quality scales with AV capacity — exposed more sharply by a foreign AR than by the co-adapted home one. Cross cells are AR-OOD lower bounds.

§6

Inside the autoencoder

on-policy internal states

Notation: a = injected base vector; b= the AV’s own layer-41 residual at the injection position during its forward pass; c= the AR’s final-block last-token state, before the value head.

comparison	NMSE	cos	FVE vs gold var
a vs b	0.189	0.906	−2.28
b vs c	0.21	0.895	−0.72
a vs c	0.0134	0.9933	0.767
a vs value_head(c) — full round trip	0.013	0.9935	0.774

Fig. 4 — NMSE between internal states (lower = closer). Encoder-internal (b) and decoder-internal (c) states never meet each other — both sit far from everything in variance terms — yet the AR's pre-head state c lands almost exactly on the original vector a. The value head adds +0.007 FVE, mostly norm correction (c norms run 33% hot).

Architecture —The value head is a near-no-op: the AR backbone’s residual stream already terminates at the reconstruction (cos 0.993 pre-head). “Truncated LM + linear head” trains into “truncated LM that ends in the answer” — one could skip value_head.safetensors at a cost of 0.007 FVE. Meanwhile the AV’s stream at the injection slot departs hard from a (−2.28 in variance units) — consistent with the vector being parsedrather than carried, though one position × one layer can’t distinguish “consumed” from “dispersed across positions”. Encoder- and decoder-internal states do not meet; the decoder converges to a itself. Text is the only interface.

§7

Limitations & ranked follow-ups

Limitations.Every post-hoc text edit (§2) and cross-AR pairing (§5) is AR-OOD — those FVEs are lower bounds with an unquantified format-mismatch component. Best-of-N selection reuses the scoring AR — mooted by tiny gains, but not clean. §6’s a-vs-b is one position × one borrowed layer index. Samples cluster 5-per-doc, so effective CIs ≈ ±0.01–0.015 FVE. The AR-backbone self-affinity in §4 is unresolved.

1
Probe curve through the AV stack— linear recoverability of the injected vector at every layer×position; settles consumed-vs-dispersed and is the corrected version of §6’s weakest cell. Decoder twin: when does the AR’s stream snap to a?
2
Prose-paraphrase rescoring— paraphrase explanations with a third model (preserve meaning, destroy format), re-score: separates semantic content from format-conformity in the 0.775.
3
gemma-3-27b-pt— instruction-tuning as the largest natural “finetune”; §4b now predicts readable, the additive ruler predicts not; cheap decisive test of the readability radius.
4
LoRA ladder— (UFW control → alien domain → behavioral persona) with explanation-content diffing — the “does the NLA verbalize what the finetune did” question the paper’s auditing pitch implies; checkpointed FVE-vs-steps curves.
5
Distant layers (L31/L51)— the layer-transfer experiment where cross-layer cos actually diverges.
6
Context-length isolation— same position, varied prefix length: how much of the vector is local vs global context.

Next —Top of the queue: the AV-stack probe curve (fixes §6’s weakest measurement), paraphrase rescoring (splits semantics from format in the 0.775), and gemma-3-27b-pt as the largest natural finetune — the cheap, decisive test of the readability radius §4 opened up.