Evaluation & ablation report
The full report of record: seven headline results, the drift experiment that falsified the geometric ruler, and a look inside the autoencoder itself.
- date
- 2026-06-10/11
- hardware
- 2×H200 141GB · 2TB RAM
- subject
- nla-gemma3-27b-L41-{av,ar} · nla-qwen2.5-7b-L20
- artifacts
- fve_eval/ results · fve_*.py scripts
- 1Released FVE numbers corroborate on held-out Ultra-FineWeb: Gemma-27B 0.775 (released 0.763), Qwen-7B 0.752 (released 0.752, exact).
- 2The AV’s ubiquitous “quotes” are confabulated reconstructions, not copies (~zero verbatim overlap with true context) — and they carry essentially all of the AR-usable signal; descriptive prose alone reconstructs at predict-the-mean level.
- 3The code is format-keyed: quote marks alone are worth 0.31 FVE; banning quote tokens at generation forces the AV to re-pack content into prose at a 0.16 FVE cost.
- 4Best-of-10 buys only +0.03 FVE — the GRPO policy has converged (single temp-1 sample ≈ greedy); the ~0.78 score is a structural ceiling, not sampling noise. Explanation length is likewise RL-pinned at 148±2 tokens.
- 5Cross-family transfer through text is strong: Gemma explanations reconstruct Qwen’s activations at 79% of Qwen’s own ceiling (and vice versa at 68%) — far above a ridge map fit on 20k aligned pairs.
- 6Frozen NLAs survive finetuning drift far better than geometry predicts (FVE 0.42 on a cos-0.975-drifted model where additive arithmetic predicts < 0); drift is structured, not content-destroying.
- 7The AR’s value head is a near-no-op (pre-head state already at cos 0.993 to the target): the trained backbone terminates at the answer.
Setup & the RL-pinned budget
Eval substrate throughout: 1000 activations from Ultra-FineWeb en docs 100,000–100,200 (held out; training used docs 0–100,000 of the same split), repo stage-0 position sampling, seed 42, bit-reproducible. Extraction is the repo hook convention — output of decoder block 41 = HF hidden_states[42]. Greedy decode via patched SGLang 0.5.9. Metric: FVE = 1 − NMSE/Var against the gold set’s own predict-the-mean variance (Gemma L41: 0.0575; Qwen L20: 0.712). Cosine is nearly useless at Gemma L41 — everything lands 0.95–0.99 — so FVE is the discriminative metric.
One methodological find worth its own flag: the authors’ RL excludes cap-hitting generations from reward rather than truncating them, to neutralize a measured length-hacking gradient (reward–length correlation 0.099). The trained policy responded by becoming de facto constant-N.
Corroboration
| NMSE | cos | FVE | released | |
|---|---|---|---|---|
| Gemma-27B-L41, 1000 held-out UFW | 0.013 | 0.9935 | 0.775 | 0.763 |
| Qwen-7B-L20, same positions (char-aligned) | 0.1763 | 0.9118 | 0.752 | 0.752 — exact |
Eval-set variance baseline also reproduces (0.0575 vs released 0.0579). 0 tag failures in 1000 greedy decodes.
Quote ablations: where the signal lives
Generation-time bans (logit_bias −1e9): banning the "-family alone doesn’t bind — 98% of decodes switch to '...'. Banning both families (1651 tokens) lands 0.616 FVE, with escape hatches (「」, *span*, paren-verbatim) recovering 0.664 on the subset that uses them vs 0.603 hatch-free. And the quotes were never verbatim: median longest shared word-run with the true context is 0, p90 is 4, whether quoting freely or banned — every “quote” is a confabulated reconstruction of context the AV never saw.
Best-of-N: converged policy
| selection | NMSE | FVE |
|---|---|---|
| greedy reference | 0.013 | 0.775 |
| 1 sample | 0.0132 | 0.772 |
| best of 2 / 3 / 5 | 0.0125 / 0.0122 / 0.0120 | 0.784 / 0.789 / 0.794 |
| best of 10 | 0.0116 | 0.799 |
| mean over all 10 | 0.0132 | 0.772 |
Log-flat, ~+0.01 FVE per doubling. 0/10,000 tag failures.
Finetuned-model drift: the ruler breaks
Adjacent layers first, as the control: L40/L42 activations through the L41-trained NLA lose exactly what geometry says they should — observed NMSE ≤ intrinsic (0.013) + layer offset (2(1−0.997) = 0.006), FVE 0.695/0.703 vs 0.774. The drop is entirely the geometry between layers. That additive “geometric ruler” is the natural prediction tool for any drifted input. Then we fed the frozen NLA activations from finetuned Gemmas — conveniently, the AV and AR checkpoints themselves are drifted Gemmas.
| pair (same text/positions) | NMSE | cos | shared-variance fraction |
|---|---|---|---|
| base vs AR backbone | 0.015 | 0.9925 | 0.74 — MSE-trained backbone barely moved |
| base vs AV model | 0.0501 | 0.975 | 0.13 — SFT+GRPO rebuilt L41 prose encoding |
| input | geometric prediction | measured FVE (own gold) | vs base gold |
|---|---|---|---|
| AV-model L41 | −0.10 | 0.423 | 0.52 |
| AR-backbone L41 | 0.51 | 0.807 | 0.671 |
Cross-family transfer & the ridge baseline
| n train pairs | gemma→qwen | qwen→gemma |
|---|---|---|
| ridge, 1,000 | 0.095 | 0.085 |
| ridge, 5,000 | 0.3 | 0.172 |
| ridge, 20,000 | 0.422 | 0.24 |
| NLA text route (0 paired examples) | 0.592 | 0.528 |
Ridge: 5-fold/held-out CV, unit-normalized, trained on disjoint UFW docs 102k–106k. The W ∈ ℝ^{5376×3584} map is data-starved even at 20k pairs (~4·d_in), and the curve is decelerating.
Inside the autoencoder
Notation: a = injected base vector; b= the AV’s own layer-41 residual at the injection position during its forward pass; c= the AR’s final-block last-token state, before the value head.
| comparison | NMSE | cos | FVE vs gold var |
|---|---|---|---|
| a vs b | 0.189 | 0.906 | −2.28 |
| b vs c | 0.21 | 0.895 | −0.72 |
| a vs c | 0.0134 | 0.9933 | 0.767 |
| a vs value_head(c) — full round trip | 0.013 | 0.9935 | 0.774 |
Limitations & ranked follow-ups
Limitations.Every post-hoc text edit (§2) and cross-AR pairing (§5) is AR-OOD — those FVEs are lower bounds with an unquantified format-mismatch component. Best-of-N selection reuses the scoring AR — mooted by tiny gains, but not clean. §6’s a-vs-b is one position × one borrowed layer index. Samples cluster 5-per-doc, so effective CIs ≈ ±0.01–0.015 FVE. The AR-backbone self-affinity in §4 is unresolved.
- 1Probe curve through the AV stack— linear recoverability of the injected vector at every layer×position; settles consumed-vs-dispersed and is the corrected version of §6’s weakest cell. Decoder twin: when does the AR’s stream snap to a?
- 2Prose-paraphrase rescoring— paraphrase explanations with a third model (preserve meaning, destroy format), re-score: separates semantic content from format-conformity in the 0.775.
- 3gemma-3-27b-pt— instruction-tuning as the largest natural “finetune”; §4b now predicts readable, the additive ruler predicts not; cheap decisive test of the readability radius.
- 4LoRA ladder— (UFW control → alien domain → behavioral persona) with explanation-content diffing — the “does the NLA verbalize what the finetune did” question the paper’s auditing pitch implies; checkpointed FVE-vs-steps curves.
- 5Distant layers (L31/L51)— the layer-transfer experiment where cross-layer cos actually diverges.
- 6Context-length isolation— same position, varied prefix length: how much of the vector is local vs global context.