Natural Language Autoencoders · follow-up series
04

Probing the released Gemma-3-27B NLA: corroboration, ablations, transfer

The full writeup of the series — three independent corroborations of the release, the anatomy of the explanation code, the cross-family interlingua result, and a verdict on the OpenInterp critique.

date: 2026-06-10/11
hardware: 2×H200 141GB · 2TB RAM
serving: SGLang 0.5.9 + NLA patches · fa3 · no radix cache
eval: 1000 acts · UFW en 100k–100.2k · seed 42 · FVE

Gemma FVE (ours): 0.775 (released 0.763, consistent)
Qwen FVE (ours): 0.752 (released 0.752, exact)
eval Var(v_nrm): 0.0575 (train 0.0579, agrees to the third decimal)
greedy NMSE, twice: 0.013022 (bit-reproducible decode)
§1 · Protocol & three-way corroboration

One fixed evaluation set is used throughout: 1000 activations from Ultra-FineWeb en documents 100000–100200 (held out; the released checkpoints trained on docs 0–100000), sampled with the repo’s own stage-0 logic (per-doc keyed RNG, seed 42, 5 positions/doc, min position 50) and extracted at layer 41 in the training convention: output of decoder block 41 = HF hidden_states[42]. Scoring follows the released recipe exactly: greedy AV decode, explanation extracted as in the training reward, AR forward at the last token, both vectors rescaled to L2 norm √d, NMSE = 2(1−cos).
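The scoring recipe above can be sketched in a few lines of NumPy. This is a minimal illustration, not the released scoring code: the function names are hypothetical, and the definition of Var(v_nrm) as the per-dimension variance of the normalized gold vectors, averaged over dimensions, is an assumption chosen because it reproduces the reported numbers (1 − 0.0130/0.0579 ≈ 0.775).

```python
import numpy as np

def normalize_to_sqrt_d(x: np.ndarray) -> np.ndarray:
    """Rescale each row so its L2 norm is sqrt(d)."""
    d = x.shape[-1]
    return x * (np.sqrt(d) / np.linalg.norm(x, axis=-1, keepdims=True))

def nmse_per_sample(pred: np.ndarray, gold: np.ndarray) -> np.ndarray:
    """Per-sample mean squared error after normalization; equals 2 * (1 - cos)."""
    p, g = normalize_to_sqrt_d(pred), normalize_to_sqrt_d(gold)
    return ((p - g) ** 2).mean(axis=-1)

def variance_of_normalized(gold: np.ndarray) -> float:
    """Assumed definition of Var(v_nrm): per-dimension variance, averaged over dimensions."""
    return float(normalize_to_sqrt_d(gold).var(axis=0).mean())

def fve(pred: np.ndarray, gold: np.ndarray, denom=None) -> float:
    """Fraction of variance explained; pass denom=0.0579 for the 'train denom' variant."""
    if denom is None:  # 'own denom': variance of this evaluation set
        denom = variance_of_normalized(gold)
    return 1.0 - float(nmse_per_sample(pred, gold).mean()) / denom
```

Passing `denom` explicitly corresponds to the "train denom" rows in the tables; leaving it unset corresponds to "own denom".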

One recurring caveat: L41 is a low-variance layer (Var ≈ 0.058). Cosine compresses everything from near-perfect to actively poisoned reconstructions into the 0.95–0.99 band; FVE is the discriminative metric throughout.
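A short worked example shows why cosine is uninformative here. It assumes FVE = 1 − NMSE/Var, which is not stated explicitly in the writeup but matches its reported numbers, and uses the train denominator 0.0579 with two cosine values taken from the tables in this writeup.

```latex
\begin{aligned}
\mathrm{FVE} &= 1 - \frac{\mathrm{NMSE}}{\mathrm{Var}} = 1 - \frac{2\,(1-\cos)}{\mathrm{Var}} \\
\cos = 0.9935 &\;\Rightarrow\; \mathrm{FVE} \approx 1 - \frac{0.0130}{0.0579} \approx 0.775 \\
\cos = 0.9491 &\;\Rightarrow\; \mathrm{FVE} \approx 1 - \frac{0.1018}{0.0579} \approx -0.76 \\
\mathrm{FVE} = 0 &\iff \cos = 1 - \tfrac{\mathrm{Var}}{2} \approx 0.971
\end{aligned}
```

Under that assumption, any reconstruction with cosine below roughly 0.971 is worse than predicting the mean vector, even though 0.97 looks like a near-perfect match.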

| metric | ours (UFW held-out) | released |
|---|---|---|
| fve_nrm (train denom) | 0.775 | 0.763 (50/50 WildChat+UFW) |
| fve_nrm (own denom) | 0.774 | |
| NMSE mean / median | 0.0130 / 0.0111 | |
| eval-set Var(v_nrm) | 0.0575 | 0.0579 (train) |
| explanation-tag failures | 0/1000 | |

0.775 vs 0.763 is consistency, not exact replication — no WildChat half in our eval; mid-document UFW prose is the clean end of their distribution.

Corroboration: Three independent checks pass. The Gemma headline FVE agrees (0.775 vs 0.763); the variance baseline matches to the third decimal (0.0575 vs 0.0579), strong evidence that the entire extraction pipeline (layer indexing, tokenization, raw-vector convention) matches theirs; and the Qwen pair reproduces its released number exactly, 0.752 vs 0.752 (§5). Greedy decoding is bit-reproducible: NMSE 0.013022 on both runs.
§2 · Quote ablations & the verbatim check

AV explanations lean heavily on quotation-formatted spans. Banning quote tokens at generation time (SGLang logit_bias −1e9; 0 banned-character leaks, fluency intact) tests whether fidelity is smuggled through verbatim copying.
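A token ban of this kind can be built by scanning the vocabulary for any token whose surface string contains a quote character. The sketch below is illustrative only: the character families, the function name, and the toy vocabulary are assumptions, and it stops at building the id-to-bias mapping rather than guessing how the patched SGLang server expects it to be passed.

```python
# Assumed character families; the writeup does not list the exact characters it banned.
DOUBLE_QUOTE_CHARS = set('"\u201c\u201d\u201e\u00ab\u00bb')
SINGLE_QUOTE_CHARS = set("'\u2018\u2019`")

def build_quote_ban(vocab: dict[str, int], banned_chars: set[str],
                    bias: float = -1e9) -> dict[int, float]:
    """Map every token id whose decoded string contains a banned character to a large negative bias."""
    return {
        tok_id: bias
        for tok, tok_id in vocab.items()
        if any(ch in banned_chars for ch in tok)
    }

# Toy vocabulary (hypothetical) standing in for the real 262k-entry one.
toy_vocab = {"hello": 0, '"': 1, ' "the': 2, "'s": 3, "\u201c": 4, "plain": 5}
double_only = build_quote_ban(toy_vocab, DOUBLE_QUOTE_CHARS)
both_families = build_quote_ban(toy_vocab, DOUBLE_QUOTE_CHARS | SINGLE_QUOTE_CHARS)
```

With a real tokenizer the keys should be the decoded text of each token id, since raw vocabulary entries may use byte-level or whitespace-marker encodings that hide the quote character.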

| condition | NMSE | cos | FVE (train denom) |
|---|---|---|---|
| unconstrained baseline | 0.0130 | 0.9935 | 0.775 |
| ban "-family tokens (975 of 262k vocab) | 0.0141 | 0.9930 | 0.757 |
| ban "+' families (1651 tokens) | 0.0222 | 0.9889 | 0.616 |
| └ hatch-free subset (n=789) | 0.0230 | 0.9885 | 0.603 |
| └ hatch-using subset (n=209) | 0.0195 | 0.9903 | 0.664 |

Ban only double quotes and the policy substitutes: 982/1000 decodes switched to '…'. Ban both families and 21% still reach for an escape hatch: corner brackets 「」, *italic spans*, parenthesized verbatim, or em-dashes.

Finding: Genuine paraphrase-only decoding holds 0.603: ~78% of reconstruction fidelity survives without any quoting mechanism, and quoting accounts for ~0.17 FVE. The 0.603 is a lower bound: the AR was trained on quote-ful explanations, so part of the drop is AR-side format OOD.

And the quotes were never verbatim anyway. Longest contiguous word n-gram shared between each explanation and its actual source context, vs a shuffled-context null:

| decode set | median run | p90 | ≥6-word runs | null ≥6 |
|---|---|---|---|---|
| quote-free (round 2, hatch-free) | 0 | 4 | 1% | 0% |
| quote-using (round 1, ' substitution) | 0 | 4 | 2% | 0% |

Worst offender across 789 decodes: one 7-word boilerplate run.
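The verbatim check above amounts to a longest-common-substring computation over word tokens, repeated against shuffled contexts for the null. The sketch below is one way to do it; the word tokenization (lowercased `\w+` runs) and the function names are assumptions, not the method used for the table.

```python
import random
import re

def words(text: str) -> list[str]:
    """Lowercased word tokens; an assumed tokenization."""
    return re.findall(r"\w+", text.lower())

def longest_shared_run(a: list[str], b: list[str]) -> int:
    """Length of the longest contiguous word sequence present in both a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for wa in a:
        cur = [0] * (len(b) + 1)
        for j, wb in enumerate(b, start=1):
            if wa == wb:
                cur[j] = prev[j - 1] + 1  # extend the run ending at the previous words
                best = max(best, cur[j])
        prev = cur
    return best

def runs_vs_null(explanations: list[str], contexts: list[str], seed: int = 0):
    """Longest shared run per (explanation, true context) pair, and per shuffled pair."""
    true_runs = [longest_shared_run(words(e), words(c))
                 for e, c in zip(explanations, contexts)]
    shuffled = list(contexts)
    random.Random(seed).shuffle(shuffled)  # may leave a few items in place
    null_runs = [longest_shared_run(words(e), words(c))
                 for e, c in zip(explanations, shuffled)]
    return true_runs, null_runs
```

The dynamic program is quadratic in the two word counts, which is cheap for explanation-length texts; a random shuffle can leave a few contexts in place, so a strict null would resample until no index maps to itself.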

Finding: Even when quoting freely, the AV almost never reproduces the true context verbatim (median: not even a shared 3-gram). This is architecturally obvious in hindsight: the AV never sees text, only one 5376-dim vector. Every “quote” is a confabulated reconstruction dressed in quotation marks; there is no copy channel. The quote ban above is a format ablation, not a content ablation.
§3 · Post-hoc redaction: where the signal lives

The AV decodes normally (fully in-distribution); the AR’s input is edited before scoring. Mean 9.4 quoted spans per decode, ~23% of explanation characters.

| AR input variant | NMSE | cos | FVE | fve<0 samples |
|---|---|---|---|---|
| full explanation | 0.0130 | 0.9935 | 0.775 | 0.5% |
| spans + quote marks deleted (prose only) | 0.0633 | 0.9684 | −0.09 | 57% |
| spans → "" | 0.0781 | 0.9610 | −0.35 | 81% |
| spans → … | 0.0883 | 0.9559 | −0.53 | 92% |
| spans → REDACTED | 0.1019 | 0.9491 | −0.76 | 96% |
| quoted spans only, marks kept | 0.0286 | 0.9857 | 0.506 | 6% |
| quoted spans only, bare text | 0.0465 | 0.9768 | 0.196 | 21% |

All post-hoc rows are lower bounds — every edit is AR-side OOD.
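The AR-input variants in the table could be generated with a single regular expression over each explanation. The sketch below is a guess at the mechanics, not the contents of fve_redacted.py: the span pattern (straight or curly double quotes only), the variant names, and the choice to join spans with single spaces are all assumptions.

```python
import re

# Assumed span definition: text between straight or curly double quotes on one line.
SPAN = re.compile(r'"[^"\n]*"|“[^”\n]*”')

def redaction_variants(explanation: str) -> dict[str, str]:
    """Build each edited AR input from one AV explanation."""
    spans = SPAN.findall(explanation)
    return {
        "full": explanation,
        "prose_only": SPAN.sub("", explanation),        # spans and marks deleted
        "empty_quotes": SPAN.sub('""', explanation),    # spans -> ""
        "ellipsis": SPAN.sub("…", explanation),         # spans -> …
        "redacted": SPAN.sub("REDACTED", explanation),  # spans -> REDACTED
        "spans_with_marks": " ".join(spans),            # quoted spans only, marks kept
        "spans_bare": " ".join(s[1:-1] for s in spans), # quoted spans only, bare text
    }

example = redaction_variants(
    'A recipe article mentioning "two cups of flour" and “bake for 20 minutes”.'
)
```

Single-quoted spans are left out on purpose, since apostrophes make them ambiguous without extra rules.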

Finding: Prose alone carries ≈ zero standalone signal (−0.09 ≈ predict-the-mean) despite being 77% of the characters; junkier fillers actively misdirect the AR (monotone −0.09 → −0.76). Spans alone recover 0.506, and 0.31 of that is the quote marks themselves (0.506 with marks vs 0.196 bare): the AR learned the delimiters as type-markers. The halves are entangled, not payload+decoration: prose ≈ 0 alone, spans = 0.506 alone, together 0.775; the prose’s ~0.27 contribution is relational glue. Contrast with §2: when quoting is removed at generation time, the AV re-packs content into prose (0.616); when it is removed post-hoc, the content is simply gone.
§4 · Best-of-N & adjacent layers

Ten temp-1.0 samples per activation, best-of-prefix-N selected by the same AR that defines the metric — the training reward used as a test-time verifier. Zero tag failures in 10,000 sampled decodes.
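Best-of-prefix-N selection reduces to a running minimum over the first N sampled NMSE values per activation. A minimal sketch, assuming the per-sample scores are already in an array of shape (activations, samples); the function name is hypothetical.

```python
import numpy as np

def best_of_prefix(nmse: np.ndarray, ns=(1, 2, 3, 5, 10)) -> dict[int, float]:
    """nmse[i, k] is the AR-scored NMSE of sample k for activation i.

    For each n, take the best (lowest) NMSE among the first n samples of each
    activation, then average over activations.
    """
    return {n: float(nmse[:, :n].min(axis=1).mean()) for n in ns}

# Tiny illustrative array: 2 activations, 3 samples each.
toy = np.array([[3.0, 1.0, 2.0],
                [5.0, 6.0, 4.0]])
toy_result = best_of_prefix(toy, ns=(1, 2, 3))
```

Using prefixes of one fixed sample order, rather than fresh draws per N, makes the curve non-increasing in N by construction.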

| selection | NMSE | cos | FVE |
|---|---|---|---|
| greedy | 0.0130 | 0.9935 | 0.775 |
| 1 sample (t=1.0) | 0.0132 | 0.9934 | 0.772 |
| best of 2 / 3 / 5 | 0.0125 / 0.0122 / 0.0120 | | 0.784 / 0.789 / 0.794 |
| best of 10 | 0.0116 | 0.9942 | 0.799 |
| mean over all 10 samples | 0.0132 | | 0.772 |

Finding: A single sample ≈ greedy; ten-way selection buys +0.027 total, growing roughly log-linearly (~+0.01 per doubling). GRPO (8 samples/prompt in training) already squeezed out the selection slack; the self-dealing concern is mooted by the tiny gains. The residual ~0.20 of unexplained variance is structural, not sampling noise: these checkpoints operate near their own ceiling, and more AV test-time compute is the wrong axis.

| input | cos to L41 | NMSE | FVE (own denom) |
|---|---|---|---|
| L41 (trained) | | 0.0130 | 0.774 |
| L40 | 0.9970 | 0.0167 | 0.695 |
| L42 | 0.9971 | 0.0187 | 0.703 |

Raw norms (56.0k–58.3k) stay within the injection_scale=60000 distribution. Caveat: at cos 0.997 the neighbors barely differ informationally; ±10-block jumps would be the sharper test (not run).

Finding: Neither observed NMSE exceeds the additive bound (intrinsic 0.013 + inter-layer offset 2(1−0.997) ≈ 0.006, total 0.019): the entire FVE drop is accounted for by the geometry between layers. Input-side robustness contrasts sharply with the text-side brittleness of §3, where quote marks alone are worth 0.31 FVE.
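One way to read the additive figure, offered as an interpretation rather than the author's derivation: write r for the reconstruction and v_40, v_41 for the normalized activations at the two layers, and assume r sits about as far from v_41 as an in-distribution reconstruction would. The squared error to v_40 then splits into three terms, and the sum 0.019 follows if the cross term is close to zero.

```latex
\begin{aligned}
\frac{\lVert r - v_{40} \rVert^2}{d}
 &= \frac{\lVert r - v_{41} \rVert^2}{d}
  + \frac{\lVert v_{41} - v_{40} \rVert^2}{d}
  + \frac{2\,\langle r - v_{41},\; v_{41} - v_{40} \rangle}{d} \\
 &\approx 0.013 + 2\,(1 - 0.997) = 0.013 + 0.006 = 0.019
 \qquad \text{(cross term assumed} \approx 0\text{)} \\
\frac{\lVert r - v_{40} \rVert^2}{d}
 &\le \left(\sqrt{0.013} + \sqrt{0.006}\right)^2 \approx 0.037
 \qquad \text{(triangle inequality, no assumption on the cross term)}
\end{aligned}
```

On this reading 0.019 is an expected value under roughly orthogonal errors rather than a strict ceiling; the strict ceiling is the looser 0.037, and both observed values (0.0167, 0.0187) fall under either figure.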
§5 · Cross-family transfer & the ridge baseline

Both models’ activations extracted at the same document positions, aligned across tokenizers by character offset (mean alignment error 0.32 chars). Each cell scores one family’s AV explanations through the other family’s AR, against the AR-side family’s own gold; own-denominator FVE, shuffled-pair nulls.
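Aligning positions across two tokenizers by character offset can be done with a binary search over token end offsets. The sketch below assumes alignment on token end positions and nearest-offset matching; neither detail is specified above, and the function name is hypothetical. Offsets of this kind are available from Hugging Face fast tokenizers via `return_offsets_mapping=True`.

```python
from bisect import bisect_left

def align_by_char_offset(pos_a: int, ends_a: list[int], ends_b: list[int]) -> tuple[int, int]:
    """Find the token in tokenizer B whose end offset is nearest to token pos_a's end in A.

    ends_a / ends_b: sorted character end offsets of each token in the same document.
    Returns (index in B, absolute alignment error in characters).
    """
    target = ends_a[pos_a]
    j = bisect_left(ends_b, target)  # first B token ending at or after the target
    candidates = [k for k in (j - 1, j) if 0 <= k < len(ends_b)]
    best = min(candidates, key=lambda k: abs(ends_b[k] - target))
    return best, abs(ends_b[best] - target)

# Toy offsets (hypothetical) for one short document.
ends_gemma = [3, 7, 12, 20]
ends_qwen = [2, 7, 9, 13, 20]
match = align_by_char_offset(2, ends_gemma, ends_qwen)
```

Averaging the second element of the returned pair over all sampled positions gives a mean alignment error comparable in kind to the 0.32 characters reported above.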

| route | FVE (own denom) | shuffled null |
|---|---|---|
| within: AV_gemma → AR_gemma | 0.774 | −0.76 |
| within: AV_qwen → AR_qwen | 0.752 (released 0.752, exact) | −0.91 |
| cross: AV_gemma → AR_qwen | 0.592 | −0.89 |
| cross: AV_qwen → AR_gemma | 0.528 | −0.66 |

Asymmetry: the 27B AV writes richer explanations that even a foreign AR can use; Gemma-L41 is also the harder target (normalized Var 0.0575 vs Qwen's 0.712 — the same angular error costs ~12× more FVE).

Finding: Cross-family transfer reaches 79% of the within-family ceiling (gemma→qwen: 0.592/0.752) and 68% in reverse (qwen→gemma: 0.528/0.774). The NL-expressible content of residual streams is substantially model-agnostic across architecture, tokenizer, and a 4× scale gap. The deeply negative nulls show the signal is position-specific, not a corpus/style prior. Both cross cells are lower bounds (foreign-AV phrasing is mildly OOD for each AR).
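A shuffled-pair null of the kind reported in the table can be sketched by scoring each reconstruction against a randomly chosen other position's gold vector. This is a minimal illustration with hypothetical function names, using the same assumed FVE definition (1 − NMSE/Var on vectors normalized to norm √d).

```python
import numpy as np

def _to_sqrt_d(x: np.ndarray) -> np.ndarray:
    """Rescale each row to L2 norm sqrt(d)."""
    return x * (np.sqrt(x.shape[-1]) / np.linalg.norm(x, axis=-1, keepdims=True))

def fve_own_denom(pred: np.ndarray, gold: np.ndarray) -> float:
    """FVE against the evaluation set's own variance (assumed definition)."""
    p, g = _to_sqrt_d(pred), _to_sqrt_d(gold)
    mse = ((p - g) ** 2).mean()
    var = g.var(axis=0).mean()
    return float(1.0 - mse / var)

def shuffled_null_fve(pred: np.ndarray, gold: np.ndarray, seed: int = 0) -> float:
    """Same score after permuting which reconstruction is paired with which gold vector."""
    perm = np.random.default_rng(seed).permutation(len(pred))
    return fve_own_denom(pred[perm], gold)
```

For independently paired vectors the expected squared error is the sum of the two variances plus any mean offset, so a null near −1 is what reconstructions with gold-like spread but no position-specific information would score.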

How good is that, really? Kernel ridge gemma-L41 → qwen-L20 (and reverse) on unit-normalized vectors, trained on fresh disjoint pairs (UFW docs 102000–106000, 19,995 aligned pairs), evaluated on the same 1000-pair eval set; alpha selected on an internal train split.
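A ridge baseline with alpha chosen on an internal split can be sketched in plain NumPy. This is a simplified stand-in, not ridge_scaling.py: it solves the primal linear ridge problem, which gives the same predictions as kernel ridge with a linear kernel and no intercept on the centered data, and the alpha grid, split fraction, and function names are assumptions.

```python
import numpy as np

def unit_rows(x: np.ndarray) -> np.ndarray:
    """Unit-normalize each row."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def fit_ridge(X: np.ndarray, Y: np.ndarray, alpha: float):
    """Closed-form ridge on centered data: (Xc^T Xc + alpha I) W = Xc^T Yc."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ Yc)
    return W, mx, my

def predict(model, X: np.ndarray) -> np.ndarray:
    W, mx, my = model
    return (X - mx) @ W + my

def fve(pred: np.ndarray, gold: np.ndarray) -> float:
    return float(1.0 - ((pred - gold) ** 2).mean() / gold.var(axis=0).mean())

def ridge_with_alpha_search(X, Y, alphas=(1e-3, 1e-2, 1e-1, 1.0, 10.0),
                            val_frac=0.2, seed=0):
    """Pick alpha on a held-out slice of the training pairs, then refit on all of them."""
    X, Y = unit_rows(X), unit_rows(Y)
    idx = np.random.default_rng(seed).permutation(len(X))
    n_val = int(len(X) * val_frac)
    val, tr = idx[:n_val], idx[n_val:]
    scores = {a: fve(predict(fit_ridge(X[tr], Y[tr], a), X[val]), Y[val])
              for a in alphas}
    best = max(scores, key=scores.get)
    return fit_ridge(X, Y, best), best
```

Evaluation inputs need the same `unit_rows` normalization before `predict`. At d = 5376 and about 20k pairs, the primal solve (a d×d system) is the cheaper of the two equivalent forms.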

Fig. 1 Ridge FVE vs training-pair count (log x) over the full n-grid 1k/2k/5k/10k/19,995, against the zero-paired-data NLA text route (dashed reference lines). Per-doubling ridge gains decelerate (+0.083 → +0.054 for gemma→qwen); log-linear extrapolation puts the crossover at ~10⁵ pairs for gemma→qwen and ~10⁷ — effectively never — for qwen→gemma.
Finding: The zero-paired-data text route beats a 20k-pair supervised linear map in both directions: the NL bottleneck is worth ≳100k supervised pairs. Caveats: single 1000-pair eval set (±~0.02); the extrapolation is trend-reading; ridge is the linear class only, and a small MLP adapter on 20k pairs would likely land between 0.42 and 0.59.
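The crossover estimate is a straight-line fit of ridge FVE against log2 of the pair count, solved for the point where the line reaches the text-route FVE. A minimal sketch with a hypothetical function name; the measured ridge curve lives in the figure and the ridge_*.json files and is deliberately not reproduced here.

```python
import numpy as np

def crossover_pairs(n_pairs, ridge_fve, text_route_fve: float) -> float:
    """Fit ridge FVE as a line in log2(n) and return the n where it meets text_route_fve.

    Returns infinity if the fitted slope is not positive.
    """
    slope, intercept = np.polyfit(np.log2(n_pairs), ridge_fve, deg=1)
    if slope <= 0:
        return float("inf")
    return float(2.0 ** ((text_route_fve - intercept) / slope))
```

Because per-doubling gains are reported as decelerating, a single line through all grid points is, if anything, optimistic for ridge: the true crossover would sit at or beyond the fitted value.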
§6 · Synthesis

1 · The release is solid. Three independent corroborations (Gemma FVE, Qwen FVE to the third decimal, variance baselines), bit-reproducible greedy decoding, zero format failures in ~25k decodes, and sidecar-as-contract caught no drift. The published numbers also sit near the system’s ceiling (best-of-10 = 0.799): they are honest, not cherry-picked.

2 · A structured document, not free prose. The code the AV/AR pair co-evolved: confabulated context-reconstructions in quote-delimited spans (the payload), quote marks as learned type-markers (0.31 FVE by themselves), descriptive prose as relational glue (~zero standalone, ~0.27 in combination). Fidelity metrics validate this whole document; human readers mostly read the prose, the part with the least standalone signal. Faithfulness of the descriptive claims is not what fve_nrm certifies.

3 · Content decoding is real and dominant. Shuffled, format-identical explanations score −0.76 to −0.91; cross-family transfer through a foreign AR reaches 0.59. Any “it’s just format matching” account fails both tests.

4 · Robustness is asymmetric. Graceful on the vector side (adjacent layers cost no more than the geometric offset), brittle on the text side: delimiters and format conventions carry disproportionate weight.

5 · The policy is converged. Test-time selection headroom ≈ 0.027 FVE. Improvements must come from training/architecture (longer explanations, agent-trace data, bigger AR), not sampling.

6 · Language is a strong interlingua. Natural language beats 20k-pair supervised linear alignment as a cross-model representation transporter, with zero paired training data.
§7 · The OpenInterp “two-tier verbalization” critique

The paper (openinterp.org, same kitft checkpoints) claims fve_nrm measures only format selection (“Tier 1”), not content (“Tier 2”), based on flat fve (~0.98) vs 6–9× keyword-recall spread across 150 chat-template-final-token activations. Our measurements split the critique cleanly into a part worth keeping and a part that doesn’t survive contact.

Fig. 2 The content question in one chart. If fve_nrm measured only format selection, format-identical shuffled explanations would score near the true ones — instead they lose ~1.5 FVE, landing at −0.76 to −0.91; and a foreign-family AR recovers 0.53–0.59 from the text alone. Note OpenInterp's own eval slice sits at fve ≈ 0.98 vs the released 0.763 / our 0.775 — a saturated single-position slice, where flatness at a compressed ceiling is unsurprising.
Keep: The coverage finding is real and useful: agent-task activations decode at the noise floor because the training mix (WildChat+UFW) contains no agent traces, and fve_nrm won’t warn you out-of-distribution. The template-attractor observation is real too: we see the “Educational/structured article format…” opener everywhere. Worth adopting: stratified content metrics, random-Gaussian baselines, agent-trace training data.

Discard: The headline claim “cannot content-decode” is refuted by our shuffled nulls (format-identical permuted explanations lose ~1.5 FVE), by cross-family transfer, and by the paper’s own growing permutation gaps. Their fve ≈ 0.98 indicates a saturated single-position eval slice; flatness at a compressed ceiling is then unsurprising. The two-tier grand theory goes with it.
§8 · Reproducibility inventory

| artifact | path |
|---|---|
| eval vectors (gemma L41 / L40 / L42, qwen L20) | fve_eval/vecs_1000*.npy + meta_1000.json |
| decodes (baseline, quote-bans, best-of-N, qwen, layers) | fve_eval/decodes_*.json |
| per-experiment results | fve_eval/fve_*_results.json, ridge_*.json |
| ridge training pairs (2×~480MB) | fve_eval/ridge_train_{gemma,qwen}.npy |
| scripts | fve_fast.py · fve_noquote{,2}.py · fve_redacted.py · fve_bestofn.py · fve_layers.py · fve_crossfamily.py · ridge_scaling.py · ridge_full_n.py · extract_activations.py · run_nla.py · convert_to_bf16.py |

Operational gotchas encountered, for whoever re-runs this: