← NLA experimentsNatural Language Autoencoders · follow-up series

Probing the released Gemma-3-27B NLA: corroboration, ablations, transfer

The full writeup of the series — three independent corroborations of the release, the anatomy of the explanation code, the cross-family interlingua result, and a verdict on the OpenInterp critique.

date: 2026-06-10/11
hardware: 2×H200 141GB · 2TB RAM
serving: SGLang 0.5.9 + NLA patches · fa3 · no radix cache
eval: 1000 acts · UFW en 100k–100.2k · seed 42 · FVE

0.775

Gemma FVE (ours)

released 0.763 — consistent

0.752

Qwen FVE (ours)

released 0.752 — exact

0.0575

eval Var(v_nrm)

train 0.0579 — third decimal

0.013022

greedy NMSE, twice

bit-reproducible decode

§1

Protocol & three-way corroboration

reproduced

One fixed evaluation set throughout: 1000 activations from Ultra-FineWeb endocuments 100000–100200 (held out — the released checkpoints trained on docs 0–100000), sampled with the repo’s own stage-0 logic (per-doc keyed RNG, seed 42, 5 positions/doc, min position 50), extracted at layer 41 in the training convention — output of decoder block 41 = HF hidden_states[42]. Scoring follows the released recipe exactly: greedy AV decode, explanation extracted as in the training reward, AR forward at the last token, both vectors L2-normalized to √d, NMSE = 2(1−cos).

One recurring caveat: L41 is a low-variance layer (Var ≈ 0.058). Cosine compresses into 0.95–0.99 for everything from near-perfect to actively-poisoned; FVE is the discriminative metric throughout.

metric	ours (UFW held-out)	released
fve_nrm (train denom)	0.775	0.763 (50/50 WildChat+UFW)
fve_nrm (own denom)	0.774	—
NMSE mean / median	0.0130 / 0.0111	—
eval-set Var(v_nrm)	0.0575	0.0579 (train)
explanation-tag failures	0/1000	—

0.775 vs 0.763 is consistency, not exact replication — no WildChat half in our eval; mid-document UFW prose is the clean end of their distribution.

Corroboration —Three independent checks pass: the Gemma headline FVE (0.775 vs 0.763), the variance baseline matching to the third decimal (0.0575 vs 0.0579) — strong evidence the entire extraction pipeline (layer indexing, tokenization, raw-vector convention) matches theirs — and the Qwen pair reproducing its released number exactly: 0.752 vs 0.752 (§5). Greedy decoding is bit-reproducible: NMSE 0.013022 on both runs.

§2

Quote ablations & the verbatim check

compact recap

AV explanations lean heavily on quotation-formatted spans. Banning quote tokens at generation time (SGLang logit_bias −1e9; 0 banned-character leaks, fluency intact) tests whether fidelity is smuggled through verbatim copying.

condition	NMSE	cos	FVE (train denom)
unconstrained baseline	0.0130	0.9935	0.775
ban "-family tokens (975 of 262k vocab)	0.0141	0.9930	0.757
ban "+' families (1651 tokens)	0.0222	0.9889	0.616
└ hatch-free subset (n=789)	0.0230	0.9885	0.603
└ hatch-using subset (n=209)	0.0195	0.9903	0.664

Ban only double quotes and the policy substitutes: 982/1000 decodes switched to '…'. Ban both families and 21% still reach for corner brackets 「」, *italic spans*, parenthesized verbatim, or em-dashes.

Finding —Genuine paraphrase-only decoding holds 0.603 — ~78% of reconstruction fidelity survives without any quoting mechanism; quoting accounts for ~0.17 FVE. A lower bound: the AR was trained on quote-ful explanations, so part of the drop is AR-side format OOD.

And the quotes were never verbatim anyway. Longest contiguous word n-gram shared between each explanation and its actual source context, vs a shuffled-context null:

decode set	median run	p90	≥6-word runs	null ≥6
quote-free (round 2, hatch-free)	0	4	1%	0%
quote-using (round 1, ' substitution)	0	4	2%	0%

Worst offender across 789 decodes: one 7-word boilerplate run.

Finding —Even when quoting freely, the AV almost never reproduces the true context verbatim (median: not even a shared 3-gram). Architecturally obvious in hindsight — the AV never sees text, only one 5376-dim vector. Every “quote” is a confabulated reconstruction dressed in quotation marks; there is no copy channel. The quote ban above is a format ablation, not a content ablation.

§3

Post-hoc redaction: where the signal lives

AR input surgery

The AV decodes normally (fully in-distribution); the AR’s input is edited before scoring. Mean 9.4 quoted spans per decode, ~23% of explanation characters.

AR input variant	NMSE	cos	FVE	fve<0 samples
full explanation	0.0130	0.9935	0.775	0.5%
spans + quote marks deleted (prose only)	0.0633	0.9684	−0.09	57%
spans → ""	0.0781	0.9610	−0.35	81%
spans → …	0.0883	0.9559	−0.53	92%
spans → REDACTED	0.1019	0.9491	−0.76	96%
quoted spans only, marks kept	0.0286	0.9857	0.506	6%
quoted spans only, bare text	0.0465	0.9768	0.196	21%

All post-hoc rows are lower bounds — every edit is AR-side OOD.

Finding —Prose alone carries ≈ zero standalone signal (−0.09 ≈ predict-the-mean) despite being 77% of the characters; junkier fillers actively misdirect the AR (monotone −0.09 → −0.76). Spans alone recover 0.506 — and 0.31 of that is the quote marks themselves(0.506 with marks vs 0.196 bare): the AR learned the delimiters as type-markers. The halves are entangled, not payload+decoration — prose ≈ 0 alone, spans = 0.506 alone, together 0.775; the prose’s ~0.27 contribution is relational glue. Contrast with §2: removed at generation time, the AV re-packs content into prose (0.616); removed post-hoc, the content is simply gone.

§4

Best-of-N & adjacent layers

compact recap

Ten temp-1.0 samples per activation, best-of-prefix-N selected by the same AR that defines the metric — the training reward used as a test-time verifier. Zero tag failures in 10,000 sampled decodes.

selection	NMSE	cos	FVE
greedy	0.0130	0.9935	0.775
1 sample (t=1.0)	0.0132	0.9934	0.772
best of 2 / 3 / 5	0.0125 / 0.0122 / 0.0120	—	0.784 / 0.789 / 0.794
best of 10	0.0116	0.9942	0.799
mean over all 10 samples	0.0132	—	0.772

Finding —A single sample ≈ greedy; ten-way selection buys +0.027 total, decaying log-linearly (~+0.01/doubling). GRPO (8 samples/prompt in training) already squeezed out the selection slack; the self-dealing concern is mooted by the tiny gains. The residual ~0.20 of unexplained variance is structural, not sampling noise — these checkpoints operate near their own ceiling, and more AV test-time compute is the wrong axis.

input	cos to L41	NMSE	FVE (own denom)
L41 (trained)	—	0.0130	0.774
L40	0.9970	0.0167	0.695
L42	0.9971	0.0187	0.703

Raw norms (56.0k–58.3k) stay within the injection_scale=60000 distribution. Caveat: at cos 0.997 the neighbors barely differ informationally; ±10-block jumps would be the sharper test (not run).

Finding —Neither observed NMSE exceeds the additive bound (intrinsic 0.013 + inter-layer offset 0.006 = 0.019): the entire FVE drop is accounted for by the geometry between layers. Input-side robustness contrasts sharply with the text-side brittleness of §3, where quote marks alone are worth 0.31 FVE.

§5

Cross-family transfer & the ridge baseline

Gemma-27B ⇄ Qwen-7B

Both models’ activations extracted at the same document positions, aligned across tokenizers by character offset (mean alignment error 0.32 chars). Each cell scores one family’s AV explanations through the other family’s AR, against the AR-side family’s own gold; own-denominator FVE, shuffled-pair nulls.

route	FVE (own denom)	shuffled null
within: AV_gemma → AR_gemma	0.774	−0.76
within: AV_qwen → AR_qwen	0.752 (released 0.752 — exact)	−0.91
cross: AV_gemma → AR_qwen	0.592	−0.89
cross: AV_qwen → AR_gemma	0.528	−0.66

Asymmetry: the 27B AV writes richer explanations that even a foreign AR can use; Gemma-L41 is also the harder target (normalized Var 0.0575 vs Qwen's 0.712 — the same angular error costs ~12× more FVE).

Finding —Cross-family transfer reaches 79% of the within-family ceiling (gemma→qwen: 0.592/0.752) and 68% in reverse. The NL-expressible content of residual streams is substantially model-agnostic across architecture, tokenizer, and a 4× scale gap. Deeply negative nulls: the signal is position-specific, not corpus/style prior. Both cross cells are lower bounds (foreign-AV phrasing is mildly OOD for each AR).

How good is that, really? Kernel ridge gemma-L41 → qwen-L20 (and reverse) on unit-normalized vectors, trained on fresh disjoint pairs (UFW docs 102000–106000, 19,995 aligned pairs), evaluated on the same 1000-pair eval set; alpha selected on an internal train split.

Fig. 1 — Ridge FVE vs training-pair count (log x) over the full n-grid 1k/2k/5k/10k/19,995, against the zero-paired-data NLA text route (dashed reference lines). Per-doubling ridge gains decelerate (+0.083 → +0.054 for gemma→qwen); log-linear extrapolation puts the crossover at ~10⁵ pairs for gemma→qwen and ~10⁷ — effectively never — for qwen→gemma.

Finding —The zero-paired-data text route beats a 20k-pair supervised linear map in both directions: the NL bottleneck is worth ≳100k supervised pairs. Caveats: single 1000-pair eval set (±~0.02); the extrapolation is trend-reading; ridge is the linear class only — a small MLP adapter on 20k pairs would likely land between 0.42 and 0.59.

§6

Synthesis

six claims

1 · The release is solid —Three independent corroborations (Gemma FVE, Qwen FVE to the third decimal, variance baselines), bit-reproducible greedy decoding, zero format failures in ~25k decodes, sidecar-as-contract caught no drift. The published numbers also sit near the system’s ceiling (best-of-10 = 0.799): they are honest, not cherry-picked.

2 · A structured document, not free prose —The code the AV/AR pair co-evolved: confabulated context-reconstructions in quote-delimited spans (the payload), quote marks as learned type-markers (0.31 FVE by themselves), descriptive prose as relational glue (~zero standalone, ~0.27 in combination). Fidelity metrics validate this whole document; human readers mostly read the prose — the part with the least standalone signal. Faithfulness of the descriptive claims is not what fve_nrm certifies.

3 · Content decoding is real and dominant —Shuffled-format-identical explanations score −0.76 to −0.91; cross-family transfer through a foreign AR reaches 0.59. Any “it’s just format matching” account fails both tests.

4 · Robustness is asymmetric —Graceful on the vector side — adjacent layers cost exactly the geometric offset — brittle on the text side: delimiters and format conventions carry disproportionate weight.

5 · The policy is converged —Test-time selection headroom ≈ 0.027 FVE. Improvements must come from training/architecture (longer explanations, agent-trace data, bigger AR), not sampling.

6 · Language is a strong interlingua —Natural language beats 20k-pair supervised linear alignment as a cross-model representation transporter, with zero paired training data.

§7

The OpenInterp “two-tier verbalization” critique

assessed against our data

The paper (openinterp.org, same kitft checkpoints) claims fve_nrm measures only format selection (“Tier 1”), not content (“Tier 2”), based on flat fve (~0.98) vs 6–9× keyword-recall spread across 150 chat-template-final-token activations. Our measurements split the critique cleanly into a part worth keeping and a part that doesn’t survive contact.

Fig. 2 — The content question in one chart. If fve_nrm measured only format selection, format-identical shuffled explanations would score near the true ones — instead they lose ~1.5 FVE, landing at −0.76 to −0.91; and a foreign-family AR recovers 0.53–0.59 from the text alone. Note OpenInterp's own eval slice sits at fve ≈ 0.98 vs the released 0.763 / our 0.775 — a saturated single-position slice, where flatness at a compressed ceiling is unsurprising.

Keep —The coverage finding is real and useful— agent-task activations decode at the noise floor because the training mix (WildChat+UFW) contains no agent traces, and fve_nrm won’t warn you out-of-distribution. The template-attractor observation is realtoo: we see the “Educational/structured article format…” opener everywhere. Worth adopting: stratified content metrics, random-Gaussian baselines, agent-trace training data.

Discard —The headline claim “cannot content-decode” is refuted— by our shuffled nulls (format-identical permuted explanations lose 1.5 FVE), by cross-family transfer, and by the paper’s own growing permutation gaps. Their fve ≈ 0.98 indicates a saturated single-position eval slice; flatness at a compressed ceiling is then unsurprising. The two-tier grand theory goes with it.

§8

Reproducibility inventory

artifacts & gotchas

artifact	path
eval vectors (gemma L41 / L40 / L42, qwen L20)	fve_eval/vecs_1000*.npy + meta_1000.json
decodes (baseline, quote-bans, best-of-N, qwen, layers)	fve_eval/decodes_*.json
per-experiment results	fve_eval/fve__results.json, ridge_.json
ridge training pairs (2×~480MB)	fve_eval/ridge_train_{gemma,qwen}.npy
scripts	fve_fast.py · fve_noquote{,2}.py · fve_redacted.py · fve_bestofn.py · fve_layers.py · fve_crossfamily.py · ridge_scaling.py · ridge_full_n.py · extract_activations.py · run_nla.py · convert_to_bf16.py

Operational gotchas encountered, for whoever re-runs this:

fp32 → bf16released 27B AV ships fp32 — convert before serving (convert_to_bf16.py)
off-by-onerepo layer_index=K = HF hidden_states[K+1]; the README quick-demo is one block early
patch anchor driftSGLang 0.5.9 drifts one NLA patch anchor (the retract fix) — hand-adapt it
attention backendGemma needs --attention-backend fa3 (plus --disable-radix-cache)
payload prebuildingprebuild request payloads once per vector — per-request embed building starves the async client at 10k-request scale
HF_HOMEHF token lives at ~/.cache/huggingface/token but HF_HOME points to /workspace/.hf_home — copy it
ephemeral boxworkspace_is_volume: false — nothing on the instance survives recycle/destroy; push artifacts off-box