Probing the released Gemma-3-27B NLA: corroboration, ablations, transfer
The full writeup of the series — three independent corroborations of the release, the anatomy of the explanation code, the cross-family interlingua result, and a verdict on the OpenInterp critique.
- date
- 2026-06-10/11
- hardware
- 2×H200 141GB · 2TB RAM
- serving
- SGLang 0.5.9 + NLA patches · fa3 · no radix cache
- eval
- 1000 acts · UFW en 100k–100.2k · seed 42 · FVE
Protocol & three-way corroboration
One fixed evaluation set throughout: 1000 activations from Ultra-FineWeb endocuments 100000–100200 (held out — the released checkpoints trained on docs 0–100000), sampled with the repo’s own stage-0 logic (per-doc keyed RNG, seed 42, 5 positions/doc, min position 50), extracted at layer 41 in the training convention — output of decoder block 41 = HF hidden_states[42]. Scoring follows the released recipe exactly: greedy AV decode, explanation extracted as in the training reward, AR forward at the last token, both vectors L2-normalized to √d, NMSE = 2(1−cos).
One recurring caveat: L41 is a low-variance layer (Var ≈ 0.058). Cosine compresses into 0.95–0.99 for everything from near-perfect to actively-poisoned; FVE is the discriminative metric throughout.
| metric | ours (UFW held-out) | released |
|---|---|---|
| fve_nrm (train denom) | 0.775 | 0.763 (50/50 WildChat+UFW) |
| fve_nrm (own denom) | 0.774 | — |
| NMSE mean / median | 0.0130 / 0.0111 | — |
| eval-set Var(v_nrm) | 0.0575 | 0.0579 (train) |
| explanation-tag failures | 0/1000 | — |
0.775 vs 0.763 is consistency, not exact replication — no WildChat half in our eval; mid-document UFW prose is the clean end of their distribution.
Quote ablations & the verbatim check
AV explanations lean heavily on quotation-formatted spans. Banning quote tokens at generation time (SGLang logit_bias −1e9; 0 banned-character leaks, fluency intact) tests whether fidelity is smuggled through verbatim copying.
| condition | NMSE | cos | FVE (train denom) |
|---|---|---|---|
| unconstrained baseline | 0.0130 | 0.9935 | 0.775 |
| ban "-family tokens (975 of 262k vocab) | 0.0141 | 0.9930 | 0.757 |
| ban "+' families (1651 tokens) | 0.0222 | 0.9889 | 0.616 |
| └ hatch-free subset (n=789) | 0.0230 | 0.9885 | 0.603 |
| └ hatch-using subset (n=209) | 0.0195 | 0.9903 | 0.664 |
Ban only double quotes and the policy substitutes: 982/1000 decodes switched to '…'. Ban both families and 21% still reach for corner brackets 「」, *italic spans*, parenthesized verbatim, or em-dashes.
And the quotes were never verbatim anyway. Longest contiguous word n-gram shared between each explanation and its actual source context, vs a shuffled-context null:
| decode set | median run | p90 | ≥6-word runs | null ≥6 |
|---|---|---|---|---|
| quote-free (round 2, hatch-free) | 0 | 4 | 1% | 0% |
| quote-using (round 1, ' substitution) | 0 | 4 | 2% | 0% |
Worst offender across 789 decodes: one 7-word boilerplate run.
Post-hoc redaction: where the signal lives
The AV decodes normally (fully in-distribution); the AR’s input is edited before scoring. Mean 9.4 quoted spans per decode, ~23% of explanation characters.
| AR input variant | NMSE | cos | FVE | fve<0 samples |
|---|---|---|---|---|
| full explanation | 0.0130 | 0.9935 | 0.775 | 0.5% |
| spans + quote marks deleted (prose only) | 0.0633 | 0.9684 | −0.09 | 57% |
| spans → "" | 0.0781 | 0.9610 | −0.35 | 81% |
| spans → … | 0.0883 | 0.9559 | −0.53 | 92% |
| spans → REDACTED | 0.1019 | 0.9491 | −0.76 | 96% |
| quoted spans only, marks kept | 0.0286 | 0.9857 | 0.506 | 6% |
| quoted spans only, bare text | 0.0465 | 0.9768 | 0.196 | 21% |
All post-hoc rows are lower bounds — every edit is AR-side OOD.
Best-of-N & adjacent layers
Ten temp-1.0 samples per activation, best-of-prefix-N selected by the same AR that defines the metric — the training reward used as a test-time verifier. Zero tag failures in 10,000 sampled decodes.
| selection | NMSE | cos | FVE |
|---|---|---|---|
| greedy | 0.0130 | 0.9935 | 0.775 |
| 1 sample (t=1.0) | 0.0132 | 0.9934 | 0.772 |
| best of 2 / 3 / 5 | 0.0125 / 0.0122 / 0.0120 | — | 0.784 / 0.789 / 0.794 |
| best of 10 | 0.0116 | 0.9942 | 0.799 |
| mean over all 10 samples | 0.0132 | — | 0.772 |
| input | cos to L41 | NMSE | FVE (own denom) |
|---|---|---|---|
| L41 (trained) | — | 0.0130 | 0.774 |
| L40 | 0.9970 | 0.0167 | 0.695 |
| L42 | 0.9971 | 0.0187 | 0.703 |
Raw norms (56.0k–58.3k) stay within the injection_scale=60000 distribution. Caveat: at cos 0.997 the neighbors barely differ informationally; ±10-block jumps would be the sharper test (not run).
Cross-family transfer & the ridge baseline
Both models’ activations extracted at the same document positions, aligned across tokenizers by character offset (mean alignment error 0.32 chars). Each cell scores one family’s AV explanations through the other family’s AR, against the AR-side family’s own gold; own-denominator FVE, shuffled-pair nulls.
| route | FVE (own denom) | shuffled null |
|---|---|---|
| within: AV_gemma → AR_gemma | 0.774 | −0.76 |
| within: AV_qwen → AR_qwen | 0.752 (released 0.752 — exact) | −0.91 |
| cross: AV_gemma → AR_qwen | 0.592 | −0.89 |
| cross: AV_qwen → AR_gemma | 0.528 | −0.66 |
Asymmetry: the 27B AV writes richer explanations that even a foreign AR can use; Gemma-L41 is also the harder target (normalized Var 0.0575 vs Qwen's 0.712 — the same angular error costs ~12× more FVE).
How good is that, really? Kernel ridge gemma-L41 → qwen-L20 (and reverse) on unit-normalized vectors, trained on fresh disjoint pairs (UFW docs 102000–106000, 19,995 aligned pairs), evaluated on the same 1000-pair eval set; alpha selected on an internal train split.
Synthesis
The OpenInterp “two-tier verbalization” critique
The paper (openinterp.org, same kitft checkpoints) claims fve_nrm measures only format selection (“Tier 1”), not content (“Tier 2”), based on flat fve (~0.98) vs 6–9× keyword-recall spread across 150 chat-template-final-token activations. Our measurements split the critique cleanly into a part worth keeping and a part that doesn’t survive contact.
Reproducibility inventory
| artifact | path |
|---|---|
| eval vectors (gemma L41 / L40 / L42, qwen L20) | fve_eval/vecs_1000*.npy + meta_1000.json |
| decodes (baseline, quote-bans, best-of-N, qwen, layers) | fve_eval/decodes_*.json |
| per-experiment results | fve_eval/fve_*_results.json, ridge_*.json |
| ridge training pairs (2×~480MB) | fve_eval/ridge_train_{gemma,qwen}.npy |
| scripts | fve_fast.py · fve_noquote{,2}.py · fve_redacted.py · fve_bestofn.py · fve_layers.py · fve_crossfamily.py · ridge_scaling.py · ridge_full_n.py · extract_activations.py · run_nla.py · convert_to_bf16.py |
Operational gotchas encountered, for whoever re-runs this:
- fp32 → bf16released 27B AV ships fp32 — convert before serving (convert_to_bf16.py)
- off-by-onerepo layer_index=K = HF hidden_states[K+1]; the README quick-demo is one block early
- patch anchor driftSGLang 0.5.9 drifts one NLA patch anchor (the retract fix) — hand-adapt it
- attention backendGemma needs --attention-backend fa3 (plus --disable-radix-cache)
- payload prebuildingprebuild request payloads once per vector — per-request embed building starves the async client at 10k-request scale
- HF_HOMEHF token lives at ~/.cache/huggingface/token but HF_HOME points to /workspace/.hf_home — copy it
- ephemeral boxworkspace_is_volume: false — nothing on the instance survives recycle/destroy; push artifacts off-box