← NLA experimentsNatural Language Autoencoders · follow-up series
05

Stratified error analysis: where the 0.775 lives

Per-sample decomposition of the Gemma-3-27B-L41 NLA's reconstruction error, built entirely from existing artifacts — no model forwards, no GPU.

date
2026-06-11
hardware
CPU only — GPUs untouched
inputs
bestofn_mses.npy (1000×10) · vecs_1000 · meta_1000 · decodes
artifacts
strat_features.json · strat_results.json
0.799
word-start FVE
n=729 — the easy bulk
0.561
digit-token FVE
n=39 — the true weak spot
+0.672
Spearman r, atypicality
…but mostly mechanical (§2)
0.556
cv-ρ² predictability
error rank, before any decode
§0

Method

Per-activation error = mean of the 10 temp-1.0 sample NMSEs from the best-of-N run (mean-of-10 FVE 0.772 ≈ greedy 0.775). Bucket FVEs use the train denominator 0.0579 as throughout the series. The 200 eval docs (UFW en 100000–100199) were re-streamed and positions re-derived with the repo’s sampler (seed 42, 5/doc, min 50): all 1000 (doc, pos) pairs matched meta_1000.json exactly, and the recomputed variance baseline reproduced (0.0575).

Thirteen features per sample, in four families: gold geometry (atypicality = per-sample predict-the-mean error, whose mean 0.0575 is the FVE denominator — so it is each sample’s own baseline; raw norm; doc-mate cos), token (word-start vs subword, punctuation, digit, length, Zipf frequency, POS via spaCy on the 240-char context tail), document (length, relative position, alpha fraction, line length), and explanation (char length, n quoted spans, quoted fraction).

§1

What correlates with error

Fig. 1 Spearman correlation of per-activation NMSE with each feature. Gold-vector geometry dominates (atypicality +0.672, raw norm −0.645, doc-mate cos −0.503); document-position features are noise-level; expl_nspans is not significant.
§2

Atypicality: big gradient, mostly mechanical

FVE by atypicality quartile under the global denominator falls from 0.861 to 0.651 — apparently a steep difficulty gradient. But each sample’s atypicality is its own predict-the-mean baseline, so far-from-mean samples have more variance to explain by construction. Rescoring each sample against its own baseline (own-FVE = 1 − msei/atypi) nearly flattens it.

FVE, global denominator
Q1
Q2
Q3
Q4
FVE, own baseline (1 − mse/atyp)
Q1
Q2
Q3
Q4
0.6FVE0.9
Fig. 2 FVE by atypicality quartile (Q1 most typical → Q4 most atypical). Global denominator: 0.861 / 0.814 / 0.762 / 0.651. Own-baseline: 0.796 / 0.779 / 0.762 / 0.760 — the gradient nearly vanishes. Spearman(atyp, own-FVE) = −0.14: a real but small residual penalty.
FindingMost of the r=+0.67 is denominator mechanics; the AV’s proportional fidelity is nearly flat across the activation distribution — it tracks unusual activations almost as well, proportionally, as typical ones. Robustness echo of the adjacent-layer result (REPORT_01 §6): the vector input side degrades gracefully.
§3

Token-type strata: the digit hole

Fig. 3 FVE by token type. Word-start tokens (the bulk of the eval) decode at 0.799; digit tokens at 0.561 — the channel's true weak spot. Dashed line: the 0.775 headline.

POS among word-start tokens is flat — the verbalization channel handles all word classes equally; its weak spot is non-lexical material. Only PROPN dips.

word-start POS classes
PROPN
ADP
PRON
NOUN
VERB
ADJ
ADV
AUX
0.7FVE0.9
Fig. 4 FVE by POS among word-start tokens, on a deliberately tight domain. NOUN 0.796, VERB 0.802, ADJ 0.807, ADV 0.823, AUX 0.845, ADP 0.788, PRON 0.793 cluster tightly; PROPN (0.759) is the only dip. Cells under n=15 suppressed.

The worst-12 list says it plainly — citation, date, and numeric contexts dominate:

FindingBest-6 samples are function/content words in clean argumentative prose (mse ≈ 0.003–0.004, FVE > 0.93). Coherent with REPORT_01 §3: spans are semantic reconstructions, lexically inexact — a 148-token confabulated-span explanation cannot pin arbitrary digit strings; that is exactly the content semantic paraphrase cannot carry. Direct follow-up: idea #12, random-string channel capacity.
§4

Document effects: cleanliness yes, depth no

doc alpha-frac (dirty → clean)
Q1
Q2
Q3
Q4
doc length (72 → 2048 tok)
Q1
Q2
Q3
Q4
relative position in doc
Q1
Q2
Q3
Q4
0.7FVE0.8
Fig. 5 FVE by document-feature quartile, shared domain. Alpha-fraction (dirty→clean) spans 0.713→0.819 — boilerplate/listy docs cost ~0.11 FVE. Doc length (72→2048 tok) and relative position are near-flat (≤0.03 spread).
FindingThe ~0.11 FVE cleanliness cost is consistent with “pure UFW prose is the clean end of the train mix” (REPORT_01 §1). Depth-in-context is worth ≤0.03: what L41 encodes at a position, as read by this NLA, is mostly local. This deflates the expected payoff of the context-length-isolation follow-up (REPORT_02 §8.6).
§5

Explanation length: marker, not mechanism

Across activations, longer and quote-denser explanations accompany lower error (expl_len r=−0.41, expl_qfrac r=−0.36). But within an activation — across its 10 temp-1 samples, activation fixed effects removed via demeaning (10,000 samples) — the correlation evaporates. The contrast is the result.

Fig. 6 Spearman(mse, expl_len) across vs within activations. Across: −0.406. Within (demeaned, 10,000 samples): −0.002, p=0.86; n_spans −0.034, p<0.001 but negligible.
FindingThe AV writes longer when there is decodable content; sampling a longer explanation for a fixed vector buys nothing. Independent cheap-side confirmation that the 148-token RL-pinned budget isn’t the binding constraint sample-to-sample — more test-time tokens don’t help, matching the best-of-N convergence (REPORT_01 §5).
§6

Predictability of failure

0.556
cv-ρ², all 13 features
rank-OLS, no doc leakage
0.451
cv-ρ², atypicality alone
one feature, no decode
pre-decode
strongest predictors
atypicality · raw norm · doc-mate cos
free
coverage flag
no AV call needed

Over half of per-sample error rank is predictable before any decode — and the strongest predictors are computable from the activation alone.

Operational consequenceA calibrated “this sample will decode badly” flag is free. The OpenInterp critique’s legitimate half — FVE won’t warn you off-distribution (REPORT_04 §11) — has a cheap mitigation: distance-from-training-mean as a coverage gate, no AV call needed.
§7

Caveats

#caveat
1Per-activation error is temp-1 mean-of-10, not greedy (aggregate gap 0.003 FVE; rank order essentially shared).
2Strata are observational and correlated (dirty docs contain the digits and punctuation; atypicality correlates with doc_alpha_frac) — the rank-OLS handles joint attribution only in rank space. No causal claims.
3n=39 for digit tokens (FVE 0.561 ± ~0.05); POS cells under n=15 suppressed.
4Same 200-doc clustering caveat as the whole series; spaCy tags on 240-char tails are approximate for the first words of the tail (the target token is always tail-final, so its tag is reliable).
§8

Verdict

Verdict7/10
The headline 0.775 decomposes into ~0.80–0.86 on the clean-prose lexical material that dominates the eval mix, with a tolerated tail of digits (0.56), punctuation (0.71), boilerplate docs (0.71), and far-from-mean activations (0.65 global / ~0.76 own-baseline). Nothing here weakens the release — the strata are exactly what the confabulated-semantic-span mechanism (REPORT_01 §3–4) predicts, and the proportional-fidelity flatness across the activation distribution is a genuinely good property. The two actionable outputs: digits as the measured capacity boundary of the text channel, and a free pre-decode difficulty flag. Cheap, clarifying, no surprises that contradict the series; raises the value of follow-up #12 (random-string capacity) and lowers #6 (context-length isolation).