Natural Language Autoencoders · follow-up series

Stratified error analysis: where the 0.775 lives

Per-sample decomposition of the Gemma-3-27B-L41 NLA's reconstruction error, built entirely from existing artifacts: no model forward passes, no GPU.

date: 2026-06-11
hardware: CPU only — GPUs untouched
inputs: bestofn_mses.npy (1000×10) · vecs_1000 · meta_1000 · decodes
artifacts: strat_features.json · strat_results.json

0.799 · word-start FVE (n=729, the easy bulk)

0.561 · digit-token FVE (n=39, the true weak spot)

+0.672 · Spearman r, atypicality (but mostly mechanical, §2)

0.556 · cv-ρ² predictability (error rank, before any decode)

§0

Method

no new decodes

Per-activation error is the mean of the 10 temperature-1.0 sample NMSEs from the best-of-N run; the resulting mean-of-10 FVE is 0.772, against 0.775 for greedy decoding. Bucket FVEs use the train denominator 0.0579, as throughout the series. The 200 eval docs (UFW en 100000–100199) were re-streamed and positions re-derived with the repo’s sampler (seed 42, 5 per doc, min 50): all 1000 (doc, pos) pairs matched meta_1000.json exactly, and the recomputed variance baseline reproduced (0.0575).
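The scoring described above can be sketched as follows. This is a minimal illustration, not the series' actual script: the function names are hypothetical, the array is a synthetic stand-in for bestofn_mses.npy, and the only numbers taken from the text are the 1000×10 shape and the 0.0579 train denominator.

```python
import numpy as np

TRAIN_DENOM = 0.0579  # train-set variance baseline quoted in the text


def per_activation_error(sample_errors: np.ndarray) -> np.ndarray:
    """Average the sampled decodes of each activation (one row per activation)."""
    return sample_errors.mean(axis=1)


def bucket_fve(errors: np.ndarray, mask: np.ndarray, denom: float = TRAIN_DENOM) -> float:
    """FVE of a stratum: 1 - (mean error inside the stratum) / (global denominator)."""
    return float(1.0 - errors[mask].mean() / denom)


# Synthetic stand-in for the 1000 x 10 error matrix; NOT the real data.
rng = np.random.default_rng(0)
fake_errors = rng.gamma(shape=2.0, scale=0.0066, size=(1000, 10))

err = per_activation_error(fake_errors)
everything = np.ones(len(err), dtype=bool)
print("overall FVE on synthetic data:", round(bucket_fve(err, everything), 3))
```

Because every bucket shares one denominator, bucket FVEs average back (weighted by bucket size) to the overall figure, which is what makes the decomposition of the headline number additive.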

Thirteen features per sample, in four families: gold geometry (atypicality, raw norm, doc-mate cos); token (word-start vs subword, punctuation, digit, length, Zipf frequency, POS via spaCy on the 240-char context tail); document (length, relative position, alpha fraction, line length); and explanation (char length, n quoted spans, quoted fraction). Atypicality is the per-sample predict-the-mean error, so it is each sample’s own baseline; its mean over the eval set, 0.0575, is the variance baseline behind the FVE denominator (0.0579 on train).
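As a rough sketch of the atypicality feature: the code below assumes the error is the mean squared deviation per dimension from a mean vector, which is one plausible reading of "predict-the-mean error". The exact normalization used in the series is not stated here, and the function name is hypothetical.

```python
import numpy as np


def atypicality(vecs: np.ndarray, mean_vec: np.ndarray) -> np.ndarray:
    """Per-sample error of the trivial predictor that always outputs mean_vec.

    Assumption: error = mean over dimensions of the squared deviation.
    """
    diff = vecs - mean_vec
    return (diff ** 2).mean(axis=1)


# Synthetic activations; NOT the real vecs_1000.
rng = np.random.default_rng(0)
vecs = rng.normal(size=(1000, 64))
atyp = atypicality(vecs, vecs.mean(axis=0))

# Under this definition, the mean atypicality equals the average per-dimension
# variance, i.e. the variance baseline that an FVE denominator is built from.
print(np.isclose(atyp.mean(), vecs.var(axis=0).mean()))
```

The identity checked on the last line is why atypicality doubles as each sample's own baseline: the global denominator is simply its average.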

§1

What correlates with error

Spearman · n=1000

Fig. 1 — Spearman correlation of per-activation NMSE with each feature. Gold-vector geometry dominates (atypicality +0.672, raw norm −0.645, doc-mate cos −0.503); document-position features are noise-level; expl_nspans is not significant.
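For readers who want to reproduce this kind of screen, a Spearman correlation is just a Pearson correlation on ranks. The sketch below implements it with NumPy only (ties get average ranks); the helper names are illustrative and the data are made up.

```python
import numpy as np


def average_ranks(x) -> np.ndarray:
    """Ranks starting at 1; tied values share the mean of their positions."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    # Sum the provisional ranks inside each tie group, then divide by group size.
    return np.bincount(inverse, weights=ranks)[inverse] / counts[inverse]


def spearman(x, y) -> float:
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])


# Toy usage: a monotone but nonlinear relation still gives a perfect rank correlation.
feature = np.array([0.1, 0.4, 0.2, 0.9, 0.7])
error = feature ** 3
print(spearman(feature, error))
```

Rank correlation is a reasonable choice for this screen because per-sample errors are heavy-tailed, so a handful of bad decodes would otherwise dominate a Pearson coefficient.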

§2

Atypicality: big gradient, mostly mechanical

own-baseline rescoring

FVE by atypicality quartile under the global denominator falls from 0.861 to 0.651, apparently a steep difficulty gradient. But each sample’s atypicality is its own predict-the-mean baseline, so far-from-mean samples have more variance to explain by construction. Rescoring each sample against its own baseline (own-FVE = 1 − mse_i/atyp_i) nearly flattens the gradient.

[Fig. 2 plot. Series: FVE, global denominator; FVE, own baseline (1 − mse/atyp). Y-axis: FVE, 0.6 to 0.9.]

Fig. 2 — FVE by atypicality quartile (Q1 most typical → Q4 most atypical). Global denominator: 0.861 / 0.814 / 0.762 / 0.651. Own-baseline: 0.796 / 0.779 / 0.762 / 0.760 — the gradient nearly vanishes. Spearman(atyp, own-FVE) = −0.14: a real but small residual penalty.
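The denominator effect is easy to demonstrate on synthetic data. In the sketch below every sample has exactly the same proportional fidelity (mse is a fixed 25% of its own baseline), yet the global-denominator FVE still falls steeply across atypicality quartiles. The function name is hypothetical, and averaging per-sample own-FVE within a quartile is an assumption about how the quartile figures were pooled.

```python
import numpy as np


def quartile_fves(mse: np.ndarray, atyp: np.ndarray, global_denom: float):
    """Return [(global_fve, own_fve), ...] for atypicality quartiles Q1..Q4."""
    edges = np.quantile(atyp, [0.25, 0.5, 0.75])
    quartile = np.searchsorted(edges, atyp, side="right")  # values 0..3
    out = []
    for k in range(4):
        m = quartile == k
        global_fve = 1.0 - mse[m].mean() / global_denom
        own_fve = np.mean(1.0 - mse[m] / atyp[m])  # each sample vs its own baseline
        out.append((float(global_fve), float(own_fve)))
    return out


# Synthetic case: constant proportional fidelity of 0.75 everywhere.
atyp = np.linspace(0.01, 0.10, 400)
mse = 0.25 * atyp
for q, (g, o) in enumerate(quartile_fves(mse, atyp, atyp.mean()), start=1):
    print(f"Q{q}: global={g:.3f} own={o:.3f}")
```

Any residual slope in the own-baseline column of the real data is therefore the part of the gradient that the denominator cannot explain.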

Finding: Most of the r=+0.67 is denominator mechanics. The AV’s proportional fidelity is nearly flat across the activation distribution: it tracks unusual activations almost as well, proportionally, as typical ones. This echoes the robustness seen in the adjacent-layer result (REPORT_01 §6): the vector input side degrades gracefully.

§3

Token-type strata: the digit hole

non-lexical material

Fig. 3 — FVE by token type. Word-start tokens (the bulk of the eval) decode at 0.799; digit tokens at 0.561 — the channel's true weak spot. Dashed line: the 0.775 headline.
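A token-type bucketing of this kind could look like the sketch below. The rules are hypothetical, not the ones used for the figure: they assume token strings carry a leading "▁" (or a space) at word starts, and they give digits precedence over punctuation.

```python
def token_type(tok: str) -> str:
    """Assign a token string to one of four illustrative strata."""
    core = tok.lstrip("▁ ")  # drop the assumed word-start marker
    if any(ch.isdigit() for ch in core):
        return "digit"
    if core and not any(ch.isalnum() for ch in core):
        return "punct"
    if tok[:1] in ("▁", " "):
        return "word_start"
    return "subword"


for tok in ["▁the", "ing", "▁2015", ",", "▁("]:
    print(repr(tok), "->", token_type(tok))
```

Each stratum's FVE is then one minus the mean error of its members divided by the shared global denominator, so small strata such as the digit bucket come with wide uncertainty.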

POS among word-start tokens is nearly flat: the verbalization channel handles the word classes about equally well, and its weak spot is non-lexical material. PROPN is the only dip.

[Fig. 4 plot. Word-start POS classes: PROPN, ADP, PRON, NOUN, VERB, ADJ, ADV, AUX. Y-axis: FVE, 0.7 to 0.9.]

Fig. 4 — FVE by POS among word-start tokens, on a deliberately tight domain. NOUN 0.796, VERB 0.802, ADJ 0.807, ADV 0.823, AUX 0.845, ADP 0.788, PRON 0.793 cluster tightly; PROPN (0.759) is the only dip. Cells under n=15 suppressed.

The worst-12 list says it plainly — citation, date, and numeric contexts dominate:

11:145-50	page range, journal citation
(2015	citation year
(00:24	timestamp
the 1	of “1990s”-style
,	comma in an address list
GSTR	tax-form date

Finding: The best 6 samples are function and content words in clean argumentative prose (mse ≈ 0.003–0.004, FVE > 0.93). This is consistent with REPORT_01 §3: spans are semantic reconstructions and lexically inexact, so a 148-token confabulated-span explanation cannot pin down arbitrary digit strings, which are exactly the content that semantic paraphrase cannot carry. Direct follow-up: idea #12, random-string channel capacity.

§4

Document effects: cleanliness yes, depth no

quartile FVEs

[Fig. 5 plot. Panels: doc alpha-frac (dirty → clean); doc length (72 → 2048 tok); relative position in doc. Y-axis: FVE, 0.7 to 0.8.]

Fig. 5 — FVE by document-feature quartile, shared domain. Alpha-fraction (dirty→clean) spans 0.713→0.819 — boilerplate/listy docs cost ~0.11 FVE. Doc length (72→2048 tok) and relative position are near-flat (≤0.03 spread).

Finding: The ~0.11 FVE cleanliness cost is consistent with “pure UFW prose is the clean end of the train mix” (REPORT_01 §1). Depth-in-context is worth ≤0.03: what L41 encodes at a position, as read by this NLA, is mostly local. This lowers the expected payoff of the context-length-isolation follow-up (REPORT_02 §8.6).

§5

Explanation length: marker, not mechanism

demeaned within-activation

Across activations, longer and quote-denser explanations accompany lower error (expl_len r=−0.41, expl_qfrac r=−0.36). But within an activation, across its 10 temp-1 samples with activation fixed effects removed by demeaning (10,000 samples in total), the correlation evaporates. The contrast between the two is the result.

Fig. 6 — Spearman(mse, expl_len) across vs within activations. Across: −0.406. Within (demeaned, 10,000 samples): −0.002, p=0.86; n_spans −0.034, p<0.001 but negligible.
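The across-versus-within contrast can be sketched with a small simulation. Here a hidden per-activation "content" variable drives both explanation length and error, while the sample-to-sample noise in the two is independent. All names and numbers are invented for illustration, and "across" is assumed to mean a correlation of per-activation means.

```python
import numpy as np


def average_ranks(x) -> np.ndarray:
    """Ranks starting at 1; ties share the mean of their positions."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    return np.bincount(inverse, weights=ranks)[inverse] / counts[inverse]


def spearman(x, y) -> float:
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])


def demean_within(a: np.ndarray) -> np.ndarray:
    """Remove activation fixed effects: subtract each row's own mean."""
    return a - a.mean(axis=1, keepdims=True)


rng = np.random.default_rng(0)
n_act, n_samp = 500, 10
content = rng.normal(size=(n_act, 1))  # hidden per-activation factor
expl_len = 100 + 20 * content + rng.normal(scale=5, size=(n_act, n_samp))
mse = 0.02 - 0.005 * content + rng.normal(scale=0.002, size=(n_act, n_samp))

across = spearman(expl_len.mean(axis=1), mse.mean(axis=1))
within = spearman(demean_within(expl_len).ravel(), demean_within(mse).ravel())
print(f"across={across:+.3f}  within={within:+.3f}")
```

A strong across-activation correlation with a near-zero within-activation one is the signature of a shared cause (decodable content) rather than of length itself reducing error.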

Finding: The AV writes longer explanations when there is decodable content; sampling a longer explanation for a fixed vector buys nothing. This is independent, cheap confirmation that the 148-token RL-pinned budget is not the binding constraint sample-to-sample: more test-time tokens do not help, matching the best-of-N convergence (REPORT_01 §5).

§6

Predictability of failure

doc-level 5-fold CV

0.556 · cv-ρ², all 13 features (rank-OLS, no doc leakage)

0.451 · cv-ρ², atypicality alone (one feature, no decode)

pre-decode · strongest predictors: atypicality, raw norm, doc-mate cos

free · coverage flag (no AV call needed)

Over half of per-sample error rank is predictable before any decode (cv-ρ² 0.556), and the strongest predictors are computable from the activations alone.
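One way to set up a doc-level cross-validated rank regression is sketched below. It is an approximation under stated assumptions, not the series' code: cv-ρ² is taken to be the squared Spearman correlation between out-of-fold predictions and true error, ranks are computed once over the whole set, and the data are synthetic.

```python
import numpy as np


def average_ranks(x) -> np.ndarray:
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    return np.bincount(inverse, weights=ranks)[inverse] / counts[inverse]


def cv_rho2(X, y, groups, n_folds: int = 5, seed: int = 0) -> float:
    """Grouped K-fold rank-OLS; all samples of a doc land in the same fold."""
    rng = np.random.default_rng(seed)
    docs = np.unique(groups)
    fold_of_doc = dict(zip(docs, rng.permutation(len(docs)) % n_folds))
    fold = np.array([fold_of_doc[g] for g in groups])

    Xr = np.column_stack([average_ranks(col) for col in X.T])  # rank-transform features
    yr = average_ranks(y)                                       # rank-transform target
    pred = np.empty_like(yr)
    for k in range(n_folds):
        test = fold == k
        A_train = np.column_stack([np.ones((~test).sum()), Xr[~test]])
        coef, *_ = np.linalg.lstsq(A_train, yr[~test], rcond=None)
        pred[test] = np.column_stack([np.ones(test.sum()), Xr[test]]) @ coef
    rho = np.corrcoef(average_ranks(pred), yr)[0, 1]
    return float(rho ** 2)


# Synthetic check: 200 docs x 5 samples, one informative feature out of three.
rng = np.random.default_rng(1)
groups = np.repeat(np.arange(200), 5)
X = rng.normal(size=(1000, 3))
y = X[:, 0] + 0.5 * rng.normal(size=1000)
score_signal = cv_rho2(X, y, groups)
score_noise = cv_rho2(X, rng.normal(size=1000), groups)
print(round(score_signal, 3), round(score_noise, 3))
```

Splitting by document rather than by sample matters here because the five positions drawn from one doc share document-level features, so a per-sample split would let the model see the test docs during fitting.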

Operational consequence: A calibrated “this sample will decode badly” flag is free. The legitimate half of the OpenInterp critique, that FVE won’t warn you off-distribution (REPORT_04 §11), has a cheap mitigation: distance-from-training-mean as a coverage gate, with no AV call needed.
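A coverage gate of this kind might be sketched as follows. The function names and the 95th-percentile threshold are hypothetical choices for illustration; in practice the cutoff would need calibrating against observed decode error.

```python
import numpy as np


def fit_gate(train_vecs: np.ndarray, quantile: float = 0.95):
    """Learn the training mean and a distance cutoff from training activations."""
    mean_vec = train_vecs.mean(axis=0)
    dist = ((train_vecs - mean_vec) ** 2).mean(axis=1)
    return mean_vec, float(np.quantile(dist, quantile))


def flag_hard(vecs: np.ndarray, mean_vec: np.ndarray, threshold: float) -> np.ndarray:
    """True where an activation sits farther from the training mean than the cutoff."""
    dist = ((vecs - mean_vec) ** 2).mean(axis=1)
    return dist > threshold


# Synthetic activations; NOT real data.
rng = np.random.default_rng(0)
train = rng.normal(size=(2000, 16))
mean_vec, thr = fit_gate(train)

far_away = np.full((1, 16), 10.0)  # an obviously off-distribution vector
print(flag_hard(far_away, mean_vec, thr))
print("flagged share of train:", flag_hard(train, mean_vec, thr).mean())
```

The gate needs only the activation and two stored quantities (mean vector and cutoff), which is why it costs nothing relative to a decode.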

§7

Caveats

observational

#	caveat
1	Per-activation error is temp-1 mean-of-10, not greedy (aggregate gap 0.003 FVE; rank order essentially shared).
2	Strata are observational and correlated (dirty docs contain the digits and punctuation; atypicality correlates with doc_alpha_frac) — the rank-OLS handles joint attribution only in rank space. No causal claims.
3	n=39 for digit tokens (FVE 0.561 ± ~0.05); POS cells under n=15 suppressed.
4	Same 200-doc clustering caveat as the whole series; spaCy tags on 240-char tails are approximate for the first words of the tail (the target token is always tail-final, so its tag is reliable).

§8

Verdict

Verdict: 7/10

The headline 0.775 decomposes into ~0.80–0.86 on the clean-prose lexical material that dominates the eval mix, with a tolerated tail of digits (0.56), punctuation (0.71), boilerplate docs (0.71), and far-from-mean activations (0.65 global / ~0.76 own-baseline). Nothing here weakens the release: the strata are exactly what the confabulated-semantic-span mechanism (REPORT_01 §3–4) predicts, and the proportional-fidelity flatness across the activation distribution is a genuinely good property. The two actionable outputs are digits as the measured capacity boundary of the text channel, and a free pre-decode difficulty flag. The analysis was cheap and clarifying, with no surprises that contradict the series; it raises the value of follow-up #12 (random-string capacity) and lowers that of #6 (context-length isolation).