Natural Language Autoencoders · follow-up series
11

Causal KL: does the reconstruction error matter downstream?

The AR's reconstructions spliced back into gemma-3-27b-it at layer 41 — next-token KL vs the clean forward pass, against matched-MSE noise and ablation floors. The SAE loss-recovered eval, run on an NLA for the first time.

substrate: the 1000 FVE-eval activations (UFW en 100k–100.2k, seed 42)
conditions: clean · gold · recon ×2 · matched noise · resample ×2 · zero · mean
readout: KL(clean‖patched) + ΔCE, offsets 0–31 from the patch
floor checks: re-extract cos>0.999 ✓ · gold repatch KL 4e-4 ✓ · FVE 0.7749 ✓
§1

Why FVE can't answer this question

Every number in this series so far is FVE — a geometric metric that weights all 5376 directions by their variance. The model does not: downstream attention and MLPs read specific subspaces, and a residual of fixed size can be causally invisible or load-bearing depending on where it points. The standard answer in SAE evaluation is to splice the reconstruction into the forward pass and measure what happens to the logits. The NLA paper never runs this; its causal evidence is edit-steering at ~50% reliability.

For each of the 1000 eval positions, one batched forward pass of the base model carries nine variants of the layer-41 hidden state at the extraction position: untouched; the stored gold re-injected (harness floor); the AR reconstruction, both as its direction at gold norm and at its raw norm; gold plus isotropic noise scaled to the reconstruction’s exact error norm (the decisive control); two resample ablations, one a real activation from elsewhere in the same document and one from a different document (the on-manifold “content removed” floors, à la causal scrubbing); zero; and the eval-set mean. The readout is the per-position KL of the next-token distribution against clean, plus CE on the true continuation.
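The nine variants can be sketched in a few lines. This is a minimal illustration on plain Python lists, not the harness used for the report: the function names are hypothetical, and it assumes the noise is matched to the gold-norm reconstruction's error, which the text does not specify.

```python
import math
import random

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def scale_to_norm(v, target):
    """Rescale v to the target norm, keeping its direction (assumes v is nonzero)."""
    n = norm(v)
    return [x * target / n for x in v]

def matched_noise(gold, recon, rng):
    """Gold plus isotropic Gaussian noise rescaled to the reconstruction's error norm."""
    err_norm = norm([r - g for r, g in zip(recon, gold)])
    noise = scale_to_norm([rng.gauss(0.0, 1.0) for _ in gold], err_norm)
    return [g + n for g, n in zip(gold, noise)]

def build_variants(gold, recon, same_doc, cross_doc, mean_vec, seed=0):
    """Return the nine layer-state variants for one eval position."""
    rng = random.Random(seed)
    recon_gold_norm = scale_to_norm(recon, norm(gold))
    return {
        "clean": None,  # no patch applied
        "gold": list(gold),  # harness floor
        "recon_gold_norm": recon_gold_norm,
        "recon_raw_norm": list(recon),
        # Assumption: noise is matched to the gold-norm reconstruction's error.
        "matched_noise": matched_noise(gold, recon_gold_norm, rng),
        "resample_same_doc": list(same_doc),
        "resample_cross_doc": list(cross_doc),
        "zero": [0.0] * len(gold),
        "mean": list(mean_vec),
    }
```

The key property is that the noise variant and the reconstruction sit at the same distance from gold, so any difference in downstream KL comes from the direction of the error alone.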

§2

The channel passes the sufficiency test

loss recovered: 98.7% (vs resample floor; 97.4–97.8% vs mean/zero)
median KL, recon (nats): 0.066 (vs 12.4 for content swapped, 190× down)
gold-repatch floor: 0.0004 (old-box golds re-injected; harness noise)
FVE re-check: 0.775 (recomputed reconstructions match the series)
| condition | KL@0 med | KL@0 mean | KL@0 p90 | ΔCE@0 med | loss recov. |
|---|---|---|---|---|---|
| gold repatch (floor) | 0.0004 | 0.001 | 0.002 | +0.0000 | 100.0% |
| AR reconstruction (gold norm) | 0.0657 | 0.219 | 0.415 | +0.0025 | 98.7% |
| AR reconstruction (raw norm) | 0.0773 | 0.246 | 0.485 | +0.0012 | 98.5% |
| matched-MSE noise | 0.0238 | 0.231 | 0.252 | +0.0053 | 98.1% |
| resample, same doc | 12.3585 | 12.498 | 19.431 | +11.8504 | 3.3% |
| resample, cross-doc | 12.4388 | 12.693 | 19.593 | +12.0077 | 0.0% |
| zero ablation | 6.5473 | 7.521 | 15.023 | +6.0187 | 41.5% |
| mean ablation | 5.7144 | 6.597 | 13.635 | +5.2216 | 50.7% |

KL in nats at the patched position, n=1000. Loss recovered = 1 − ΔCE/ΔCE(cross-doc resample), at offset 0 — the on-manifold floor; zero/mean rows shown for comparison.
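The two readouts in the table can be written down directly. The sketch below is an illustrative stdlib version with hypothetical function names; it takes raw logits as lists, and it leaves the aggregation of ΔCE across items (mean or median) to the caller.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_clean_vs_patched(clean_logits, patched_logits):
    """KL(clean || patched) in nats between two next-token distributions."""
    lp = log_softmax(clean_logits)
    lq = log_softmax(patched_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def cross_entropy(logits, target):
    """CE in nats on the true continuation token index."""
    return -log_softmax(logits)[target]

def loss_recovered(delta_ce, delta_ce_floor):
    """1 - dCE / dCE(floor); the floor here is the cross-doc resample."""
    return 1.0 - delta_ce / delta_ce_floor
```

By construction the floor condition itself scores 0% and a patch that leaves CE unchanged scores 100%, which matches the cross-doc and gold-repatch rows.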

Finding: The text channel is causally sufficient in aggregate: splicing the reconstruction in costs a median 0.066 nats of next-token KL (two orders of magnitude above the measurement floor, two orders below the content-swap floor), and 98.7% of resample loss is recovered. The 0.775-FVE code is not a geometric illusion; the model can almost run on it.

The floors themselves carry a finding. Swapping in a real activation with the wrong content (12.4 nats) is twice as damaging as deleting the state outright (zero 6.5, mean 5.7): off-manifold vectors read as “missing data” and degrade gracefully, while on-manifold wrong content is trusted and propagated confidently. And the same-document resample is no gentler than the cross-document one (12.36 vs 12.44) — sharing topic, register and style buys essentially nothing. What a position carries that matters is its local, position-specific payload.

§3

But FVE doesn't rank the damage

Per item, geometric reconstruction error and causal damage are uncorrelated: Spearman ρ = −0.025 (p = 0.42) between direction-MSE and KL@0. The quantity the whole series scores items by carries no information about which items’ computations were actually hurt. What does predict causal damage is the position’s own uncertainty — clean next-token CE correlates at ρ = 0.55. Fragility lives in the position, not in the reconstruction quality.
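For readers who want to reproduce the rank correlation without extra dependencies, here is a small stdlib sketch of Spearman's ρ (Pearson correlation of average ranks, with ties sharing their mean rank). The function names are illustrative, and the inputs would be the per-item direction-MSE and KL@0 arrays, which are not reproduced here.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation (assumes neither input is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman rank correlation between two equal-length sequences."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because it works on ranks, the statistic is insensitive to the heavy right tail of the KL values, which makes it a reasonable choice for this comparison.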

Fig. 1 Geometric error vs causal damage, per item (400-item subsample, log–log). A cloud, not a line: ρ = −0.03. The x-axis is what FVE measures; the y-axis is what the model feels.
Headline: FVE certifies the channel, not the items. Aggregate causal sufficiency is real (§2), but the per-item FVE ranking is causally meaningless: a poorly-reconstructed activation is no more likely to disrupt the model than a well-reconstructed one.
§4

The residual is structured — and concentrated where FVE looks best

At every position the noise control has exactly the same error norm as the reconstruction; only the direction differs. If the AR’s residual were causally generic, the two rows in §2 would match. They don’t: the reconstruction is 2.3× more damaging at the paired median, and worse on 70% of items. The error the text channel makes points into used subspace more often than chance: it is systematic (toward the AR’s priors), not isotropic.

| geometric-error quartile (Q1 best → Q4 worst recon) | direction-MSE | FVE-equiv | KL recon med | KL noise med | recon/noise |
|---|---|---|---|---|---|
| Q1 | 0.0024–0.0082 | 0.891 | 0.0614 | 0.0176 | 3.10× |
| Q2 | 0.0082–0.0111 | 0.835 | 0.0789 | 0.0260 | 2.61× |
| Q3 | 0.0111–0.0154 | 0.778 | 0.0755 | 0.0324 | 2.29× |
| Q4 | 0.0154–0.0937 | 0.596 | 0.0468 | 0.0297 | 1.05× |

Paired recon/noise KL ratio by quartile of geometric error. The ratio collapses to 1 exactly where reconstruction is worst.
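The paired comparison can be sketched as follows. This assumes "paired median" means the median of per-item recon/noise KL ratios, which would explain why the recon/noise column is not the ratio of the two median columns. The function names are hypothetical, and the noise KLs are assumed strictly positive.

```python
from statistics import median

def paired_comparison(kl_recon, kl_noise):
    """Median per-item recon/noise KL ratio and fraction of items where recon is worse."""
    ratios = [r / n for r, n in zip(kl_recon, kl_noise)]
    frac_worse = sum(r > n for r, n in zip(kl_recon, kl_noise)) / len(kl_recon)
    return median(ratios), frac_worse

def by_quartile(geo_error, kl_recon, kl_noise):
    """Paired comparison within each quartile of geometric error (Q1 = smallest error)."""
    order = sorted(range(len(geo_error)), key=lambda i: geo_error[i])
    n = len(order)
    results = []
    for q in range(4):
        idx = order[q * n // 4:(q + 1) * n // 4]
        results.append(paired_comparison([kl_recon[i] for i in idx],
                                         [kl_noise[i] for i in idx]))
    return results
```

Pairing within items controls for position-level fragility, which §3 identifies as the main driver of KL, so the ratio isolates the effect of error direction.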

The gradient is the interesting part. Where the AR reconstructs best (Q1, FVE-equiv ≈ 0.89), its small residual is the most causally loaded: 3.1× noise. Where it reconstructs worst (Q4), the large residual behaves almost exactly like noise (1.05×). Read with REPORT_05 (hard items are atypical, high-norm positions): on typical prose the channel has learned the meaningful subspace and its leftover error lives there too; on atypical positions it misses in directions the local computation doesn’t even read. Digit tokens, FVE’s famous weak spot at 0.56, are causally the most robust positions here: median KL 0.004 vs 0.070 for non-digit (n=39, so indicative only).

The resample conditions calibrate this scale from the other end. KL is locally quadratic in the perturbation, so KL per unit of squared error norm measures how much of an error’s energy lands in the directions this position’s computation actually reads. Three anchors: isotropic noise (chance), the AR residual, and a pure content difference (resample — a real activation, wrong content):

Fig. 2 Causal damage per unit of squared error norm, median across items, normalized to noise = 1. The AR residual (2.9×) sits near the noise end of the noise→content scale (35–42×): systematically biased into used subspace, but ~95% of its energy behaves like causally generic error.

Solving the mixture: if the residual were α·content-error + (1−α)·chance-like error, its per-energy damage implies α ≈ 5.5%. (A lower bound on inertness: at 12 nats the quadratic regime is bending, which understates the content anchor’s slope — true α is, if anything, smaller.) A detail worth keeping: per unit energy the same-doc resample is more damaging than cross-doc (42× vs 35×) — the components two positions of one document share are causally cheap, so what remains after sharing is purer payload.
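The mixture arithmetic is a one-line solve. The sketch below uses hypothetical function names and plugs in the rounded per-energy figures quoted in the text (noise = 1, residual = 2.9, content = 35 or 42); the per-item KLs and squared error norms behind those figures are not reproduced here.

```python
from statistics import median

def damage_per_energy(kl, err_sq_norm):
    """Median KL per unit of squared error norm across items."""
    return median([k / e for k, e in zip(kl, err_sq_norm)])

def content_fraction(residual_damage, content_damage, noise_damage=1.0):
    """Solve residual = a * content + (1 - a) * noise for the content share a."""
    return (residual_damage - noise_damage) / (content_damage - noise_damage)

# Rounded anchors from the text, normalized to noise = 1.
alpha_cross_doc = content_fraction(2.9, 35.0)  # about 0.056
alpha_same_doc = content_fraction(2.9, 42.0)   # about 0.046
```

With the rounded inputs the cross-document anchor gives roughly 5.6% and the same-document anchor roughly 4.6%, in line with the α ≈ 5.5% and "~5%" figures quoted here.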

Finding: The residual is not white: it is aligned with the read subspace at 2.9× chance, and is 2.3× worse than matched noise at the paired median. But it is also not lost content: only ~5% of its energy behaves like semantic substitution. The channel drops mostly causally-inert detail plus a thin, systematic slice of real payload, and that slice is thickest where FVE looks best (Q1). By mean ΔCE the reconstruction is even gentler than noise (98.7% vs 98.1% recovered), because noise’s damage has a fatter worst-case tail on the true token.
§5

One position is consumed locally

Even swapping the position’s entire content for someone else’s barely registers one token later: median KL falls from 12.4 to 0.008 nats at offset 1 and is at the measurement floor by offset 5 (zero-ablation behaves the same, 6.5 → 0.006). A single layer-41 hidden state is, in the median, read once for its own next-token prediction and then almost fully reconstructed from context by attention at later positions. Single-position splicing therefore measures the local causal role; distributed roles (the payload a position contributes to far-away copies) need multi-position patches — the natural follow-up.

Fig. 3 Median KL by token offset from the patch (log10). All conditions — even a full content swap — collapse toward the gold floor within ~5 tokens; the entire measurable effect is at the patched position.
§6

What this changes

For the paper’s pitch — explanations you can trust because they reconstruct — the split verdict matters. Aggregate sufficiency (§2) strengthens it: the channel demonstrably carries what the model uses. The decoupling (§3) weakens the per-item reading: an explanation’s reconstruction score says nothing about whether the unexplained residue was the causally important part. And §4 says the residue is not innocent — it is biased into used subspace, most so exactly where the score looks best.

Methodologically: causal KL takes ~40 minutes of B200 time for the whole eval set, needs no SGLang, and re-scores any reconstruction route (ridge, adapters, cross-family) in the units that matter. The cross-family interlingua claim in particular deserves this treatment: FVE 0.46 there could be causally fine or causally empty, and §3 says FVE cannot tell us which.

Verdict: finding, 8.5/10
The reconstruction passes the causal test the paper never ran — 98.7% loss recovered against the honest, on-manifold floor — but FVE and causal damage decouple completely at the item level (ρ ≈ 0), and the residual is structured: 2.3× worse than size-matched noise at the median, ~5% of its energy behaving like lost content, thickest where FVE looks best. Two calibration findings ride along: confidently-wrong on-manifold content is 2× as damaging as deleting the state, and a single L41 position’s causal role is consumed almost entirely at the position itself. FVE certifies the code; it neither ranks nor bounds the harm of what the code drops.