Causal KL: does the reconstruction error matter downstream?
The AR's reconstructions spliced back into gemma-3-27b-it at layer 41 — next-token KL vs the clean forward pass, against matched-MSE noise and ablation floors. The SAE loss-recovered eval, run on an NLA for the first time.
- substrate: the 1000 FVE-eval activations (UFW en 100k–100.2k, seed 42)
- conditions: clean · gold · recon ×2 · matched noise · resample ×2 · zero · mean
- readout: KL(clean‖patched) + ΔCE, offsets 0–31 from the patch
- floor checks: re-extract cos>0.999 ✓ · gold repatch KL 4e-4 ✓ · FVE 0.7749 ✓
Why FVE can't answer this question
Every number in this series so far is FVE — a geometric metric that weights all 5376 directions by their variance. The model does not: downstream attention and MLPs read specific subspaces, and a residual of fixed size can be causally invisible or load-bearing depending on where it points. The standard answer in SAE evaluation is to splice the reconstruction into the forward pass and measure what happens to the logits. The NLA paper never runs this; its causal evidence is edit-steering at ~50% reliability.
Here, each of the 1000 eval positions gets one batched forward pass of the base model with nine variants of the layer-41 hidden state at the extraction position:
- clean: the state left untouched
- gold: the stored gold activation re-injected (harness floor)
- recon ×2: the AR reconstruction, once as its direction at gold norm and once at its raw norm
- matched noise: gold plus isotropic noise scaled to the reconstruction’s exact error norm (the decisive control)
- resample ×2: a real activation from elsewhere in the same document, and one from a different document (the on-manifold “content removed” floors, à la causal scrubbing)
- zero: zero ablation
- mean: the eval-set mean
The readout is the per-position KL of the next-token distribution against clean, plus CE on the true continuation.
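The nine conditions can be sketched in a few lines. This is a minimal illustration on plain Python lists, not the harness used for the report: the function names are hypothetical, and it assumes the matched noise is scaled to the error of the gold-norm reconstruction, which the text does not specify.

```python
import math
import random


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def scale_to_norm(v, target):
    """Rescale v to the given norm, keeping its direction."""
    n = norm(v)
    if n == 0.0:
        raise ValueError("cannot rescale a zero vector")
    return [x * target / n for x in v]


def matched_noise(gold, recon, rng):
    """Gold plus an isotropic Gaussian direction scaled to ||recon - gold||."""
    err_norm = norm([r - g for r, g in zip(recon, gold)])
    direction = [rng.gauss(0.0, 1.0) for _ in gold]
    noise = scale_to_norm(direction, err_norm)
    return [g + e for g, e in zip(gold, noise)]


def build_variants(gold, recon_raw, same_doc, cross_doc, eval_mean, rng):
    """Return the nine patch conditions; None means 'leave the state untouched'."""
    recon_gold_norm = scale_to_norm(recon_raw, norm(gold))
    return {
        "clean": None,
        "gold": list(gold),
        "recon_gold_norm": recon_gold_norm,
        "recon_raw_norm": list(recon_raw),
        # Assumption: the noise matches the gold-norm reconstruction's error.
        "matched_noise": matched_noise(gold, recon_gold_norm, rng),
        "resample_same_doc": list(same_doc),
        "resample_cross_doc": list(cross_doc),
        "zero": [0.0] * len(gold),
        "mean": list(eval_mean),
    }
```

Because the noise is rescaled after sampling, its error norm equals the reconstruction's up to floating-point rounding, so only the direction of the error differs between the two conditions.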
The channel passes the sufficiency test
| condition | KL@0 med | KL@0 mean | KL@0 p90 | ΔCE@0 med | loss recov. |
|---|---|---|---|---|---|
| gold repatch (floor) | 0.0004 | 0.001 | 0.002 | +0.0000 | 100.0% |
| AR reconstruction (gold norm) | 0.0657 | 0.219 | 0.415 | +0.0025 | 98.7% |
| AR reconstruction (raw norm) | 0.0773 | 0.246 | 0.485 | +0.0012 | 98.5% |
| matched-MSE noise | 0.0238 | 0.231 | 0.252 | +0.0053 | 98.1% |
| resample, same doc | 12.3585 | 12.498 | 19.431 | +11.8504 | 3.3% |
| resample, cross-doc | 12.4388 | 12.693 | 19.593 | +12.0077 | 0.0% |
| zero ablation | 6.5473 | 7.521 | 15.023 | +6.0187 | 41.5% |
| mean ablation | 5.7144 | 6.597 | 13.635 | +5.2216 | 50.7% |
KL in nats at the patched position, n=1000. Loss recovered = 1 − ΔCE/ΔCE(cross-doc resample), at offset 0 — the on-manifold floor; zero/mean rows shown for comparison.
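The readout quantities in the table can be written out as follows. This is a sketch under assumptions: logits are plain lists, the helper names are hypothetical, and how ΔCE is aggregated across the 1000 items before entering the loss-recovered formula (mean or median) is left to the caller, since the caption does not say.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_nats(clean_logits, patched_logits):
    """KL(clean || patched) between next-token distributions, in nats."""
    log_p = log_softmax(clean_logits)
    log_q = log_softmax(patched_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(log_p, log_q))


def cross_entropy(logits, target):
    """CE of the true continuation token (index `target`), in nats."""
    return -log_softmax(logits)[target]


def delta_ce(clean_logits, patched_logits, target):
    """Increase in CE on the true continuation caused by the patch."""
    return cross_entropy(patched_logits, target) - cross_entropy(clean_logits, target)


def loss_recovered(delta_ce_condition, delta_ce_floor):
    """1 - dCE / dCE(floor); the floor here is the cross-doc resample."""
    return 1.0 - delta_ce_condition / delta_ce_floor
```

By construction the floor condition itself scores 0% and a patch that leaves CE unchanged scores 100%, which matches the cross-doc resample and gold repatch rows.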
The floors themselves carry a finding. Swapping in a real activation with the wrong content (12.4 nats) is roughly twice as damaging as deleting the state outright (zero 6.5, mean 5.7). One reading: off-manifold vectors register as “missing data” and degrade gracefully, while on-manifold wrong content is trusted and propagated confidently. And the same-document resample is no gentler than the cross-document one (12.36 vs 12.44): sharing topic, register and style buys essentially nothing. What a position carries that matters is its local, position-specific payload.
But FVE doesn't rank the damage
Per item, geometric reconstruction error and causal damage are uncorrelated: Spearman ρ = −0.025 (p = 0.42) between direction-MSE and KL@0. The quantity the whole series scores items by carries no information about which items’ computations were actually hurt. What does predict causal damage is the position’s own uncertainty — clean next-token CE correlates at ρ = 0.55. Fragility lives in the position, not in the reconstruction quality.
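For readers who want to rerun the per-item check, a rank correlation of this kind can be computed as below. This is a stdlib-only sketch with hypothetical function names; the report does not say which implementation produced ρ, and the p-value is not computed here.

```python
import math


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))
```

Here `x` would be the per-item direction-MSE and `y` the per-item KL@0; swapping `x` for the clean next-token CE gives the second correlation quoted above.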
The residual is structured — and concentrated where FVE looks best
At every position the noise control has exactly the same error norm as the reconstruction; only the direction differs. If the AR’s residual were causally generic, the two rows in §2 would match. They don’t: the reconstruction is 2.3× more damaging at the paired median, worse on 70% of items. The error the text channel makes points into used subspace more often than chance: it is systematic (toward the AR’s priors), not isotropic.
| geometric-error quartile (Q1 best → Q4 worst recon) | direction-MSE | FVE-equiv | KL recon med (nats) | KL noise med (nats) | recon/noise (paired median) |
|---|---|---|---|---|---|
| Q1 | 0.0024–0.0082 | 0.891 | 0.0614 | 0.0176 | 3.10× |
| Q2 | 0.0082–0.0111 | 0.835 | 0.0789 | 0.0260 | 2.61× |
| Q3 | 0.0111–0.0154 | 0.778 | 0.0755 | 0.0324 | 2.29× |
| Q4 | 0.0154–0.0937 | 0.596 | 0.0468 | 0.0297 | 1.05× |
Paired recon/noise KL ratio by quartile of geometric error. The ratio collapses to 1 exactly where reconstruction is worst.
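The ratio column is a median of per-item ratios, not a ratio of the two median columns, which is why it does not equal their quotient. A minimal sketch of that computation, with hypothetical names and assuming every noise KL is strictly positive:

```python
from statistics import median


def paired_ratio_by_quartile(geo_err, kl_recon, kl_noise):
    """Bin items into quartiles of geometric error, then summarise the
    per-item recon/noise KL ratio inside each bin."""
    order = sorted(range(len(geo_err)), key=lambda i: geo_err[i])
    n = len(order)
    rows = []
    for q in range(4):
        idx = order[q * n // 4:(q + 1) * n // 4]
        # Assumption: kl_noise[i] > 0 for every item.
        ratios = [kl_recon[i] / kl_noise[i] for i in idx]
        rows.append({
            "quartile": "Q%d" % (q + 1),
            "median_ratio": median(ratios),
            "frac_recon_worse": sum(r > 1.0 for r in ratios) / len(ratios),
        })
    return rows
```

Pairing matters because each item's noise control shares that item's error norm and position, so the ratio isolates the effect of error direction.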
The gradient is the interesting part. Where the AR reconstructs best (Q1, FVE-equiv ≈ 0.89), its small residual is the most causally loaded: 3.1× noise. Where it reconstructs worst (Q4), the large residual behaves almost exactly like noise (1.05×). Read with REPORT_05 (hard items are atypical, high-norm positions): on typical prose the channel has learned the meaningful subspace and its leftover error lives there too; on atypical positions it misses in directions the local computation doesn’t even read. Digit tokens, FVE’s famous weak spot at 0.56, are causally the most robust positions here: median KL 0.004 vs 0.070 for non-digit (n=39, so indicative only).
The resample conditions calibrate this scale from the other end. For small perturbations KL is locally quadratic in the perturbation, so KL per unit of squared error norm measures how much of an error’s energy lands in the directions this position’s computation actually reads. There are three anchors: isotropic noise (chance), the AR residual, and a pure content difference (the resample conditions: a real activation with the wrong content).
Solving the mixture: if the residual were α·content-error + (1−α)·chance-like error, its per-energy damage implies α ≈ 5.5%. (A lower bound on inertness: at 12 nats the quadratic regime is bending, which understates the content anchor’s slope — true α is, if anything, smaller.) A detail worth keeping: per unit energy the same-doc resample is more damaging than cross-doc (42× vs 35×) — the components two positions of one document share are causally cheap, so what remains after sharing is purer payload.
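One way to read the mixture calculation, stated as an assumption rather than the report's exact procedure: if per-energy damage mixes linearly between the chance anchor and the content anchor, then α follows from a one-line interpolation. The function names are hypothetical, and slopes are expressed relative to isotropic noise (so chance = 1).

```python
def per_energy_damage(kl, err_sq_norm):
    """KL per unit of squared error norm (the local quadratic slope)."""
    return kl / err_sq_norm


def mixture_alpha(recon_slope, content_slope, chance_slope=1.0):
    """Solve recon = alpha * content + (1 - alpha) * chance for alpha.

    Assumes damage per unit energy mixes linearly between the two anchors.
    """
    if content_slope == chance_slope:
        raise ValueError("anchors coincide; alpha is undefined")
    return (recon_slope - chance_slope) / (content_slope - chance_slope)
```

A residual that damages exactly like noise gives α = 0 and one that damages like a full content swap gives α = 1. Under this model, any underestimate of the content anchor's slope inflates α, which is the sense in which the quoted figure is an upper bound on the content-like share.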
One position is consumed locally
Even swapping the position’s entire content for someone else’s barely registers one token later: median KL falls from 12.4 to 0.008 nats at offset 1 and is at the measurement floor by offset 5 (zero-ablation behaves the same, 6.5 → 0.006). A single layer-41 hidden state is, in the median, read once for its own next-token prediction and then almost fully reconstructed from context by attention at later positions. Single-position splicing therefore measures the local causal role; distributed roles (the payload a position contributes to far-away copies) need multi-position patches — the natural follow-up.
What this changes
For the paper’s pitch — explanations you can trust because they reconstruct — the split verdict matters. Aggregate sufficiency (§2) strengthens it: the channel demonstrably carries what the model uses. The decoupling (§3) weakens the per-item reading: an explanation’s reconstruction score says nothing about whether the unexplained residue was the causally important part. And §4 says the residue is not innocent — it is biased into used subspace, most so exactly where the score looks best.
Methodologically: causal KL is ~40 minutes of B200 for the whole eval set, needs no SGLang, and re-scores any reconstruction route (ridge, adapters, cross-family) in the units that matter. The cross-family interlingua claim in particular deserves this treatment — FVE 0.46 there could be causally fine or causally empty, and §3 says FVE cannot tell us which.