Natural Language Autoencoders · follow-up series
08

Probe curves: where the activation lives in each stack

Ridge probes reconstruct the gold L41 activation from every layer of the base model, the verbalizer (AV), and the reconstructor (AR). Twenty cells are then refit at double the training data to calibrate what the levels mean.

date: 2026-06-11
experiments: 005 (full curves, 10k pairs) + 008 (20 cells, 20k pairs)
substrate: 12k+10k positions · UFW-en docs 110000–114799 · stage-0, seed 42
eval: 1,994 doc-held-out positions · raw-mean denominator 0.0578

0.94→0.96 · identity cell, 10k→20k pairs · even x→x is data-limited; levels are lower bounds
0.85 · AV injection plateau, L38–44 · dispersed and retained, not consumed
+0.050 · mean Δ across 20 refit cells · all positive, no shape crossings
0.69 / 0.76 · AR@42 ridge vs value head · gap halves when data doubles

§1

Setup

Every probe answers the same question: from the residual stream at layer ℓ of stack S, how much of the original base-model L41 activation (direction-normalized, the eval convention of this series) can a centered linear map recover? The data are 12,000 positions from unseen UFW-en docs. The verbalizer (AV) is re-forwarded on its own greedy explanation of each vector and read at five named positions; the reconstructor (AR) reads each explanation through the critic template at the last token. That gives 421 ridge fits (63 base + 5 × 63 AV + 43 AR), each with λ selected by GCV from the train-gram eigendecomposition, all scored as fraction of variance explained (FVE) against one shared denominator on a doc-level split.
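The probe recipe described above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the code behind experiments 005/008: the function names (`unit_rows`, `doc_level_split`, `fit_ridge_gcv`, `fve`), the λ grid, and the toy sizes are all assumptions. The FVE denominator is taken here to be the mean squared deviation of the eval targets from their mean, which is only one plausible reading of the shared raw-mean denominator.

```python
import numpy as np

def unit_rows(a, eps=1e-12):
    """Direction-normalize: scale every row to unit L2 norm."""
    return a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), eps)

def doc_level_split(doc_ids, eval_frac, rng):
    """Hold out whole documents so no doc contributes to both train and eval."""
    docs = rng.permutation(np.unique(doc_ids))
    n_eval = max(1, int(round(eval_frac * len(docs))))
    is_eval = np.isin(doc_ids, docs[:n_eval])
    return ~is_eval, is_eval

def fit_ridge_gcv(X, Y, lambdas):
    """Centered ridge map X -> Y; lambda chosen by GCV via the Gram eigendecomposition."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    n = Xc.shape[0]
    s, U = np.linalg.eigh(Xc @ Xc.T)           # train Gram matrix, n x n
    s = np.clip(s, 0.0, None)                  # clear tiny negative round-off
    UtY = U.T @ Yc
    energy = (UtY ** 2).sum(axis=1)            # target energy per eigen-direction
    best_gcv, best_lam = np.inf, None
    for lam in lambdas:                        # lambdas must be strictly positive
        shrink = lam / (s + lam)               # eigenvalues of I - H(lam)
        rss = (shrink ** 2 * energy).sum()     # ||(I - H) Yc||^2
        gcv = (rss / n) / shrink.mean() ** 2   # GCV score
        if gcv < best_gcv:
            best_gcv, best_lam = gcv, lam
    keep = s > 1e-10 * s.max()                 # null directions contribute nothing to W
    alpha = U[:, keep] @ (UtY[keep] / (s[keep] + best_lam)[:, None])
    W = Xc.T @ alpha                           # primal d x k weight matrix
    return W, x_mean, y_mean, best_lam

def fve(Y_true, Y_pred, denom):
    """Fraction of variance explained against a fixed, shared denominator."""
    return 1.0 - ((Y_true - Y_pred) ** 2).sum(axis=1).mean() / denom

# Toy stand-in for an identity cell: recover the direction of x from x itself.
rng = np.random.default_rng(0)
n, d = 600, 32
doc_ids = rng.integers(0, 60, size=n)          # hypothetical document ids
X = rng.normal(size=(n, d))
Y = unit_rows(X)
train, held_out = doc_level_split(doc_ids, eval_frac=0.2, rng=rng)

W, x_mean, y_mean, lam = fit_ridge_gcv(X[train], Y[train], np.logspace(-4, 4, 33))
Y_hat = (X[held_out] - x_mean) @ W + y_mean
denom = ((Y[held_out] - Y[held_out].mean(axis=0)) ** 2).sum(axis=1).mean()
score = fve(Y[held_out], Y_hat, denom)
print(f"lambda={lam:.3g}  held-out FVE={score:.3f}")
```

Working in the n × n Gram basis means one eigendecomposition serves every λ on the grid, which is what makes a sweep over hundreds of cells affordable when d is large.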

Two reference cells calibrate everything: the identity cell (base hs42 → its own direction) and the value-head cell (AR hs42, where the shipped Linear(d,d) head reads; that head was trained on millions of pairs, so it is our exact probe class at its data-rich limit).

§2

The data haircut

Finding: The identity cell reads 0.94, not 1.0. Post-hoc normalization doesn't fix it (0.944), and a half-train fit collapses (0.26): 10k pairs underfeed a 5376² map. Every probe value in this report is a lower bound; curve shapes, not levels, are the findings.

| cell | 10k pairs | 20k pairs | Δ | asymptote |
|---|---|---|---|---|
| base@42 (identity) | 0.942 | 0.963 | +0.021 | 1.0 by construction |
| ar@42 (value-head input) | 0.615 | 0.687 | +0.072 | 0.7645 (value head) |

Doubling the data moves both cells toward their known ceilings; the AR gap to the value head roughly halves (0.15 → 0.08). This is consistent with pure data limitation, not a probe-class deficiency.
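The gap arithmetic behind that reading can be checked directly. The snippet below uses only the numbers from the table above; the variable names are illustrative.

```python
# (FVE at 10k pairs, FVE at 20k pairs, known ceiling), copied from the table above
cells = {
    "base@42 (identity)": (0.942, 0.963, 1.0),
    "ar@42 (value-head input)": (0.615, 0.687, 0.7645),
}

gap_ratio = {}
for name, (fve_10k, fve_20k, ceiling) in cells.items():
    gap_10k = ceiling - fve_10k
    gap_20k = ceiling - fve_20k
    gap_ratio[name] = gap_20k / gap_10k   # fraction of the gap left after doubling
    print(f"{name}: gap {gap_10k:.3f} -> {gap_20k:.3f} (ratio {gap_ratio[name]:.2f})")
```

About 52% of the AR gap remains after doubling, against about 64% of the identity gap, so "halves" describes the AR cell specifically.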

§3

Base: computed late, discarded fast

Fig. 1 Base model, all 63 hidden states → gold L41 direction. The content is built in a narrow band (~L35–41) and rotates away almost symmetrically after. Mid-stack carries almost nothing of it (L20: 0.16).

The flat early ramp kills the "information was always linearly present" reading: a linear map cannot pull L41's content out of L20 (0.16) because the content does not exist yet. Even adjacent L40 probes at only 0.71 — block 41 does real computational work. The post-peak decay (0.62 by L46, 0.21 at L62) says the stream abandons this representation nearly as fast as it built it.

§4

AV: the vector is parked, not consumed

Fig. 2 AV stack during explanation generation, five positions × 63 layers. The injected vector stays ~fully recoverable at its own position for a third of the stack and plateaus at 0.85 through the extraction band; positions doing the writing hold a flat ~0.2.

Finding: Consumed-vs-dispersed (idea #22) resolves to dispersed-and-retained: 1.00 through L17, a 0.85 plateau across L38–44, still 0.45 at the final layer (20k value). Meanwhile the generation positions never hold more than about a quarter of the vector: the AV reads the parked copy through attention each step rather than copying it forward.

§5

AR: a smooth climb, no snap

Fig. 3 AR stack at the last token, 43 hidden states. Reconstruction information accumulates smoothly through the depth — mirroring the base model's own L35–41 build — and is still rising at truncation, where the value head reads.

No phase transition: the explanation text is converted back into the activation gradually, with the steep gains in the last third. The value head (0.7645 on this eval set; published 0.775) sits exactly where the data-rich extrapolation of the curve's endpoint should land.

§6

20k validation: levels lift, shapes hold

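Fig. 4 The 20 refit cells: dashed = 10k pairs (005), solid = 20k pairs (008). Base and AV-injection selections shown; every cell lifts, orderings preserved.

| stack | layer | FVE @10k | FVE @20k | Δ |
|---|---|---|---|---|
| base | 20 | 0.155 | 0.182 | +0.027 |
| base | 37 | 0.456 | 0.518 | +0.062 |
| base | 40 | 0.708 | 0.770 | +0.062 |
| base | 41 | 0.805 | 0.857 | +0.052 |
| base | 42 | 0.942 | 0.963 | +0.021 |
| base | 43 | 0.825 | 0.874 | +0.049 |
| base | 46 | 0.624 | 0.696 | +0.073 |
| ar | 20 | 0.172 | 0.202 | +0.030 |
| ar | 26 | 0.204 | 0.230 | +0.026 |
| ar | 34 | 0.298 | 0.342 | +0.044 |
| ar | 39 | 0.446 | 0.514 | +0.069 |
| ar | 42 | 0.615 | 0.687 | +0.072 |
| av_injection | 17 | 0.996 | 0.997 | +0.002 |
| av_injection | 30 | 0.909 | 0.939 | +0.029 |
| av_injection | 41 | 0.851 | 0.898 | +0.047 |
| av_injection | 50 | 0.732 | 0.801 | +0.069 |
| av_injection | 62 | 0.351 | 0.454 | +0.103 |
| av_gen_final | 33 | 0.214 | 0.268 | +0.054 |
| av_gen_final | 45 | 0.221 | 0.277 | +0.056 |
| av_gen_final | 62 | 0.199 | 0.254 | +0.055 |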

Mean Δ +0.050, range +0.002…+0.103. The largest lifts occur where inputs are hardest (AV injection L62: +0.103) and the smallest at saturation (AV injection L17: +0.002), which is the signature of data limitation, not noise.
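These summary claims can be re-derived from the refit table above. The check below recomputes each Δ from the rounded FVE columns (so individual values can differ from the printed Δ by 0.001) and tests whether the within-stack ordering of cells is the same at 10k and 20k pairs. The names `rows` and `ranking` are illustrative.

```python
# (stack, layer, FVE @10k, FVE @20k), copied from the refit table above
rows = [
    ("base", 20, 0.155, 0.182), ("base", 37, 0.456, 0.518),
    ("base", 40, 0.708, 0.770), ("base", 41, 0.805, 0.857),
    ("base", 42, 0.942, 0.963), ("base", 43, 0.825, 0.874),
    ("base", 46, 0.624, 0.696),
    ("ar", 20, 0.172, 0.202), ("ar", 26, 0.204, 0.230),
    ("ar", 34, 0.298, 0.342), ("ar", 39, 0.446, 0.514),
    ("ar", 42, 0.615, 0.687),
    ("av_injection", 17, 0.996, 0.997), ("av_injection", 30, 0.909, 0.939),
    ("av_injection", 41, 0.851, 0.898), ("av_injection", 50, 0.732, 0.801),
    ("av_injection", 62, 0.351, 0.454),
    ("av_gen_final", 33, 0.214, 0.268), ("av_gen_final", 45, 0.221, 0.277),
    ("av_gen_final", 62, 0.199, 0.254),
]

deltas = [round(fve_20k - fve_10k, 3) for _, _, fve_10k, fve_20k in rows]
mean_delta = sum(deltas) / len(deltas)
all_positive = all(delta > 0 for delta in deltas)

def ranking(stack, col):
    """Layers of one stack, sorted by the FVE in column `col` (2 = 10k, 3 = 20k)."""
    return [r[1] for r in sorted((r for r in rows if r[0] == stack), key=lambda r: r[col])]

stacks = sorted({r[0] for r in rows})
orderings_preserved = all(ranking(s, 2) == ranking(s, 3) for s in stacks)

print(f"cells={len(rows)}  mean delta={mean_delta:+.3f}  max={max(deltas):+.3f}")
print(f"all positive: {all_positive}  orderings preserved: {orderings_preserved}")
```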

§7

Prediction scorecard

Verdict: curves validated, levels are floors
The three stacks divide the labor cleanly: the base computes the L41 representation late and discards it fast; the AV parks the injected copy at its token and streams roughly quarter-strength summaries to the positions writing the explanation; the AR rebuilds the activation from text gradually, reaching ~90% of its data-rich ceiling by truncation depth (0.687 of 0.7645 at 20k pairs). Experiment 008 says all of this survives doubling the probe data: shapes are stable, and absolute numbers are floors with a known haircut. Single-position probes still undercount KV-resident information; that is the one caveat a position-sweep follow-up could close.