← NLA experimentsNatural Language Autoencoders · follow-up series

Probe curves: where the activation lives in each stack

Ridge probes reconstruct the gold L41 activation from every layer of the base model, the verbalizer, and the reconstructor — then 20 cells refit at double the training data to calibrate what the levels mean.

date: 2026-06-11
experiments: 005 (full curves, 10k pairs) + 008 (20 cells, 20k pairs)
substrate: 12k+10k positions · UFW-en docs 110000–114799 · stage-0, seed 42
eval: 1,994 doc-held-out positions · raw-mean denominator 0.0578

0.94→0.96

identity cell, 10k→20k pairs

even x→x is data-limited; levels are lower bounds

0.85

AV injection plateau, L38–44

dispersed and retained — not consumed

+0.050

mean Δ across 20 refit cells

all positive, no shape crossings

0.69 / 0.76

AR@42 ridge vs value head

gap halves when data doubles

§1

Setup

one probe class, one denominator

Every probe answers the same question: from the residual stream at layer ℓ of stack S, how much of the original base-model L41 activation (direction-normalized, the eval convention of this series) can a centered linear map recover? 12,000 positions from unseen UFW-en docs; the AV is re-forwarded on its own greedy explanation of each vector and read at five named positions; the AR reads each explanation through the critic template, last token. 421 ridge fits (GCV-selected λ from the train-gram eigendecomposition), one shared FVE denominator, doc-level split.

Two reference cells calibrate everything: the identity cell (base hs42 → its own direction) and the value-head cell (AR hs42, where the shipped Linear(d,d) head reads — trained on millions of pairs, our exact probe class at its data-rich limit).

§2

The data haircut

calibration before curves

Finding —The identity cell reads 0.94, not 1.0 — and post-hoc normalization doesn't fix it (0.944), while a half-train fit collapses (0.26). 10k pairs underfeeds a 5376² map. Every probe value in this report is a lower bound; curve shapes, not levels, are the findings.

cell	10k pairs	20k pairs	Δ	asymptote
base@42 (identity)	0.942	0.963	+0.021	1.0 by construction
ar@42 (value-head input)	0.615	0.687	+0.072	0.7645 (value head)

Doubling data moves both cells toward their known ceilings; the AR gap roughly halves. Consistent with pure data limitation, not a probe-class deficiency.

§3

Base: computed late, discarded fast

Fig. 1 — Base model, all 63 hidden states → gold L41 direction. The content is built in a narrow band (~L35–41) and rotates away almost symmetrically after. Mid-stack carries almost nothing of it (L20: 0.16).

The flat early ramp kills the "information was always linearly present" reading: a linear map cannot pull L41's content out of L20 (0.16) because the content does not exist yet. Even adjacent L40 probes at only 0.71 — block 41 does real computational work. The post-peak decay (0.62 by L46, 0.21 at L62) says the stream abandons this representation nearly as fast as it built it.

§4

AV: the vector is parked, not consumed

Fig. 2 — AV stack during explanation generation, five positions × 63 layers. The injected vector stays ~fully recoverable at its own position for a third of the stack and plateaus at 0.85 through the extraction band; positions doing the writing hold a flat ~0.2.

Finding —Consumed-vs-dispersed (idea #22) resolves to dispersed-and-retained: 1.00 through L17, a 0.85 plateau across L38–44, still 0.45 at the final layer (20k value). Meanwhile the generation positions never hold more than ~a quarter of the vector — the AV reads the parked copy through attention each step rather than copying it forward.

§5

AR: a smooth climb, no snap

Fig. 3 — AR stack at the last token, 43 hidden states. Reconstruction information accumulates smoothly through the depth — mirroring the base model's own L35–41 build — and is still rising at truncation, where the value head reads.

No phase transition: the explanation text is converted back into the activation gradually, with the steep gains in the last third. The value head (0.7645 on this eval set; published 0.775) sits exactly where the data-rich extrapolation of the curve's endpoint should land.

§6

20k validation: levels lift, shapes hold

experiment 008

Fig. 4 — The 20 refit cells: dashed = 10k pairs (005), solid = 20k pairs (008). Base and AV-injection selections shown; every cell lifts, orderings preserved.

stack	layer	FVE @10k	FVE @20k	Δ
base	20	0.155	0.182	+0.027
base	37	0.456	0.518	+0.062
base	40	0.708	0.770	+0.062
base	41	0.805	0.857	+0.052
base	42	0.942	0.963	+0.021
base	43	0.825	0.874	+0.049
base	46	0.624	0.696	+0.073
ar	20	0.172	0.202	+0.030
ar	26	0.204	0.230	+0.026
ar	34	0.298	0.342	+0.044
ar	39	0.446	0.514	+0.069
ar	42	0.615	0.687	+0.072
av_injection	17	0.996	0.997	+0.002
av_injection	30	0.909	0.939	+0.029
av_injection	41	0.851	0.898	+0.047
av_injection	50	0.732	0.801	+0.069
av_injection	62	0.351	0.454	+0.103
av_gen_final	33	0.214	0.268	+0.054
av_gen_final	45	0.221	0.277	+0.056
av_gen_final	62	0.199	0.254	+0.055

Mean Δ +0.050, range +0.002…+0.103. Largest lifts where inputs are hardest (AV L62: +0.103) and smallest at saturation (AV injection L17: +0.002) — the signature of data limitation, not noise.

§7

Prediction scorecard

kept honest

✗
Base FVE = 1 at L41 (trivially)— 0.94 at 10k — even the identity map is data-limited at d=5376. The most instructive miss of the series.
~
AR output lands at the published 0.77— value head: 0.7645 ✓; a fresh 10k-pair ridge on the same input: 0.61. Probe class ≠ probe data.
✗
Base mid-stack near-saturated (ridge undoes drift)— L20 = 0.16. Linear maps can't recover content that hasn't been computed.
✓
AV vector dispersed rather than consumed— 0.85 plateau through the extraction band; 0.45 even at the last layer (20k).
✓
Doubling data lifts levels, preserves shapes— 20/20 cells positive, no crossings, lifts largest where furthest from ceiling.

Verdictcurves validated, levels are floors

The three stacks divide the labor cleanly: the base computes the L41 representation late and discards it fast; the AV parks the injected copy at its token and streams ~quarter-strength summaries to the positions writing the explanation; the AR rebuilds the activation from text gradually, reaching ~90% of its data-rich ceiling by truncation depth. Experiment 008 says all of this survives doubling the probe data — shapes are stable, absolute numbers are floors with a known haircut. Single-position probes still undercount KV-resident information; that is the one caveat a position-sweep follow-up could close.