Probe curves: where the activation lives in each stack
Ridge probes reconstruct the gold L41 activation from every layer of the base model, the verbalizer, and the reconstructor — then 20 cells refit at double the training data to calibrate what the levels mean.
- date: 2026-06-11
- experiments: 005 (full curves, 10k pairs) + 008 (20 cells, 20k pairs)
- substrate: 12k+10k positions · UFW-en docs 110000–114799 · stage-0, seed 42
- eval: 1,994 doc-held-out positions · raw-mean denominator 0.0578
Setup
Every probe answers the same question: from the residual stream at layer ℓ of stack S, how much of the original base-model L41 activation (direction-normalized, the eval convention of this series) can a centered linear map recover? The substrate is 12,000 positions from unseen UFW-en docs. The verbalizer (AV) is re-forwarded on its own greedy explanation of each vector and read at five named positions; the reconstructor (AR) reads each explanation through the critic template and is probed at the last token. In total: 421 ridge fits (λ selected by GCV from the train-gram eigendecomposition), one shared FVE denominator, and a doc-level train/eval split.
Two reference cells calibrate everything: the identity cell (base hs42 → its own direction) and the value-head cell (AR hs42, where the shipped Linear(d,d) head reads — trained on millions of pairs, our exact probe class at its data-rich limit).
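The probe recipe described above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the series' actual code: the function names (`fit_ridge_gcv`, `predict`, `fve`, `raw_mean_denominator`), the λ grid, and the toy dimensions are all hypothetical, and the denominator is assumed to be the mean squared distance of the eval targets from their own mean. The GCV score used is the standard form n·RSS / tr(I − H)², evaluated for every λ from a single eigendecomposition of the train Gram matrix.

```python
import numpy as np

def fit_ridge_gcv(X, Y, lambdas):
    """Centered ridge in dual form; lambda chosen by generalized cross-validation."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    n = Xc.shape[0]
    # One eigendecomposition of the n x n train Gram matrix serves every lambda.
    s, Q = np.linalg.eigh(Xc @ Xc.T)
    s = np.clip(s, 0.0, None)              # guard tiny negative eigenvalues
    QtY = Q.T @ Yc
    best_lam, best_gcv = None, np.inf
    for lam in lambdas:
        shrink = lam / (s + lam)           # eigenvalues of (I - H) at this lambda
        rss = np.sum((shrink[:, None] * QtY) ** 2)
        gcv = n * rss / np.sum(shrink) ** 2
        if gcv < best_gcv:
            best_lam, best_gcv = lam, gcv
    alpha = Q @ (QtY / (s + best_lam)[:, None])   # (K + lam I)^-1 Yc
    W = Xc.T @ alpha                               # primal weights, (d_in, d_out)
    return W, x_mean, y_mean, best_lam

def predict(X, W, x_mean, y_mean):
    return (X - x_mean) @ W + y_mean

def raw_mean_denominator(Y_eval):
    """ASSUMPTION: shared denominator = mean squared distance from the eval mean."""
    return np.mean(np.sum((Y_eval - Y_eval.mean(axis=0)) ** 2, axis=1))

def fve(Y_true, Y_pred, denom):
    """Fraction of variance explained against a fixed, shared denominator."""
    mse = np.mean(np.sum((Y_true - Y_pred) ** 2, axis=1))
    return 1.0 - mse / denom

# Synthetic stand-in for (layer activations, target directions); toy sizes only.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 32))
Y = X @ rng.normal(size=(32, 8)) + 0.01 * rng.normal(size=(300, 8))
X_tr, Y_tr, X_ev, Y_ev = X[:200], Y[:200], X[200:], Y[200:]

W, x_mean, y_mean, lam = fit_ridge_gcv(X_tr, Y_tr, np.logspace(-4, 4, 9))
score = fve(Y_ev, predict(X_ev, W, x_mean, y_mean), raw_mean_denominator(Y_ev))
print(f"selected lambda = {lam:g}, held-out FVE = {score:.4f}")
```

Working in the dual (n × n) form is the natural choice when the feature dimension exceeds the number of training pairs, as it does here with d=5376 against 10k to 20k pairs only marginally; whether the original fits used the dual or primal form is not stated in the source.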
The data haircut
| cell | 10k pairs | 20k pairs | Δ | asymptote |
|---|---|---|---|---|
| base@42 (identity) | 0.942 | 0.963 | +0.021 | 1.0 by construction |
| ar@42 (value-head input) | 0.615 | 0.687 | +0.072 | 0.7645 (value head) |
Doubling data moves both cells toward their known ceilings; the AR gap to the value head roughly halves, from about 0.15 to about 0.08. That is consistent with pure data limitation, not a probe-class deficiency.
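The gap arithmetic behind the table can be checked directly. The snippet below only restates the four FVE values and two ceilings from the table; the variable names are illustrative.

```python
# (FVE at 10k pairs, FVE at 20k pairs, known ceiling), copied from the table above.
cells = {
    "base@42": (0.942, 0.963, 1.0),
    "ar@42": (0.615, 0.687, 0.7645),
}

gap_ratio = {}
for name, (fve_10k, fve_20k, ceiling) in cells.items():
    gap_10k = ceiling - fve_10k
    gap_20k = ceiling - fve_20k
    # Fraction of the 10k gap that remains after doubling the data.
    gap_ratio[name] = gap_20k / gap_10k
    print(f"{name}: remaining gap fraction {gap_ratio[name]:.2f}")
```

The AR cell keeps a little over half of its 10k gap and the identity cell a little under two thirds, so both close toward their ceilings but at different rates.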
Base: computed late, discarded fast
The flat early ramp kills the "information was always linearly present" reading: a linear map cannot pull L41's content out of L20 (0.16) because the content does not exist yet. Even adjacent L40 probes at only 0.71 — block 41 does real computational work. The post-peak decay (0.62 by L46, 0.21 at L62) says the stream abandons this representation nearly as fast as it built it.
AV: the vector is parked, not consumed
AR: a smooth climb, no snap
No phase transition: the explanation text is converted back into the activation gradually, with the steep gains in the last third. The value head (0.7645 on this eval set; published 0.775) sits above the 10k and 20k ridge endpoints (0.615 and 0.687), where a data-rich limit of the same probe class would be expected to land.
20k validation: levels lift, shapes hold
| stack | layer | FVE @10k | FVE @20k | Δ |
|---|---|---|---|---|
| base | 20 | 0.155 | 0.182 | +0.027 |
| base | 37 | 0.456 | 0.518 | +0.062 |
| base | 40 | 0.708 | 0.770 | +0.062 |
| base | 41 | 0.805 | 0.857 | +0.052 |
| base | 42 | 0.942 | 0.963 | +0.021 |
| base | 43 | 0.825 | 0.874 | +0.049 |
| base | 46 | 0.624 | 0.696 | +0.073 |
| ar | 20 | 0.172 | 0.202 | +0.030 |
| ar | 26 | 0.204 | 0.230 | +0.026 |
| ar | 34 | 0.298 | 0.342 | +0.044 |
| ar | 39 | 0.446 | 0.514 | +0.069 |
| ar | 42 | 0.615 | 0.687 | +0.072 |
| av_injection | 17 | 0.996 | 0.997 | +0.002 |
| av_injection | 30 | 0.909 | 0.939 | +0.029 |
| av_injection | 41 | 0.851 | 0.898 | +0.047 |
| av_injection | 50 | 0.732 | 0.801 | +0.069 |
| av_injection | 62 | 0.351 | 0.454 | +0.103 |
| av_gen_final | 33 | 0.214 | 0.268 | +0.054 |
| av_gen_final | 45 | 0.221 | 0.277 | +0.056 |
| av_gen_final | 62 | 0.199 | 0.254 | +0.055 |
Mean Δ +0.050, range +0.002 to +0.103. Lifts are largest where inputs are hardest (AV injection L62: +0.103) and smallest at saturation (AV injection L17: +0.002), which is the signature of data limitation, not noise.
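The summary statistics can be reproduced from the Δ column as published. The snippet uses the rounded table values, so it matches the reported figures only to table precision; the dictionary layout is illustrative.

```python
# Delta column of the 20-cell table, keyed by (stack, layer).
deltas = {
    ("base", 20): 0.027, ("base", 37): 0.062, ("base", 40): 0.062,
    ("base", 41): 0.052, ("base", 42): 0.021, ("base", 43): 0.049,
    ("base", 46): 0.073,
    ("ar", 20): 0.030, ("ar", 26): 0.026, ("ar", 34): 0.044,
    ("ar", 39): 0.069, ("ar", 42): 0.072,
    ("av_injection", 17): 0.002, ("av_injection", 30): 0.029,
    ("av_injection", 41): 0.047, ("av_injection", 50): 0.069,
    ("av_injection", 62): 0.103,
    ("av_gen_final", 33): 0.054, ("av_gen_final", 45): 0.056,
    ("av_gen_final", 62): 0.055,
}

values = list(deltas.values())
mean_delta = sum(values) / len(values)
smallest = min(deltas, key=deltas.get)   # cell with the smallest lift
largest = max(deltas, key=deltas.get)    # cell with the largest lift
n_positive = sum(v > 0 for v in values)

print(f"cells: {len(values)}, positive: {n_positive}, mean delta: {mean_delta:.3f}")
print(f"smallest lift: {smallest} {deltas[smallest]:+.3f}")
print(f"largest lift:  {largest} {deltas[largest]:+.3f}")
```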
Prediction scorecard
- ✗ Base FVE = 1 at L41 (trivially): 0.94 at 10k; even the identity map is data-limited at d=5376. The most instructive miss of the series.
- ~ AR output lands at the published 0.77: value head: 0.7645 ✓; a fresh 10k-pair ridge on the same input: 0.61. Probe class ≠ probe data.
- ✗ Base mid-stack near-saturated (ridge undoes drift): L20 = 0.16. Linear maps can't recover content that hasn't been computed.
- ✓ AV vector dispersed rather than consumed: 0.85 plateau through the extraction band (10k); 0.45 even at the last layer (20k).
- ✓ Doubling data lifts levels, preserves shapes: 20/20 cells positive, no rank crossings within a stack, lifts largest where furthest from ceiling.
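The last scorecard item can be checked mechanically from the 20-cell table. The sketch below assumes "no crossings" means that, within each stack, the rank order of layers by FVE is the same at 10k and 20k; that reading is an assumption, and the helper name `rank_order` is illustrative.

```python
# (FVE at 10k, FVE at 20k) per layer, copied from the 20-cell table.
levels = {
    "base": {20: (0.155, 0.182), 37: (0.456, 0.518), 40: (0.708, 0.770),
             41: (0.805, 0.857), 42: (0.942, 0.963), 43: (0.825, 0.874),
             46: (0.624, 0.696)},
    "ar": {20: (0.172, 0.202), 26: (0.204, 0.230), 34: (0.298, 0.342),
           39: (0.446, 0.514), 42: (0.615, 0.687)},
    "av_injection": {17: (0.996, 0.997), 30: (0.909, 0.939),
                     41: (0.851, 0.898), 50: (0.732, 0.801),
                     62: (0.351, 0.454)},
    "av_gen_final": {33: (0.214, 0.268), 45: (0.221, 0.277),
                     62: (0.199, 0.254)},
}

def rank_order(curve, idx):
    """Layers of one stack sorted by FVE at the given data size (0=10k, 1=20k)."""
    return sorted(curve, key=lambda layer: curve[layer][idx])

# Shape is preserved if the within-stack ordering is identical at both sizes.
shape_preserved = {
    stack: rank_order(curve, 0) == rank_order(curve, 1)
    for stack, curve in levels.items()
}
all_lifted = all(
    fve_20k > fve_10k
    for curve in levels.values()
    for fve_10k, fve_20k in curve.values()
)
n_cells = sum(len(curve) for curve in levels.values())

print(f"cells: {n_cells}, all lifted: {all_lifted}")
for stack, ok in shape_preserved.items():
    print(f"{stack}: ordering preserved = {ok}")
```

The check is deliberately per stack. Across stacks, cells at similar levels can swap order (av_gen_final L62 moves from 0.199 to 0.254 while ar L26 moves from 0.204 to 0.230), which says nothing about the shape of either curve.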