Transformer Circuits follow-up · 2026-06
Natural Language Autoencoders: experiments
Corroboration, ablation and transfer studies on the released Gemma-3-27B-L41 and Qwen2.5-7B-L20 checkpoints. One presentation per report.
- 005-progress
- 007-pt-swap · Exp 007: The base model through the IT-trained NLA
  The radius test that wasn't: cos(pt,it) = 0.989 at L41, so instruction tuning barely rotates this layer on web prose. The frozen NLA reads its base-model ancestor at FVE 0.627, 81% of ceiling, +0.10 over zero-pt-info baselines.
- 008-progress · Exp 008: 20 cells at 20k pairs (live)
  Doubles probe training data on 20 selected (stack, layer) cells; levels should lift ~uniformly, shapes should hold.
- 009-adapters · Exp 009: Adapters vs the text route
  MLP and RBF adapters lose to plain ridge at 20k pairs; the zero-pair text route stands, no linearity caveat. PCA calibration: same-model 0.774 ≈ a rank-2927/5376 linear code.
- 011-progress · Exp 011: run telemetry (complete)
  Live progress page for the causal-KL run: stages, per-condition KL medians, decay curves.
- 012-interventions · Exp 012b: Verbalization interventions, side by side
  The same 18 eval sites under every intervention (quote bans, redaction, quotes-only grammars), with per-site FVE. The payload migrates into prose when quotes are banned (0.62) and dies when prose is (0.13 scaffolded, −0.8 bare): prose is generative scaffolding.
- 018-crossfamily
- 019-verbalization-grid · Exp 019: The verbalization-intervention grid
  Post-hoc surgery, sampler bans, grammar forcing, prefill hijack: 22 cells on one substrate. 16/16 replications within ±0.002. New: marks amplify tampered content (−0.76 vs −0.59 bare), the policy escapes a 10k-token mark ban via Hebrew gershayim, and forcing span production backfires (−0.244 vs 0.130 unforced).
- 008 · report-008
- 01 · Corroboration & ablation series
  FVE 0.775 reproduced; quotes are confabulated, not copied; spans = payload, marks = type tags, prose = glue; text beats a 20k-pair ridge map.
- 011 · Exp 011: Causal KL: does the reconstruction error matter?
  The SAE loss-recovered eval the paper never ran: 98.7% recovered vs the resample floor, but FVE and causal damage decouple per-item (ρ≈0), ~5% of the residual's energy is lost content, and wrong-but-realistic content is 2× worse than deletion.
- 02 · Evaluation & ablation report
  The seven-point TL;DR cut: format-keyed code, converged policy, finetune-drift survival, and the AR's value head as a near-no-op.
- 03 · Corroboration series: verdict edition
  The same series scored: 8.5/10. Adds the AV/AR self-drift check (the AV moved 3× more than the AR).
- 04 · Probing the released NLA
  Fullest writeup: three corroborations, the OpenInterp two-tier critique assessed, reproducibility inventory.
- 05 · Stratified error analysis
  Where the 0.775 lives: clean prose ~0.80–0.86, digits 0.56; error is predictable before any decode (cv-ρ² 0.556).
- 06 · Recursion: iterated AV∘AR dynamics
  Eight round-trips: FVE 0.774 → 0.416, a conveyor belt not a basin; Bartlett's serial reproduction, mechanically reproduced.
- s3-agreement · S3d: Verbalizations vs top SAE features (exp 020)
  1000 activations read both ways: NLA prose next to top-10 SAE features with labels. The paper's 'strong agreement' claim, made browsable.
- s3-residual · S3c: SAE on the reconstruction residuals (exp 020)
  What does the text channel drop? The sanity gate answers first: residuals are as alien to the SAE as matched Gaussian noise (FVU ~1554 vs ~1453), so the enrichment table is an artifact, reported anyway.
- s3-sae · S3b: Matched SAE features through the AV (exp 020)
  GemmaScope-2 decoder directions nearest each axis (cos recorded, low-cos kept), run through the identical cells; autointerp labels and logit-lens tokens shown.
- s3-vectors · S3a: Diff-of-means vectors into the AV (exp 020)
  13 contrast axes + a norm-matched random control: bare directions decoded at four injection scales, and steered document forwards decoded against the unsteered baseline with the behavioral completion adjacent.
- status · Box status: GPUs, VRAM, baton queue (live)
  Utilization and VRAM graphs for every GPU on the box, the GPU-baton queue, and current compute processes. 10s sampling, 3h window.
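
Many entries above quote FVE figures, some of them negative, and Exp 007 quotes a "% of ceiling". This index does not define FVE, so the sketch below assumes the common reading: one minus squared reconstruction error over total variance around the per-dimension mean. It also assumes the "ceiling" is the same-model FVE of about 0.775. The function names `fve` and `fraction_of_ceiling` are hypothetical and are not taken from the reports.

```python
def fve(targets, reconstructions):
    """Fraction of variance explained under the assumed definition:
    1 - (sum of squared errors) / (sum of squared deviations from the mean)."""
    n = len(targets)
    dim = len(targets[0])
    # Per-dimension mean of the target vectors.
    mean = [sum(t[j] for t in targets) / n for j in range(dim)]
    sse = sum(
        (t[j] - r[j]) ** 2
        for t, r in zip(targets, reconstructions)
        for j in range(dim)
    )
    sst = sum((t[j] - mean[j]) ** 2 for t in targets for j in range(dim))
    return 1.0 - sse / sst


def fraction_of_ceiling(fve_value, ceiling):
    """Ratio of an observed FVE to a reference (ceiling) FVE."""
    return fve_value / ceiling


# Assumed ceiling: the same-model FVE of 0.775 quoted in the listing.
ratio = fraction_of_ceiling(0.627, 0.775)  # about 0.81
```

Under this definition a reconstruction that does worse than predicting the mean gives a negative FVE, which is one way to read values such as −0.8 in the listing.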