Adapters vs the text route — and what an FVE is worth
The owed-honesty check on the interlingua claim: give learned alignment maps every nonlinear advantage, then price every headline number in optimal-linear-code rank.
- date: 2026-06-11
- idea: #18 + capacity panel
- data: 19995 aligned pairs (gemma L41 ↔ qwen L20), fixed 1000-pair eval
- compute: no serving; salvaged artifacts only
The question and the contract
Report 01 showed gemma explanations reconstruct qwen activations at 0.592 FVE with zero paired data, beating a ridge map fit on 20k aligned pairs (0.422; reverse direction 0.528 vs 0.240). The gap in that comparison carried the “cross-model interlingua” claim, but ridge is a linear competitor. If a nonlinear adapter closes the gap, the claim deflates to “language beats linear alignment.” This experiment runs that check before the claim escapes into any external writeup.
Challengers: a 2-layer GELU MLP (hidden 1024–8192, 7 configs, val-selected, early stopping) and RBF kernel ridge (γ, α swept). The selection budgets are generous on purpose: the harder we try, the stronger a null result is. Same train pairs (UFW docs 102k–106k), same fixed 1000-pair eval set, and the scoring contract replicated from ridge_scaling.py bit-for-bit: unit-normalized rows, predictions renormalized to ‖·‖=√d, identical denominator and data permutation, eval untouched by selection.
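The scoring contract can be made concrete with a minimal sketch. This is not the code in ridge_scaling.py: the function names are hypothetical, and two points are assumptions rather than facts from the source. First, both gold and predicted rows are scaled to norm √d (unit RMS per coordinate). Second, the denominator is taken to be the total variance of the gold rows around their mean.

```python
import numpy as np

def renorm(x, eps=1e-12):
    """Rescale every row to L2 norm sqrt(d), i.e. unit RMS per coordinate."""
    d = x.shape[1]
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x * (np.sqrt(d) / np.maximum(norms, eps))

def fve(gold, pred):
    """Fraction of variance explained: 1 - SSE / SST.

    ASSUMPTION: SST is measured around the mean gold row; the source
    only says the denominator is identical across routes.
    """
    gold = renorm(gold)
    pred = renorm(pred)
    sse = np.sum((gold - pred) ** 2)
    sst = np.sum((gold - gold.mean(axis=0, keepdims=True)) ** 2)
    return 1.0 - sse / sst
```

Because predictions are renormalized before scoring, a route gets no credit or penalty for output scale; only direction matters. That is what lets a text route and a learned adapter be scored on the same footing.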
Nonlinearity is not the missing ingredient
| route at 19995 pairs | g→q FVE | q→g FVE |
|---|---|---|
| text route (0 pairs) | 0.592 | 0.528 |
| RBF kernel ridge | 0.429 | 0.278 |
| ridge (linear) | 0.422 | 0.240 |
| MLP, val-selected | 0.397 | 0.199 |
MLP capacity costs more in estimation error than it buys in expressivity at this n — the val-selected winner was the widest net in one direction, the deepest in the other, and both still lost to ridge.
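For reference, the linear baseline that the MLP failed to beat can be sketched as a closed-form ridge map. This is an illustration under assumptions, not the series' implementation: the centering step, the `alpha` default, and the function names are all hypothetical.

```python
import numpy as np

def fit_ridge_map(x, y, alpha=1.0):
    """Fit y ≈ (x - x_mu) @ w + y_mu by ridge regression (closed form).

    x: (n, d_src) source activations, y: (n, d_tgt) target activations.
    ASSUMPTION: inputs and outputs are mean-centered and alpha is a
    single scalar penalty; the source does not state either choice.
    """
    x_mu, y_mu = x.mean(axis=0), y.mean(axis=0)
    xc, yc = x - x_mu, y - y_mu
    d_src = xc.shape[1]
    # Solve (Xc^T Xc + alpha I) W = Xc^T Yc for W.
    w = np.linalg.solve(xc.T @ xc + alpha * np.eye(d_src), xc.T @ yc)
    return w, x_mu, y_mu

def apply_ridge_map(x, w, x_mu, y_mu):
    """Map source activations into the target space."""
    return (x - x_mu) @ w + y_mu
```

The only tuned quantity here is `alpha`, whereas each MLP config must estimate every weight in its hidden layer from the same 19995 pairs. That difference in estimation burden is the trade-off described above.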
What is an FVE worth? The capacity panel
Reconstruction numbers have been quoted in raw FVE throughout the series, including the release’s headline 0.775, with no unit conversion. The natural calibration is the rank r* at which an optimal linear code (PCA, fit on the 19995 train vectors, scored on the eval golds under the same contract) achieves the same FVE. First, the spectra kill the cheap-eval worry: gemma L41’s top component carries 7.0% of variance, 50% needs rank 767, 90% needs 3138 (qwen L20: 2.9% / 171 / 1689). This eval set is not secretly low-rank.
| number | space | FVE | r* | d |
|---|---|---|---|---|
| NLA same-model round-trip | gemma L41 | 0.774 | 2927 | 5376 |
| NLA same-model round-trip | qwen L20 | 0.752 | 929 | 3584 |
| text route, qwen→gemma | gemma L41 | 0.528 | 1344 | 5376 |
| text route, gemma→qwen | qwen L20 | 0.592 | 389 | 3584 |
| finetuned-model drift decode | gemma L41 | 0.420 | 831 | 5376 |
| recursion, round 7 | gemma L41 | 0.416 | 814 | 5376 |
| RBF adapter @20k, q→g | gemma L41 | 0.278 | 324 | 5376 |
| ridge @20k, q→g | gemma L41 | 0.240 | 222 | 5376 |
| RBF adapter @20k, g→q | qwen L20 | 0.429 | 168 | 3584 |
| ridge @20k, g→q | qwen L20 | 0.422 | 161 | 3584 |
| MLP @20k, g→q | qwen L20 | 0.397 | 141 | 3584 |
| MLP @20k, q→g | gemma L41 | 0.199 | 134 | 5376 |
r* is interpolated on the measured FVE-vs-rank curve, which is smooth and rising through every quoted region (gemma: 0.713 @ r=2476 → 0.883 @ r=3856).
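The conversion from FVE to r* can be sketched as follows. It is a simplified illustration with hypothetical function names: PCA directions come from the train vectors, the eval golds are reconstructed at each rank, and r* is read off the resulting curve by linear interpolation. As a stated simplification, the sketch omits the √d renormalization of the scoring contract and assumes the denominator is the eval variance around the eval mean.

```python
import numpy as np

def pca_fve_curve(train, eval_gold, ranks):
    """FVE on eval_gold of a rank-r PCA code fit on train, for each r in ranks."""
    mu = train.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(train - mu, full_matrices=False)
    centered = eval_gold - mu
    sst = np.sum((eval_gold - eval_gold.mean(axis=0)) ** 2)
    curve = []
    for r in ranks:
        basis = vt[:r]                            # (r, d) top-r directions
        recon = mu + centered @ basis.T @ basis   # project, then reconstruct
        curve.append(1.0 - np.sum((eval_gold - recon) ** 2) / sst)
    return np.array(curve)

def equivalent_rank(target_fve, ranks, curve):
    """Interpolate the rank at which the PCA curve reaches target_fve.

    Relies on the curve rising with rank, which holds for nested PCA
    subspaces; np.interp needs its x-coordinates in increasing order.
    """
    return float(np.interp(target_fve, curve, ranks))
```

With a measured curve in hand, a call such as `equivalent_rank(0.592, ranks, curve)` would return the rank-equivalent of one quoted FVE. Note that np.interp clamps targets outside the measured range to the end ranks, so the curve should bracket every number being converted.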
Pre-registered bracket, scored
- ✓ MLP ≈ ridge; gap is informational, not linear (~15% prior). Exceeded: ridge strictly dominated the MLP everywhere; RBF edge ≤ +0.038.
- ✗ partial close, g→q lands 0.45–0.57 (~55% prior, modal guess 0.50). Best was 0.429: overestimated nonlinearity again, same direction as every prior miss.
- ✗ adapter matches text route (~30% prior). Not close: 0.16–0.25 FVE short in both directions.
- ~ ridge asymmetry (0.42/0.24) is partly a linear artifact. RBF closes q→g more (+0.038 vs +0.007), but 0.43/0.28 stays far more asymmetric than text's 0.59/0.53.
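The margins quoted in the bracket follow directly from the FVE table at 19995 pairs. The short check below recomputes them; the dictionary layout and the `g2q`/`q2g` keys are just labels chosen for this sketch.

```python
# FVE at 19995 pairs, copied from the route table (g2q = gemma→qwen).
fve = {
    "text":  {"g2q": 0.592, "q2g": 0.528},
    "rbf":   {"g2q": 0.429, "q2g": 0.278},
    "ridge": {"g2q": 0.422, "q2g": 0.240},
    "mlp":   {"g2q": 0.397, "q2g": 0.199},
}
directions = ("g2q", "q2g")

def gap(a, b, direction):
    """FVE of route a minus FVE of route b in one direction."""
    return round(fve[a][direction] - fve[b][direction], 3)

# Edge of the best nonlinear adapter over the linear baseline.
rbf_edge = {d: gap("rbf", "ridge", d) for d in directions}
# Shortfall of the best adapter relative to the text route.
text_lead = {d: gap("text", "rbf", d) for d in directions}
# Directional asymmetry (g2q minus q2g) within each route.
asymmetry = {r: round(v["g2q"] - v["q2g"], 3) for r, v in fve.items()}

print(rbf_edge, text_lead, asymmetry)
```

The RBF edge comes out at +0.007 and +0.038, and the text route leads by 0.163 and 0.250. The asymmetry is about 0.06 for the text route against 0.15 to 0.20 for every learned map.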