Natural Language Autoencoders · follow-up series
09

Adapters vs the text route — and what an FVE is worth

The owed-honesty check on the interlingua claim: give learned alignment maps every nonlinear advantage, then price every headline number in optimal-linear-code rank.

date: 2026-06-11
idea: #18 + capacity panel
data: 19995 aligned pairs (gemma L41 ↔ qwen L20), fixed 1000-pair eval
compute: no serving; salvaged artifacts only
0.397 < 0.422 · MLP vs ridge, 20k pairs · nonlinear loses to linear at all 10 grid points
+0.007 · RBF − ridge at 20k, g→q · +0.038 q→g; the kernel edge asymptotes to noise
90k / 238k · pairs for MLP to match text · log-log extrapolation, decelerating curves
2927 / 5376 · rank equiv. of same-model 0.774 · L41 is not low-rank; high FVE is not cheap
§1

The question and the contract

Report 01 showed that gemma explanations reconstruct qwen activations at 0.592 FVE with zero paired data, beating a ridge map fit on 20k aligned pairs (0.422; reverse direction 0.528 vs 0.240). The gap in that comparison carried the “cross-model interlingua” claim, but ridge is a linear competitor. If a nonlinear adapter closes the gap, the claim deflates to “language beats linear alignment.” This experiment runs that check before the claim escapes into any external writeup.

Challengers: a 2-layer GELU MLP (hidden 1024–8192, 7 configs, val-selected, early stopping) and RBF kernel ridge (γ, α swept) — generous selection budgets on purpose: the harder we try, the stronger a null. Same train pairs (UFW docs 102k–106k), same fixed 1000-pair eval set, and the scoring contract replicated from ridge_scaling.py bit-for-bit: unit-normalized rows, predictions renormalized to ‖·‖=√d, identical denominator and data permutation, eval untouched by selection.
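A minimal sketch of one reading of the scoring contract described above, not the contents of ridge_scaling.py. It assumes that both gold and predicted rows are rescaled to norm √d and that the FVE denominator is the total variance of the gold rows around their per-dimension mean; the function names `renorm_rows` and `fve` are hypothetical.

```python
import numpy as np

def renorm_rows(x: np.ndarray) -> np.ndarray:
    """Rescale every row to L2 norm sqrt(d)."""
    d = x.shape[1]
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x * (np.sqrt(d) / np.maximum(norms, 1e-12))

def fve(gold: np.ndarray, pred: np.ndarray) -> float:
    """Fraction of variance explained, 1 - SSE / SST.

    ASSUMPTION: SST is measured around the per-dimension mean of the
    (renormalized) gold rows; the report only says "identical denominator".
    """
    gold = renorm_rows(gold)
    pred = renorm_rows(pred)
    sse = np.sum((gold - pred) ** 2)
    sst = np.sum((gold - gold.mean(axis=0, keepdims=True)) ** 2)
    return float(1.0 - sse / sst)

# Toy check on synthetic vectors (not data from the report).
rng = np.random.default_rng(0)
toy_gold = rng.normal(size=(50, 16)) + 1.0
toy_mean_pred = np.tile(toy_gold.mean(axis=0), (50, 1))
perfect_score = fve(toy_gold, toy_gold)          # exact reconstruction
constant_score = fve(toy_gold, toy_mean_pred)    # same vector for every row
```

Under these assumptions a constant prediction always scores below zero, because renormalizing it to full norm moves it away from the mean that minimizes squared error. That is the same effect the qwen-space rank curve shows at tiny rank.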

§2

Nonlinearity is not the missing ingredient

Fig. 1 gemma→qwen: eval FVE vs training pairs (log x). The MLP never catches plain ridge; RBF tracks ridge within +0.007 at 20k. The zero-paired-data text route (dashed) sits 0.16 above anything learned.
Fig. 2 qwen→gemma, the hard direction: the kernel closes more of ridge's deficit here (+0.038), so the starker ridge asymmetry was partly a linear-regression artifact — but the best adapter still reaches barely half the text route.
| route at 19995 pairs | g→q FVE | q→g FVE |
|---|---|---|
| text route (0 pairs) | 0.592 | 0.528 |
| RBF kernel ridge | 0.429 | 0.278 |
| ridge (linear) | 0.422 | 0.240 |
| MLP, val-selected | 0.397 | 0.199 |
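For readers who want the two closed-form routes in the table side by side, here is a small NumPy sketch of linear ridge and RBF kernel ridge. It is an illustration under assumptions, not the experiment's code: the function names are hypothetical, no intercept is fit, and the γ and α values in the toy run are arbitrary rather than the swept values.

```python
import numpy as np

def fit_ridge(X, Y, alpha):
    """Closed-form ridge map W minimizing ||XW - Y||^2 + alpha * ||W||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def rbf_kernel(A, B, gamma):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negatives from rounding

def fit_kernel_ridge(X, Y, gamma, alpha):
    """Dual coefficients C solving (K + alpha * I) C = Y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + alpha * np.eye(len(X)), Y)

def predict_kernel_ridge(X_train, C, X_new, gamma):
    """Predict by weighting the dual coefficients with kernel similarities."""
    return rbf_kernel(X_new, X_train, gamma) @ C

# Toy run on synthetic data (not the aligned pairs from the report).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
W_true = rng.normal(size=(5, 3))
Y = X @ W_true
W_hat = fit_ridge(X, Y, alpha=1e-8)
C_hat = fit_kernel_ridge(X, Y, gamma=5.0, alpha=1e-6)
Y_fit = predict_kernel_ridge(X, C_hat, X, gamma=5.0)
```

The kernel solve is n × n in the number of training pairs, while the ridge solve is d × d in the source dimension, which is worth keeping in mind when scaling either route.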

MLP capacity costs more in estimation error than it buys in expressivity at this n — the val-selected winner was the widest net in one direction, the deepest in the other, and both still lost to ridge.

Finding: Across ten grid points the MLP scores zero wins over plain ridge; the kernel’s edge shrinks as n grows. Extrapolating the MLP curves (decelerating in log n), matching the text route takes ~9×10⁴ pairs gemma→qwen and ~2.4×10⁵ qwen→gemma. “Language beats learned alignment at 20k pairs” stands with no linearity caveat.
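The report does not spell out how the extrapolation was fit, so the following is only one plausible reading: a straight line in (log n, log FVE) through the last few grid points, solved for the pair count at which it reaches a target FVE. The function name `pairs_to_match` and the `last_k` parameter are hypothetical, and the numbers in the toy run are a synthetic power law, not the measured MLP curve.

```python
import numpy as np

def pairs_to_match(n_pairs, fve_vals, target, last_k=3):
    """Fit log(FVE) = a + b * log(n) on the last_k points; solve for FVE = target.

    Requires positive FVE values. Returns inf if the fitted slope is not positive.
    """
    x = np.log(np.asarray(n_pairs, dtype=float)[-last_k:])
    y = np.log(np.asarray(fve_vals, dtype=float)[-last_k:])
    b, a = np.polyfit(x, y, 1)  # degree-1 fit returns [slope, intercept]
    if b <= 0:
        return float("inf")
    return float(np.exp((np.log(target) - a) / b))

# Synthetic power-law curve for illustration only.
toy_n = np.array([1000, 2000, 5000, 10000, 20000], dtype=float)
toy_fve = 0.05 * toy_n ** 0.2
toy_target = 0.05 * 100000.0 ** 0.2   # value the toy curve reaches at 1e5 pairs
toy_needed = pairs_to_match(toy_n, toy_fve, toy_target)
```

Because the measured curves decelerate in log n, fitting only the last few points keeps the slope conservative; a fit over the whole curve would reach the target sooner and understate the pairs needed.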
§3

What is an FVE worth? The capacity panel

Reconstruction numbers have been quoted in raw FVE throughout the series, including the release’s headline 0.775, with no unit conversion. The natural calibration is the rank r* at which an optimal linear code (PCA, fit on the 19995 train vectors, scored on the eval golds under the same contract) achieves the same FVE. First, the spectra kill the cheap-eval worry: gemma L41’s top component carries 7.0% of variance, 50% needs rank 767, 90% needs 3138 (qwen L20: 2.9% / 171 / 1689). This eval set is not secretly low-rank.

Fig. 3 FVE vs PCA rank, gemma L41 space (log x, full rank 5376 ⇒ FVE 1.0 — contract sanity check). The same-model NLA round-trip lands at rank ≈ 2927; the qwen→gemma text route at ≈ 1344; the best learned adapter at ≈ 324. The curve rises smoothly through the region, so the interpolations are well-conditioned.
Fig. 4 Same calibration in qwen L20 space (d=3584). Negative FVE at tiny rank is the renormalize-to-√d contract punishing a near-mean reconstruction blown up to full norm — kept for consistency with every other number in the series.
| number | space | FVE | r* | of d |
|---|---|---|---|---|
| NLA same-model round-trip | gemma L41 | 0.774 | 2927 | 5376 |
| NLA same-model round-trip | qwen L20 | 0.752 | 929 | 3584 |
| text route, qwen→gemma | gemma L41 | 0.528 | 1344 | 5376 |
| text route, gemma→qwen | qwen L20 | 0.592 | 389 | 3584 |
| finetuned-model drift decode | gemma L41 | 0.420 | 831 | 5376 |
| recursion, round 7 | gemma L41 | 0.416 | 814 | 5376 |
| RBF adapter @20k, q→g | gemma L41 | 0.278 | 324 | 5376 |
| ridge @20k, q→g | gemma L41 | 0.240 | 222 | 5376 |
| RBF adapter @20k, g→q | qwen L20 | 0.429 | 168 | 3584 |
| ridge @20k, g→q | qwen L20 | 0.422 | 161 | 3584 |
| MLP @20k, g→q | qwen L20 | 0.397 | 141 | 3584 |
| MLP @20k, q→g | gemma L41 | 0.199 | 134 | 5376 |

r* interpolated on the measured FVE-vs-rank curve; smooth and rising through every quoted region (gemma: 0.713 @ r=2476 → 0.883 @ r=3856).
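The calibration can be sketched in a few lines: fit PCA on the train vectors, score rank-r reconstructions of the eval vectors, then interpolate the rank at which the curve reaches a given FVE. This is a sketch under assumptions. The scorer `plain_fve` is a stand-in without the renormalize-to-√d step, the interpolation is linear in rank (the report does not say which interpolation it used), and every name and the toy data are hypothetical.

```python
import numpy as np

def plain_fve(gold, pred):
    """1 - SSE / SST around the gold mean. Stand-in scorer, no renormalization."""
    sse = np.sum((gold - pred) ** 2)
    sst = np.sum((gold - gold.mean(axis=0)) ** 2)
    return float(1.0 - sse / sst)

def pca_fve_curve(train, eval_gold, ranks, score=plain_fve):
    """FVE of rank-r PCA reconstructions of eval_gold, with PCA fit on train."""
    mu = train.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(train - mu, full_matrices=False)
    centered = eval_gold - mu
    curve = []
    for r in ranks:
        basis = vt[:r]                           # shape (r, d)
        recon = centered @ basis.T @ basis + mu  # project, then un-center
        curve.append(score(eval_gold, recon))
    return np.array(curve)

def rank_equivalent(ranks, curve, target):
    """Rank at which an increasing FVE-vs-rank curve reaches target (linear interp)."""
    return float(np.interp(target, curve, ranks))

# Toy run on synthetic vectors with a decaying spectrum (not the report's data).
rng = np.random.default_rng(0)
scales = np.linspace(3.0, 0.5, 8)
toy_train = rng.normal(size=(500, 8)) * scales
toy_eval = rng.normal(size=(100, 8)) * scales
toy_ranks = np.arange(0, 9)
toy_curve = pca_fve_curve(toy_train, toy_eval, toy_ranks)
toy_r_star = rank_equivalent(toy_ranks, toy_curve, target=0.5)
```

Passing the series' own scorer as `score` would put the curve under the same contract as the quoted numbers. `np.interp` expects the curve to be increasing, so under that contract any non-monotone low-rank region would need to be excluded first.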

Finding: A ~148-token explanation reconstructs a gemma L41 state as well as an optimal rank-2927 linear code, more than half the ambient dimensionality. Crossing families, the text route carries 2.4× (g→q) to 6× (q→g) the rank-equivalent of ridge, and 2.3× to 4.1× that of the best learned adapter (RBF). The deflationary reading (“0.775 is just a few dozen PCs”) is dead.
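The rank-equivalent ratios can be recomputed directly from the r* column of the table above. The snippet below is plain arithmetic on those table values; the dictionary layout and the `ratio` helper are just illustrative.

```python
# r* values copied from the capacity panel table.
r_star = {
    ("text", "g->q"): 389,
    ("rbf", "g->q"): 168,
    ("ridge", "g->q"): 161,
    ("text", "q->g"): 1344,
    ("rbf", "q->g"): 324,
    ("ridge", "q->g"): 222,
}

def ratio(direction: str, baseline: str) -> float:
    """Text-route r* divided by the baseline adapter's r* in the same direction."""
    return r_star[("text", direction)] / r_star[(baseline, direction)]

for direction in ("g->q", "q->g"):
    for baseline in ("rbf", "ridge"):
        print(direction, baseline, round(ratio(direction, baseline), 2))
# g->q rbf 2.32
# g->q ridge 2.42
# q->g rbf 4.15
# q->g ridge 6.05

same_model_fraction = 2927 / 5376  # same-model r* as a share of gemma L41's d
print(round(same_model_fraction, 3))  # 0.544
```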
§4

Pre-registered bracket, scored

Verdict
The interlingua claim survives its owed-honesty check at full strength and comes out upgraded: the right headline is no longer “text beats ridge by 0.17 FVE” but “the text route moves 2.3–4.1× the linear-rank equivalent of the best adapter learned from 20k aligned pairs (2.4–6× that of ridge), and the same-model channel operates at rank-2927-of-5376 equivalent, through ~148 tokens of English you can read.” Caveats that ship with it: r* at high rank is an interpolation off a 20k-sample PCA (directionally solid, ±10%-ish); adapter conclusions are bounded by the 2×10⁵-pair extrapolation, not proven beyond it.