Natural Language Autoencoders · follow-up series

Adapters vs the text route — and what an FVE is worth

The owed-honesty check on the interlingua claim: give learned alignment maps every nonlinear advantage, then price every headline number in optimal-linear-code rank.

date: 2026-06-11
idea: #18 + capacity panel
data: 19995 aligned pairs (gemma L41 ↔ qwen L20), fixed 1000-pair eval
compute: no serving — salvaged artifacts only

headline	what it measures	note
0.397 < 0.422	MLP vs ridge, 20k pairs	nonlinear loses to linear at all 10 grid points
+0.007	RBF − ridge at 20k, g→q	+0.038 q→g; the kernel edge asymptotes to noise
90k / 238k	pairs for MLP to match text	log-log extrapolation, decelerating curves
2927 / 5376	rank equiv. of same-model 0.774	L41 is not low-rank; high FVE is not cheap

§1

The question and the contract

method

Report 01 showed that gemma explanations reconstruct qwen activations at 0.592 FVE with zero paired data, beating a ridge map fit on 20k aligned pairs (0.422; in the reverse direction, 0.528 vs 0.240). The gap in that comparison carried the “cross-model interlingua” claim, but ridge is a linear competitor. If a nonlinear adapter closes the gap, the claim deflates to “language beats linear alignment.” This experiment runs that check before the claim escapes into any external writeup.

Challengers: a 2-layer GELU MLP (hidden 1024–8192, 7 configs, val-selected, early stopping) and RBF kernel ridge (γ, α swept) — generous selection budgets on purpose: the harder we try, the stronger a null. Same train pairs (UFW docs 102k–106k), same fixed 1000-pair eval set, and the scoring contract replicated from ridge_scaling.py bit-for-bit: unit-normalized rows, predictions renormalized to ‖·‖=√d, identical denominator and data permutation, eval untouched by selection.
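A minimal sketch of a scorer in the spirit of this contract may help make the later numbers concrete. It is not ridge_scaling.py: the function names are hypothetical, and two details are assumptions, namely that gold rows are also rescaled to norm √d and that the denominator is the squared deviation of the gold rows from their eval-set mean.

```python
import numpy as np

def renorm_rows(x: np.ndarray, target_norm: float) -> np.ndarray:
    """Rescale every row of x to the given L2 norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x * (target_norm / np.maximum(norms, 1e-12))

def contract_fve(gold: np.ndarray, pred: np.ndarray) -> float:
    """FVE = 1 - SSE / SST after rescaling gold and pred rows to norm sqrt(d).

    ASSUMPTION: SST is the total squared deviation of the (rescaled) gold
    rows from their eval-set mean; the report only says the denominator
    is "identical" across methods.
    """
    d = gold.shape[1]
    g = renorm_rows(gold, np.sqrt(d))
    p = renorm_rows(pred, np.sqrt(d))
    sse = np.sum((g - p) ** 2)
    sst = np.sum((g - g.mean(axis=0, keepdims=True)) ** 2)
    return float(1.0 - sse / sst)

# Synthetic stand-in for eval activations (not the report's data).
rng = np.random.default_rng(0)
gold = rng.normal(size=(1000, 64)) + 1.0

perfect = contract_fve(gold, gold)
# A constant "mean" prediction gets blown up to full norm before scoring.
mean_only = contract_fve(gold, np.tile(gold.mean(axis=0), (1000, 1)))
```

Under these assumptions a constant prediction scores below zero, because renormalizing it to full norm moves it away from the mean of the rescaled golds.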

§2

Nonlinearity is not the missing ingredient

panel a+b

Fig. 1 — gemma→qwen: eval FVE vs training pairs (log x). The MLP never catches plain ridge; RBF tracks ridge within +0.007 at 20k. The zero-paired-data text route (dashed) sits 0.16 above anything learned.

Fig. 2 — qwen→gemma, the hard direction: the kernel closes more of ridge's deficit here (+0.038), so the starker ridge asymmetry was partly a linear-regression artifact — but the best adapter still reaches barely half the text route.

route at 19995 pairs	g→q FVE	q→g FVE
text route (0 pairs)	0.592	0.528
RBF kernel ridge	0.429	0.278
ridge (linear)	0.422	0.240
MLP, val-selected	0.397	0.199

MLP capacity costs more in estimation error than it buys in expressivity at this n — the val-selected winner was the widest net in one direction, the deepest in the other, and both still lost to ridge.
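For readers who want the two non-MLP challengers spelled out, here is a minimal NumPy sketch of closed-form ridge and RBF kernel ridge. The function names, hyperparameters, and toy data are illustrative assumptions, not the experiment's code; the toy target is linear by construction, so it says nothing about the real activations.

```python
import numpy as np

def fit_ridge(X, Y, alpha):
    """Linear map W minimizing ||XW - Y||^2 + alpha * ||W||^2 (closed form)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def rbf_kernel(A, B, gamma):
    """K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_kernel_ridge(X, Y, gamma, alpha):
    """Dual coefficients; predict with rbf_kernel(X_new, X, gamma) @ coef."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + alpha * np.eye(len(X)), Y)

# Toy aligned pairs (hypothetical sizes; the report uses 19995 real pairs).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 16))
W_true = rng.normal(size=(16, 8))
Y = X @ W_true + 0.01 * rng.normal(size=(400, 8))
X_ev = rng.normal(size=(100, 16))
Y_ev = X_ev @ W_true

W = fit_ridge(X, Y, alpha=1e-3)
ridge_mse = np.mean((X_ev @ W - Y_ev) ** 2)

coef = fit_kernel_ridge(X, Y, gamma=1.0 / 16, alpha=1e-3)
krr_mse = np.mean((rbf_kernel(X_ev, X, 1.0 / 16) @ coef - Y_ev) ** 2)
```

In the experiment itself γ and α were swept; a single setting is used here only to keep the sketch short.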

Finding: Ten grid points, zero wins for the MLP over plain ridge; the kernel’s edge shrinks as n grows. Extrapolating the MLP curves (decelerating in log n), matching the text route takes ~9×10⁴ pairs gemma→qwen and ~2.4×10⁵ qwen→gemma. “Language beats learned alignment at 20k pairs” stands with no linearity caveat.
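The extrapolation step can be sketched as a straight-line fit in log-log space, solved for the target FVE. The helper name and the toy curve below are assumptions for illustration; the report's actual grid points and fitting choices are not reproduced here.

```python
import numpy as np

def pairs_to_reach(n_grid, fve_grid, target_fve):
    """Fit log(FVE) = a * log(n) + b by least squares, then solve FVE == target.

    Requires positive FVE values. If the true curve keeps decelerating,
    a straight-line fit is optimistic and the result reads as a lower
    bound on the pairs needed.
    """
    a, b = np.polyfit(np.log(n_grid), np.log(fve_grid), deg=1)
    return float(np.exp((np.log(target_fve) - b) / a))

# Toy power-law curve (made up, NOT the measured MLP curve).
n_grid = np.array([1e3, 2e3, 5e3, 1e4, 2e4])
fve_grid = 0.1 * n_grid ** 0.15

n_star = pairs_to_reach(n_grid, fve_grid, target_fve=0.5)
```

On an exact power law the fit recovers the analytic crossing point; on real, noisy, decelerating curves the answer depends on which grid points enter the fit.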

§3

What is an FVE worth? The capacity panel

panel c

Reconstruction numbers have been quoted in raw FVE throughout the series, including the release’s headline 0.775, with no unit conversion. The natural calibration is the rank r* at which an optimal linear code (PCA, fit on the 19995 train vectors, scored on the eval golds under the same contract) achieves the same FVE. First, the spectra kill the cheap-eval worry: gemma L41’s top component carries 7.0% of variance, 50% needs rank 767, and 90% needs rank 3138 (qwen L20: 2.9% / 171 / 1689). This eval set is not secretly low-rank.
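The spectrum figures quoted above (top-component share, rank needed for 50% and 90% of variance) can be computed along the following lines. This is a sketch under the assumption that variance is measured after mean-centering; the function name and the synthetic matrix are hypothetical.

```python
import numpy as np

def spectrum_summary(X, fractions=(0.5, 0.9)):
    """Return (share of variance in the top PC, {fraction: smallest rank reaching it})."""
    Xc = X - X.mean(axis=0, keepdims=True)      # ASSUMPTION: mean-centered PCA
    s = np.linalg.svd(Xc, compute_uv=False)     # singular values, descending
    var = s ** 2
    cum = np.cumsum(var) / var.sum()
    top_share = float(var[0] / var.sum())
    # searchsorted gives the first index whose cumulative share >= f; rank is index + 1.
    ranks = {f: int(np.searchsorted(cum, f) + 1) for f in fractions}
    return top_share, ranks

# Synthetic activations with a decaying spectrum (not gemma or qwen data).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 64)) * np.linspace(3.0, 0.1, 64)
top_share, ranks = spectrum_summary(X)
```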

Fig. 3 — FVE vs PCA rank, gemma L41 space (log x, full rank 5376 ⇒ FVE 1.0 — contract sanity check). The same-model NLA round-trip lands at rank ≈ 2927; the qwen→gemma text route at ≈ 1344; the best learned adapter at ≈ 324. The curve rises smoothly through the region, so the interpolations are well-conditioned.

Fig. 4 — Same calibration in qwen L20 space (d=3584). Negative FVE at tiny rank is the renormalize-to-√d contract punishing a near-mean reconstruction blown up to full norm — kept for consistency with every other number in the series.

quantity	space	FVE	r* (equivalent PCA rank)	d (ambient dim)
NLA same-model round-trip	gemma L41	0.774	2927	5376
NLA same-model round-trip	qwen L20	0.752	929	3584
text route, qwen→gemma	gemma L41	0.528	1344	5376
text route, gemma→qwen	qwen L20	0.592	389	3584
finetuned-model drift decode	gemma L41	0.420	831	5376
recursion, round 7	gemma L41	0.416	814	5376
RBF adapter @20k, q→g	gemma L41	0.278	324	5376
ridge @20k, q→g	gemma L41	0.240	222	5376
RBF adapter @20k, g→q	qwen L20	0.429	168	3584
ridge @20k, g→q	qwen L20	0.422	161	3584
MLP @20k, g→q	qwen L20	0.397	141	3584
MLP @20k, q→g	gemma L41	0.199	134	5376

r* interpolated on the measured FVE-vs-rank curve; smooth and rising through every quoted region (gemma: 0.713 @ r=2476 → 0.883 @ r=3856).
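A minimal sketch of the calibration, assuming a mean-centered PCA fit on the train vectors and a renormalize-to-√d scorer as described in §1; the function names and the rank grid are hypothetical. The last lines interpolate between the two gemma bracket points quoted above.

```python
import numpy as np

def pca_fve_curve(train, eval_gold, ranks):
    """FVE of rank-r PCA reconstructions of eval_gold, PCA fit on train."""
    mu = train.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(train - mu, full_matrices=False)
    d = eval_gold.shape[1]

    def scale(x):
        # Rescale rows to norm sqrt(d), mirroring the scoring contract.
        return x * (np.sqrt(d) / np.linalg.norm(x, axis=1, keepdims=True))

    g = scale(eval_gold)
    sst = np.sum((g - g.mean(axis=0, keepdims=True)) ** 2)
    out = []
    for r in ranks:
        V = Vt[:r]                                   # top-r principal directions
        recon = mu + (eval_gold - mu) @ V.T @ V      # project and reconstruct
        out.append(1.0 - np.sum((g - scale(recon)) ** 2) / sst)
    return np.array(out)

def rank_equivalent(ranks, fves, target_fve):
    """Linearly interpolate the rank where an increasing FVE curve hits target_fve."""
    return float(np.interp(target_fve, np.asarray(fves, float), np.asarray(ranks, float)))

# Sanity check on synthetic data: full rank should reconstruct exactly.
rng = np.random.default_rng(0)
scales = np.linspace(2.0, 0.2, 32)
train = rng.normal(size=(500, 32)) * scales
eval_gold = rng.normal(size=(100, 32)) * scales
curve = pca_fve_curve(train, eval_gold, ranks=[1, 4, 16, 32])

# Two-point interpolation between the quoted gemma bracket points only.
r_two_point = rank_equivalent([2476, 3856], [0.713, 0.883], 0.774)
```

With only these two bracket points, linear interpolation gives roughly 2971, in the neighborhood of the reported 2927; the reported value presumably rests on a denser rank grid that is not reproduced here.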

Finding: A ~148-token explanation reconstructs a gemma L41 state as well as an optimal rank-2927 linear code, more than half the ambient dimensionality (2927 of 5376). Crossing families, the text route carries 2.4× (g→q, 389 vs 161) to 6× (q→g, 1344 vs 222) the rank-equivalent of ridge, and 2.3× (389 vs 168) to 4.1× (1344 vs 324) that of the best learned adapter, RBF kernel ridge. The deflationary reading, “0.775 is just a few dozen PCs,” is dead.

§4

Pre-registered bracket, scored

✓ MLP ≈ ridge, i.e. the gap is informational, not linear (~15% prior): exceeded. Ridge strictly dominated the MLP everywhere; RBF edge ≤ +0.038.
✗ partial close, g→q lands 0.45–0.57 (~55% prior, modal guess 0.50): 0.429 at best. Overestimated nonlinearity again, in the same direction as every prior miss.
✗ adapter matches text route (~30% prior): not close, 0.16–0.25 FVE short in both directions.
~ ridge asymmetry (0.42/0.24) is partly a linear artifact: RBF closes q→g more (+0.038 vs +0.007), but 0.43/0.28 stays far more asymmetric than text's 0.59/0.53.

Verdict

The interlingua claim survives its owed-honesty check at full strength and comes out upgraded. The right headline is no longer “text beats ridge by 0.17 FVE” but “the text route moves 2.3–4.1× the linear-rank equivalent of the best adapter learned from 20k aligned pairs (2.4–6× that of ridge), and the same-model channel operates at rank-2927-of-5376 equivalent, through ~148 tokens of English you can read.” Caveats that ship with it: r* at high rank is an interpolation off a 20k-sample PCA (directionally solid, roughly ±10%); adapter conclusions are bounded by the 2×10⁵-pair extrapolation, not proven beyond it.