Natural Language Autoencoders · follow-up series

The verbalization-intervention grid

Every way we know to mess with the explanation text — post-hoc surgery, sampler bans, grammar forcing, prefill hijack — rerun on one substrate with full artifacts. Sixteen replication cells land within ±0.002; six new cells, two of which broke our registered predictions.

date: 2026-06-13
substrate: canonical 1000 · golds re-extracted + verified (L41 margin 13×)
scoring: frozen AR · FVE, train denom 0.0579
compute: 2×B200, one AV server per GPU, scoring in parallel

16/16 · replications within ±0.002

−0.59 vs −0.76 · bare vs quoted REDACTED · marks misdirect harder

0.587 · S3, all marks banned · prose absorbs the payload

−0.244 · S4b spans, forced budget · vs 0.130 unforced

§1

The whole grid, one scale

mean FVE · n=1000 per cell

Fig. 1 — All A (post-hoc, gold) and B (on-policy, green) cells; NEW cells in red. Replication targets from June-10/012/013 reproduce within ±0.002. The spread is the story: the same explanation channel runs from 0.775 (intact) to −0.77 (one register span plus 129 spaces of padding).
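For reading the scale, a minimal sketch of how an FVE score could be computed. It assumes FVE means one minus the reconstruction MSE over a fixed training-set denominator (the 0.0579 in the header); the function name and the toy vectors are hypothetical, and the actual scoring code is not shown in this write-up. Under that assumption, a negative FVE means the reconstruction is worse than a constant baseline, which is how a cell can land at −0.77.

```python
TRAIN_DENOM = 0.0579  # from the header; assumed to be the normalizing training variance


def fve(pred, gold, denom=TRAIN_DENOM):
    """Fraction of variance explained, assumed here to be 1 - MSE / denom."""
    if len(pred) != len(gold):
        raise ValueError("pred and gold must have the same length")
    mse = sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold)
    return 1.0 - mse / denom


# Toy vectors, purely illustrative.
gold = [0.10, -0.20, 0.30, 0.00]
perfect = fve(gold, gold)                      # exact reconstruction
misled = fve([-0.30, 0.20, -0.30, 0.40], gold)  # reconstruction pointing the wrong way

print(perfect, misled)
```

A score of 0 corresponds to an MSE equal to the denominator; anything below that is actively misleading rather than merely uninformative.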

cell	block	intervention	FVE	target (Δ)
P0	post-hoc	baseline rescore (gate G0)	0.775	0.775 (Δ+0.000)
P1	post-hoc	spans only, marks kept	0.507	0.506 (Δ+0.001)
P2	post-hoc	spans only, bare	0.196	0.196 (Δ+0.000)
P3	post-hoc	spans → nothing	-0.091	-0.090 (Δ-0.001)
P4	post-hoc	spans → ""	-0.349	-0.350 (Δ+0.001)
P5a	post-hoc	spans → "REDACTED" (marks kept)	-0.759	-0.760 (Δ+0.001)
P5b	post-hoc	spans → bare REDACTED	-0.590	NEW
P6a	post-hoc	delete "-family chars	0.738	0.738 (Δ+0.000)
P6b	post-hoc	delete ALL broad-set marks	0.731	NEW
P7a	post-hoc	"-family → *	0.760	0.760 (Δ+0.000)
P7b	post-hoc	broad set → ~	0.581	NEW
S0	on-policy	min-140 floor only (control)	0.774	NEW
S1	on-policy	ban "-family tokens	0.757	0.757 (Δ+0.000)
S2	on-policy	ban "+'-family tokens	0.617	0.616 (Δ+0.001)
S3	on-policy	ban ALL mark families + floor	0.587	NEW
S4a	on-policy	spans-only grammar + floor	-0.765	NEW
S4b	on-policy	scaffold + spans-only grammar + floor	-0.051	NEW
S4b·spans	on-policy	S4b spans-only rescore	-0.244	NEW

Post-hoc cells edit the June-10 baseline decodes and rescore (AR-OOD: all lower bounds). On-policy cells constrain the sampler itself. Floor = EOS and every <-containing token masked until 140 generated tokens.
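The floor described above can be sketched as a logits mask. This is a minimal illustration, not the sampler used in the runs: the helper names, the toy vocabulary, and the list-based logits are all assumptions.

```python
import math

FLOOR = 140  # minimum number of generated tokens before the explanation may close


def floor_blocked_ids(vocab, eos_id):
    """Ids masked while under the floor: EOS plus every token containing '<'."""
    blocked = {eos_id}
    blocked.update(i for i, tok in enumerate(vocab) if "<" in tok)
    return blocked


def apply_floor(logits, n_generated, blocked, floor=FLOOR):
    """Return a copy of logits with blocked ids set to -inf until the floor is reached."""
    if n_generated >= floor:
        return list(logits)
    return [-math.inf if i in blocked else x for i, x in enumerate(logits)]


# Toy vocabulary (hypothetical); index 3 plays the role of EOS.
vocab = ["the", " article", "</", "<eos>", "explanation", ">"]
blocked = floor_blocked_ids(vocab, eos_id=3)

step_logits = [0.1, 0.2, 3.0, 2.5, 0.0, 0.3]
early = apply_floor(step_logits, n_generated=12, blocked=blocked)
late = apply_floor(step_logits, n_generated=140, blocked=blocked)
```

Masking every <-containing token, not just the closing tag, is what keeps the policy from opening the close tag early and finishing it piecewise.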

§2

Marks are asymmetric: free honest, costly adversarial

P5–P7

013 showed quote marks are nearly redundant to the reader when the content is real (delete them: 0.775 → 0.738). The new cells flip the sign of that story when the content is fake. Replace every span with a bare REDACTED and the AR scores −0.590; wrap the same placeholder in quote marks and it drops to −0.759. The mark pair is a type-tag that says “this is payload”, and the AR believes it.

Finding: A tampered explanation does the most damage when it keeps the format. Marks cost nothing on real content and amplify garbage, which is exactly the wrong shape for an audit trail.
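A minimal sketch of the post-hoc span surgery behind P3 to P5b, assuming a span is any run of text between straight double quotes. The regex, the function name, and the sample decode are hypothetical; the real span extractor and the full "-family of marks are not specified here.

```python
import re

# Assumption: a span is the text between a pair of straight double quotes.
SPAN = re.compile(r'"([^"]*)"')

# Replacement per cell, following the intervention column of the grid.
REPLACEMENTS = {
    "P3": "",              # spans -> nothing
    "P4": '""',            # spans -> empty quote pair
    "P5a": '"REDACTED"',   # placeholder, marks kept
    "P5b": "REDACTED",     # placeholder, bare
}


def edit_spans(text, cell):
    """Apply one post-hoc edit to every span in a decode."""
    return SPAN.sub(REPLACEMENTS[cell], text)


decode = 'Medical article. "UV-IR spectra" lists methods.'
quoted = edit_spans(decode, "P5a")
bare = edit_spans(decode, "P5b")
print(quoted)
print(bare)
```

The only difference between the two outputs is the mark pair around the placeholder, which is the contrast the −0.759 vs −0.590 gap isolates.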

Second inversion: for the broad 30-char mark set, deletion is almost free (P6b 0.731) but substitution with ~ craters to 0.581 (P7b). That is the opposite of the "-family pattern, where the substitution cell was the cheaper one (P7a 0.760 > P6a 0.738). The registered prediction missed by −0.17. Suspected mechanism: apostrophe collateral. Substitution turns don't into don~t, corrupting word forms that the AR reads through when the apostrophe is simply deleted (dont).
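The suspected apostrophe collateral is easy to see on a toy string. The sketch below uses a three-character stand-in for the 30-character broad set, which is not listed in this write-up; function names and the sample sentence are hypothetical.

```python
# Hypothetical stand-in for the broad mark set (the real set has 30 characters).
MARKS = {'"', "'", "`"}


def delete_marks(text, marks=MARKS):
    """P6b-style edit: drop every mark character."""
    return "".join(ch for ch in text if ch not in marks)


def substitute_marks(text, marks=MARKS, repl="~"):
    """P7b-style edit: replace every mark character with a single filler."""
    return "".join(repl if ch in marks else ch for ch in text)


sample = 'The text says "don\'t repeat" twice.'
deleted = delete_marks(sample)
substituted = substitute_marks(sample)

print(deleted)       # The text says dont repeat twice.
print(substituted)   # The text says ~don~t repeat~ twice.
```

Deletion leaves a common misspelling; substitution plants a foreign character inside the word, and does so for every contraction and possessive in the decode.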

§3

Ban every mark and the prose just absorbs it

S3 · escape inventory

S3 is the hatch-free forcing: 10,101 vocab tokens banned (every token containing any of 30 mark characters, plus byte fallbacks), min-140 floor on. The policy doesn't collapse: it writes fluent markless prose at 0.587, inside the registered 0.55–0.62 band, with zero banned-character leaks in 1000 decodes. The generation tax vs the post-hoc analog (P6b 0.731) is ~0.14, replicating 013's tax at the broad-set scale.
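A minimal sketch of how a character-level ban list like S3's could be built, plus the tax arithmetic. The toy vocabulary and the four-character mark subset are hypothetical, and byte-fallback tokens are not handled here.

```python
def banned_token_ids(vocab, mark_chars):
    """Ids of every token whose text contains at least one banned character."""
    return [i for i, tok in enumerate(vocab) if any(ch in mark_chars for ch in tok)]


# Hypothetical subset of the 30 mark characters and a toy vocabulary.
mark_chars = set("\"'`*")
vocab = ["hello", ' "', "don", "'t", "**", " world", '."']
banned = banned_token_ids(vocab, mark_chars)
print(banned)  # [1, 3, 4, 6]

# Generation tax: post-hoc deletion (P6b) minus on-policy ban (S3), from the grid.
generation_tax = round(0.731 - 0.587, 3)
print(generation_tax)  # 0.144
```

Banning by substring rather than by exact token is what pushes the count into the thousands: any merged token that happens to carry a mark goes with it.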

char	codepoint	occurrences	% of decodes
•	U+2022	166	4.4%
–	U+2013	142	11.7%
״	U+05F4	68	3.0%
‹LTR mark›	U+200E	24	2.4%
\	U+005C	13	1.3%
é	U+00E9	11	0.4%
ö	U+00F6	5	0.3%
…	U+2026	5	0.4%

The escape inventory: what the policy reaches for when its whole delimiter vocabulary is gone. Two entries stand out: the en-dash (the banned set has the em-dash, one codepoint over) and Hebrew gershayim ״, a pixel-perfect double-quote homoglyph.

Escape of the run: ״ (U+05F4 HEBREW PUNCTUATION GERSHAYIM), visually identical to a double quote, used as a span delimiter in 3.0% of decodes. The 012-style fullwidth-＜ trick, new glyph. Char-class bans do not close; they relocate.
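An inventory like the one above can be sketched as a scan for characters that are neither banned nor expected in plain prose. The allowlist, the toy ban set, and the sample decodes are assumptions; only `unicodedata.name` and `collections.Counter` are standard-library facts.

```python
import string
import unicodedata
from collections import Counter

# Hypothetical allowlist of characters considered ordinary prose.
EXPECTED = set(string.ascii_letters + string.digits + " \n.,;:/()-")


def escape_inventory(decodes, banned_chars):
    """Count unexpected, non-banned characters: total occurrences and decodes containing them."""
    occurrences, decodes_with = Counter(), Counter()
    for text in decodes:
        unexpected = [ch for ch in text if ch not in EXPECTED and ch not in banned_chars]
        occurrences.update(unexpected)
        decodes_with.update(set(unexpected))
    return occurrences, decodes_with


def describe(ch):
    """Codepoint plus Unicode name, for the inventory table."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}"


banned = set("\"'\u2014")  # toy ban set: straight quotes and the em-dash
decodes = [
    "Report \u05f4final item\u05f4 in a list",
    "Policy text \u2013 a list item \u2013 follows",
    "Plain prose only.",
]
occ, per_decode = escape_inventory(decodes, banned)
for ch, n in occ.most_common():
    print(describe(ch), n, per_decode[ch])
```

Scanning for what is present, rather than for what was banned, is the only way a homoglyph shows up at all: a leak check against the ban set reports zero by construction.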

§4

Forcing span production backfires

S4a / S4b · the floor experiment

012 established that the AV cannot generate spans without a prose scaffold. The new cells ask: what if we force it to keep generating? The min-140 floor itself is a no-op (S0 = 0.774; the policy is RL-pinned at 148±2 anyway). But pointed at span-only grammars it exposes the escape repertoire in order: S4a space-pads the unbounded separator (249/1000 hit the 300-token cap, degeneration 99.9%) and scores −0.765 — the padded register tag actively misleads. S4b bounds the separator, closing the padding route — so the policy leaks fullwidth ＜ inside span bodies via a known xgrammar multibyte char-class gap, in 928/1000 decodes, and the spans-only score lands at −0.244, far below the unforced 0.130.

Finding: The forced budget is spent on escape, not payload. Span production does not scale by decree: the production asymmetry from 012 survives, and sharpens, under forcing.

S4a · space-pads the unbounded separator to ride out the floor (879/1000)

<explanation> "Medical research article" ⟨129 spaces⟩ </explanation>

S4b · bounded separator closes the padding route → fullwidth ＜ leaks inside spans via the xgrammar multibyte gap (928/1000)

<explanation> Scientific article format with structured bullet-point data, listing chemical/biological parameters of aromatic compounds in plant extracts, India. "UV-IR spectra were characterized by... IR bands, IR-FTIR, HPT" "is a list of analytical techniques, typical in analytical chemistry, describing chromatographic or spectroscopic methods for compound identification. " "HPT" "HPT" "{HPT" ",{＜⟨×44⟩"</expl

S3 · Hebrew gershayim ״, a pixel-perfect " homoglyph the char scan didn't cover (3.0% of decodes)

<explanation> Marketing analytics/insights framework content, summarizing a structured recommendation from Nielsen/industry data, with actionable insights for retailers. The quote pattern: the text is explaining best practices for optimizing media spend, with a contrast point about granular data levels versus aggregated sales data. Final token phrase begins a specific example clause describing superior data granula

S3 · en-dash – as delimiter: the banned set has em-dash, so the policy slides one codepoint (11.7% of decodes)

<explanation> Article structure: a formal Spanish-language policy/educational report, summarising initiatives about Barcelona digital skills and libraries. List-based enumeration pattern: the text is cataloguing programmes, with a final item about a public space or project, likely a Barcelona library or coding space. Final token begins a location or venue description list item in a civic programme announcement: La

Long space runs and ＜-repeats compressed to ⟨…⟩ markers for display; raw decodes in the artifacts.
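A minimal sketch of that display compression, assuming the markers report the full run length and that only runs of at least `min_run` characters are collapsed. The threshold and the function name are hypothetical; the marker shapes follow the examples above.

```python
import re


def compress_for_display(text, min_run=8):
    """Collapse long space runs and fullwidth-< repeats into compact markers."""
    # Runs of spaces become a marker that states the run length.
    text = re.sub(
        r" {%d,}" % min_run,
        lambda m: f" \u27e8{len(m.group())} spaces\u27e9 ",
        text,
    )
    # Runs of fullwidth less-than (U+FF1C) keep one glyph plus a repeat count.
    text = re.sub(
        "\uff1c{%d,}" % min_run,
        lambda m: f"\uff1c\u27e8\u00d7{len(m.group())}\u27e9",
        text,
    )
    return text


raw = '"Medical research article"' + " " * 129 + "</explanation>"
print(compress_for_display(raw))
```

Short runs are left alone, so ordinary double spaces in prose do not pick up markers.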

§5

Prefill hijack, replicated to the third decimal

F-cells · exp 015 rerun

cell	intervention	vs A (Δ)	vs B (Δ)
F0	self-prefix, full text	0.775 (Δ+0.000)	—
F0c	self-prefix, continuation	0.593 (Δ+0.000)	—
F1	hijack, full text	0.338 (Δ+0.000)	-0.425 (Δ+0.001)
F1c	hijack, continuation	0.643 (Δ+0.000)	-0.708 (Δ+0.000)
F2	donor prefix alone	-0.685 (Δ+0.000)	0.550 (Δ+0.000)

A = the injected vector's doc, B = the donor whose half-explanation was prefilled. Every cell within 0.001 of exp 015 — prefill-as-embedding-rows is deterministic-stable across the full stack rebuild.

25 : 1 · span provenance A : B · 8.0% vs 0.32%

58 : 1 · strict ≥15-char spans

0.16% · momentum-confabulation · donor spans not in prefix, by chars

0.643 · hijack continuation vs A · snap-back intact
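One way a span-provenance count could be taken, as a rough sketch only: it assumes provenance means a quoted span appears verbatim in the source document for A or for B, and that the strict variant simply drops spans shorter than 15 characters. The matching rule, names, and toy texts are all hypothetical; the actual attribution procedure is not described here.

```python
import re

# Assumption: a span is non-empty text between straight double quotes.
SPAN = re.compile(r'"([^"]+)"')


def provenance(continuations, docs_a, docs_b, min_len=1):
    """Count spans found verbatim in doc A, in doc B, and spans considered in total."""
    hits_a = hits_b = total = 0
    for text, doc_a, doc_b in zip(continuations, docs_a, docs_b):
        for span in SPAN.findall(text):
            if len(span) < min_len:
                continue
            total += 1
            hits_a += span in doc_a
            hits_b += span in doc_b
    return hits_a, hits_b, total


# Toy example, purely illustrative.
continuations = ['The text lists "analytical techniques" and "media spend".']
docs_a = ["a list of analytical techniques used in chemistry"]
docs_b = ["optimizing media spend for retailers"]

loose = provenance(continuations, docs_a, docs_b)
strict = provenance(continuations, docs_a, docs_b, min_len=15)
print(loose, strict)
```

Raising the length threshold removes short spans that could match either document by chance, which is the direction in which the A : B ratio above widens.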

§6

Registered-prediction scorecard

house practice

✓ P5b bare REDACTED softer than quoted (−0.65 ± 0.15): −0.590; marks-as-adversarial-type-tags confirmed
✓ P6b broad deletion 0.70–0.73: 0.731, band edge
✗ P7b broad substitution 0.74–0.76: 0.581; substitution ≫ deletion damage, unanticipated
✓ S0 floor is a no-op (|Δ| < 0.01): 0.774; all floor cells read against baseline
✓ S3 in 0.55–0.62 with a >5% novel delimiter: 0.587 + en-dash at 11.7%; double hit
✓ S4a escapes or loops, FVE ≪ 0.2: −0.765 via space-padding; one attempt, recorded
✗ S4b forced budget adds payload (spans 0.18–0.30): −0.244; forcing degrades, direction wrong

Verdict: 5 / 7

The replication layer is now boring in the best way — sixteen cells, max |Δ| 0.002, on fresh golds and a rebuilt stack. The misses both point the same direction: we keep modeling the text channel as content-plus-decoration, and it keeps behaving like a co-adapted code. Substituted delimiters poison what deletion spares; forced span budgets buy escape behavior, not information. The reader-side mark question is now settled as asymmetric: free on honest content, costly on tampered content — which is precisely the combination an auditor should not want.