Natural Language Autoencoders · follow-up series
19

The verbalization-intervention grid

Every way we know to mess with the explanation text — post-hoc surgery, sampler bans, grammar forcing, prefill hijack — rerun on one substrate with full artifacts. Sixteen replication cells land within ±0.002; six new cells, two of which broke our registered predictions.

date: 2026-06-13
substrate: canonical 1000 · golds re-extracted + verified (L41 margin 13×)
scoring: frozen AR · FVE, train denom 0.0579
compute: 2×B200, one AV server per GPU, scoring in parallel

16/16 · replications within ±0.002
−0.59 vs −0.76 · bare vs quoted REDACTED · marks misdirect harder
0.587 · S3, all marks banned · prose absorbs the payload
−0.244 · S4b spans, forced budget · vs 0.130 unforced

§1

The whole grid, one scale

Fig. 1 All A (post-hoc, gold) and B (on-policy, green) cells; NEW cells in red. Replication targets from June-10/012/013 reproduce within ±0.002. The spread is the story: the same explanation channel runs from 0.775 (intact) to −0.77 (one register span plus 129 spaces of padding).
| cell | block | intervention | FVE | target (Δ) |
|---|---|---|---|---|
| P0 | post-hoc | baseline rescore (gate G0) | 0.775 | 0.775 (Δ+0.000) |
| P1 | post-hoc | spans only, marks kept | 0.507 | 0.506 (Δ+0.001) |
| P2 | post-hoc | spans only, bare | 0.196 | 0.196 (Δ+0.000) |
| P3 | post-hoc | spans → nothing | -0.091 | -0.090 (Δ-0.001) |
| P4 | post-hoc | spans → "" | -0.349 | -0.350 (Δ+0.001) |
| P5a | post-hoc | spans → "REDACTED" (marks kept) | -0.759 | -0.760 (Δ+0.001) |
| P5b | post-hoc | spans → bare REDACTED | -0.590 | NEW |
| P6a | post-hoc | delete "-family chars | 0.738 | 0.738 (Δ+0.000) |
| P6b | post-hoc | delete ALL broad-set marks | 0.731 | NEW |
| P7a | post-hoc | "-family → * | 0.760 | 0.760 (Δ+0.000) |
| P7b | post-hoc | broad set → ~ | 0.581 | NEW |
| S0 | on-policy | min-140 floor only (control) | 0.774 | NEW |
| S1 | on-policy | ban "-family tokens | 0.757 | 0.757 (Δ+0.000) |
| S2 | on-policy | ban "+'-family tokens | 0.617 | 0.616 (Δ+0.001) |
| S3 | on-policy | ban ALL mark families + floor | 0.587 | NEW |
| S4a | on-policy | spans-only grammar + floor | -0.765 | NEW |
| S4b | on-policy | scaffold + spans-only grammar + floor | -0.051 | NEW |
| S4b·spans | on-policy | S4b spans-only rescore | -0.244 | NEW |

Post-hoc cells edit the June-10 baseline decodes and rescore (AR-OOD: all lower bounds). On-policy cells constrain the sampler itself. Floor = EOS and every <-containing token masked until 140 generated tokens.
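
To read the negative scores in the grid, it helps to see what a negative FVE means. The sketch below assumes FVE is the usual 1 − error / denominator, with the quoted train denominator 0.0579 as the denominator; the source states only "FVE, train denom 0.0579", so the formula and the function names are assumptions made for illustration.

```python
# Assumption: FVE = 1 - reconstruction_error / denom, with the train
# denominator quoted in the header. The source does not spell this out.
TRAIN_DENOM = 0.0579


def fve(error: float, denom: float = TRAIN_DENOM) -> float:
    """Fraction of variance explained under the assumed definition."""
    return 1.0 - error / denom


def implied_error(score: float, denom: float = TRAIN_DENOM) -> float:
    """Invert the assumed definition: the error a given FVE would imply."""
    return (1.0 - score) * denom


if __name__ == "__main__":
    # A few FVE values taken from the grid.
    for cell, score in [("P0", 0.775), ("P3", -0.091), ("P5a", -0.759)]:
        print(f"{cell}: FVE {score:+.3f} -> implied error {implied_error(score):.4f}")
```

Under this reading, any cell below zero has a reconstruction error larger than the denominator itself, so the explanation is not merely uninformative but pushes the reconstruction away from the target.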

§2

Marks are asymmetric: free honest, costly adversarial

013 showed quote marks are nearly redundant to the reader when the content is real (delete them: 0.775 → 0.738). The new cells flip the sign of that story when the content is fake. Replace every span with a bare REDACTED and the AR scores −0.590; wrap the same placeholder in quote marks and it drops to −0.759. The mark pair is a type-tag that says “this is payload”, and the AR believes it.

Finding: A tampered explanation does the most damage when it keeps the format. Marks cost nothing on real content and amplify garbage, the exact wrong shape for an audit trail.
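
The post-hoc span edits behind cells P1 through P5b can be sketched as plain string surgery. This is a minimal illustration, assuming spans are delimited by straight double quotes with no nesting; the regex and the function names are hypothetical, not the pipeline's actual code.

```python
import re

# Assumption: a span is any run of text between two straight double quotes.
SPAN = re.compile(r'"([^"]*)"')


def spans_only(text: str, keep_marks: bool = True) -> str:
    """P1/P2 analogue: drop the prose, keep the spans with or without marks."""
    bodies = SPAN.findall(text)
    return " ".join(f'"{b}"' if keep_marks else b for b in bodies)


def replace_spans(text: str, filler: str, keep_marks: bool) -> str:
    """P3-P5 analogue: overwrite every span body with `filler`."""
    replacement = f'"{filler}"' if keep_marks else filler
    # A callable avoids backslash processing in the replacement string.
    return SPAN.sub(lambda _match: replacement, text)


if __name__ == "__main__":
    text = 'News report. "said the minister" ends a clause.'
    print(replace_spans(text, "REDACTED", keep_marks=True))   # P5a-style
    print(replace_spans(text, "REDACTED", keep_marks=False))  # P5b-style
    print(replace_spans(text, "", keep_marks=True))           # P4-style
    print(spans_only(text, keep_marks=False))                 # P2-style
```

The only difference between the P5a-style and P5b-style outputs is the pair of quote characters around the placeholder, which is the difference the grid prices at −0.759 versus −0.590.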

Second inversion: for the broad 30-char mark set, deletion is almost free (P6b 0.731) but substitution with ~ craters to 0.581. That is the opposite of the "-family pattern, where the substitution cell was the cheaper one (0.760 > 0.738). The registered prediction missed by −0.17. Suspected mechanism: apostrophe collateral. Substitution turns don't into don~t, corrupting a word form that the AR reads through when the apostrophe is simply deleted (dont).
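
The suspected apostrophe collateral is easy to reproduce on a toy string. The sketch below uses a small stand-in for the broad mark set (the real 30-character set is not listed here, so `BROAD_MARKS` is an assumption) and contrasts a P6b-style deletion with a P7b-style substitution.

```python
# Assumption: a small stand-in for the broad mark set; the real set has 30 chars.
BROAD_MARKS = set("\"'`\u201c\u201d\u2018\u2019\u00ab\u00bb")


def delete_marks(text: str, marks: set = BROAD_MARKS) -> str:
    """P6b analogue: remove every mark character."""
    return "".join(ch for ch in text if ch not in marks)


def substitute_marks(text: str, marks: set = BROAD_MARKS, repl: str = "~") -> str:
    """P7b analogue: replace every mark character with `repl`."""
    return "".join(repl if ch in marks else ch for ch in text)


if __name__ == "__main__":
    sample = "The author doesn't say \"why\"."
    print(delete_marks(sample))      # The author doesnt say why.
    print(substitute_marks(sample))  # The author doesn~t say ~why~.
```

Deletion leaves `doesnt`, a common misspelling a reader can plausibly recover, while substitution leaves `doesn~t`, a form that is unlikely to appear in ordinary text.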

§3

Ban every mark and the prose just absorbs it

S3 is the hatch-free forcing: 10,101 vocab tokens banned (every token containing any of 30 mark characters, plus byte fallbacks), min-140 floor on. The policy doesn't collapse: it writes fluent markless prose at 0.587, inside the registered 0.55–0.62 band, with zero banned-character leaks in 1000 decodes. The generation tax vs the post-hoc analog (P6b 0.731) is ~0.14, replicating 013's tax at the broad-set scale.
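
The S3 ban list is described as every token containing any mark character. A minimal sketch of that construction, plus the leak check, is below. It runs on a toy vocabulary, the function names are hypothetical, and it deliberately leaves out the byte-fallback tokens mentioned above because their encoding depends on the tokenizer.

```python
def banned_token_ids(vocab: dict, banned_chars: set) -> set:
    """Ids of all tokens whose string contains at least one banned character."""
    return {
        token_id
        for token, token_id in vocab.items()
        if any(ch in banned_chars for ch in token)
    }


def count_leaks(decodes: list, banned_chars: set) -> int:
    """Number of decodes that contain any banned character."""
    return sum(any(ch in banned_chars for ch in text) for text in decodes)


if __name__ == "__main__":
    # Toy vocabulary and a toy banned set, for illustration only.
    toy_vocab = {"the": 0, ' "': 1, "it's": 2, "\u2014": 3, " quote": 4, "\u201d.": 5}
    toy_banned = set("\"'\u2014\u201c\u201d")
    print(sorted(banned_token_ids(toy_vocab, toy_banned)))
    print(count_leaks(["clean prose", 'has "marks"'], toy_banned))
```

Banning at the token level rather than the character level is what catches merged tokens such as a space followed by a quote, which a single-character filter on the vocabulary would miss.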

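| char | codepoint | occurrences | % of decodes |
|---|---|---|---|
| • | U+2022 | 166 | 4.4% |
| – | U+2013 | 142 | 11.7% |
| ״ | U+05F4 | 68 | 3.0% |
| ‹LTR mark› | U+200E | 24 | 2.4% |
| \ | U+005C | 13 | 1.3% |
| é | U+00E9 | 11 | 0.4% |
| ö | U+00F6 | 5 | 0.3% |
| … | U+2026 | 5 | 0.4% |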

The escape inventory: what the policy reaches for when its whole delimiter vocabulary is gone. The standouts are the en-dash (the banned set contains the em-dash, one codepoint over) and the Hebrew gershayim ״, a pixel-perfect double-quote homoglyph.

Escape of the run: ״ (U+05F4 HEBREW PUNCTUATION GERSHAYIM), visually identical to a double quote, used as a span delimiter in 3.0% of decodes. This is the 012-style fullwidth-< trick with a new glyph. Char-class bans do not close the channel; they relocate it.
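
An inventory like the one above can be approximated by counting every character that is neither banned nor unremarkable. The sketch below is one way to do that with the standard library; the allow-list `PLAIN` and the function name are assumptions, and the actual scan may have used different criteria.

```python
from collections import Counter
import unicodedata

# Assumption: a hypothetical allow-list of characters not worth reporting.
PLAIN = set(" .,;:/()-\n")


def escape_inventory(decodes: list, banned: set, plain: set = PLAIN) -> list:
    """Rows of (codepoint, Unicode name, occurrences, % of decodes), most frequent first."""
    occurrences, decodes_with = Counter(), Counter()
    for text in decodes:
        unusual = [
            ch for ch in text
            if ch not in banned
            and ch not in plain
            and not (ch.isascii() and ch.isalnum())
        ]
        occurrences.update(unusual)        # every occurrence
        decodes_with.update(set(unusual))  # at most once per decode
    total = len(decodes)
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"), count,
         100.0 * decodes_with[ch] / total)
        for ch, count in occurrences.most_common()
    ]


if __name__ == "__main__":
    sample = ["a \u05f4b\u05f4 c", "x \u2013 y", "plain text"]
    for row in escape_inventory(sample, banned={'"'}):
        print(row)
```

Reporting the Unicode name next to the codepoint matters here because a homoglyph such as U+05F4 is indistinguishable from a straight double quote when the decodes are read by eye.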
§4

Forcing span production backfires

012 established that the AV cannot generate spans without a prose scaffold. The new cells ask: what if we force it to keep generating? The min-140 floor itself is a no-op (S0 = 0.774; the policy is RL-pinned at 148±2 anyway). But pointed at span-only grammars it exposes the escape repertoire in order: S4a space-pads the unbounded separator (249/1000 hit the 300-token cap, degeneration 99.9%) and scores −0.765 — the padded register tag actively misleads. S4b bounds the separator, closing the padding route — so the policy leaks fullwidth < inside span bodies via a known xgrammar multibyte char-class gap, in 928/1000 decodes, and the spans-only score lands at −0.244, far below the unforced 0.130.

Finding: The forced budget is spent on escape, not payload. Span production does not scale by decree. The production asymmetry from 012 survives, and sharpens, under forcing.
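
The min-140 floor used in the on-policy cells (EOS and every <-containing token masked until 140 tokens have been generated) can be sketched as a logits mask. This is a framework-free illustration on plain lists; the function names and the toy vocabulary are assumptions, and a real sampler would apply the same mask to a tensor inside its logits-processing hook.

```python
import math


def floor_blocklist(vocab: dict, eos_id: int) -> set:
    """EOS plus every token whose string contains '<'."""
    return {eos_id} | {tid for tok, tid in vocab.items() if "<" in tok}


def apply_floor(logits: list, n_generated: int, blocked_ids: set, floor: int = 140) -> list:
    """Copy of `logits` with blocked tokens set to -inf while under the floor."""
    if n_generated >= floor:
        return list(logits)
    return [-math.inf if i in blocked_ids else x for i, x in enumerate(logits)]


if __name__ == "__main__":
    # Toy vocabulary for illustration only.
    toy_vocab = {"<eos>": 0, "hello": 1, " <": 2, "</": 3, " world": 4}
    blocked = floor_blocklist(toy_vocab, eos_id=0)
    logits = [0.1, 0.2, 0.3, 0.4, 0.5]
    print(apply_floor(logits, n_generated=10, blocked_ids=blocked))
    print(apply_floor(logits, n_generated=140, blocked_ids=blocked))
```

The mask only removes the ways to stop. It says nothing about what fills the remaining budget, which is consistent with the padding and leak behaviors reported for S4a and S4b.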

S4a: space-pads the unbounded separator to ride out the floor (879/1000)

<explanation> "Medical research article" ⟨129 spaces⟩ </explanation>

S4b: bounded separator closes the padding route → fullwidth < leaks inside spans via the xgrammar multibyte gap (928/1000)

<explanation> Scientific article format with structured bullet-point data, listing chemical/biological parameters of aromatic compounds in plant extracts, India. "UV-IR spectra were characterized by... IR bands, IR-FTIR, HPT" "is a list of analytical techniques, typical in analytical chemistry, describing chromatographic or spectroscopic methods for compound identification.  " "HPT" "HPT" "{HPT" ",{<⟨×44⟩"</expl

S3: Hebrew gershayim ״, a pixel-perfect " homoglyph the char scan didn't cover (3.0% of decodes)

<explanation> Marketing analytics/insights framework content, summarizing a structured recommendation from Nielsen/industry data, with actionable insights for retailers. The quote pattern: the text is explaining best practices for optimizing media spend, with a contrast point about granular data levels versus aggregated sales data. Final token phrase begins a specific example clause describing superior data granula

S3: en-dash – as delimiter; the banned set has em-dash, so the policy slides one codepoint (11.7%)

<explanation> Article structure: a formal Spanish-language policy/educational report, summarising initiatives about Barcelona digital skills and libraries. List-based enumeration pattern: the text is cataloguing programmes, with a final item about a public space or project, likely a Barcelona library or coding space. Final token begins a location or venue description list item in a civic programme announcement: La

Long space runs and <-repeats compressed to ⟨…⟩ markers for display; raw decodes in the artifacts.

§5

Prefill hijack, replicated to the third decimal

| cell | intervention | vs A (Δ) | vs B (Δ) |
|---|---|---|---|
| F0 | self-prefix, full text | 0.775 (Δ+0.000) | |
| F0c | self-prefix, continuation | 0.593 (Δ+0.000) | |
| F1 | hijack, full text | 0.338 (Δ+0.000) | -0.425 (Δ+0.001) |
| F1c | hijack, continuation | 0.643 (Δ+0.000) | -0.708 (Δ+0.000) |
| F2 | donor prefix alone | -0.685 (Δ+0.000) | 0.550 (Δ+0.000) |

A = the injected vector's doc, B = the donor whose half-explanation was prefilled. Every cell within 0.001 of exp 015 — prefill-as-embedding-rows is deterministic-stable across the full stack rebuild.

25 : 1 · span provenance A : B · 8.0% vs 0.32%
58 : 1 · strict ≥15-char spans
0.16% · momentum-confabulation · donor spans not in prefix, by chars
0.643 · hijack continuation vs A · snap-back intact

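
One way to arrive at a provenance ratio like the A : B figures above is to check each quoted span in the hijacked continuation for a verbatim match in either source document. The sketch below is only an illustration of that idea: the verbatim-substring criterion, the straight-quote span regex, and the function name are all assumptions, and the actual metric (including the by-chars variant) may be defined differently.

```python
import re

# Assumption: spans are delimited by straight double quotes.
SPAN = re.compile(r'"([^"]*)"')


def span_provenance(continuation: str, doc_a: str, doc_b: str, min_chars: int = 1) -> dict:
    """Count spans found verbatim in doc A, doc B, both, or neither."""
    counts = {"A": 0, "B": 0, "both": 0, "neither": 0}
    for body in SPAN.findall(continuation):
        if len(body) < min_chars:
            continue  # e.g. min_chars=15 for a strict variant
        in_a, in_b = body in doc_a, body in doc_b
        if in_a and in_b:
            counts["both"] += 1
        elif in_a:
            counts["A"] += 1
        elif in_b:
            counts["B"] += 1
        else:
            counts["neither"] += 1
    return counts


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog"
    doc_b = "pack my box with five dozen liquor jugs"
    cont = 'Text with "quick brown fox" and "five dozen" and "unseen phrase".'
    print(span_provenance(cont, doc_a, doc_b))
    print(span_provenance(cont, doc_a, doc_b, min_chars=15))
```

Raising `min_chars` filters out short spans that could match either document by chance, which is the usual reason to report a strict variant next to the headline ratio.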
§6

Registered-prediction scorecard

Verdict: 5 / 7
The replication layer is now boring in the best way — sixteen cells, max |Δ| 0.002, on fresh golds and a rebuilt stack. The misses both point the same direction: we keep modeling the text channel as content-plus-decoration, and it keeps behaving like a co-adapted code. Substituted delimiters poison what deletion spares; forced span budgets buy escape behavior, not information. The reader-side mark question is now settled as asymmetric: free on honest content, costly on tampered content — which is precisely the combination an auditor should not want.