The verbalization-intervention grid
Every way we know to mess with the explanation text — post-hoc surgery, sampler bans, grammar forcing, prefill hijack — rerun on one substrate with full artifacts. Sixteen replication cells land within ±0.002; six new cells, two of which broke our registered predictions.
- date: 2026-06-13
- substrate: canonical 1000 · golds re-extracted + verified (L41 margin 13×)
- scoring: frozen AR · FVE, train denom 0.0579
- compute: 2×B200, one AV server per GPU, scoring in parallel
The whole grid, one scale
| cell | block | intervention | FVE | target (Δ) |
|---|---|---|---|---|
| P0 | post-hoc | baseline rescore (gate G0) | 0.775 | 0.775 (Δ+0.000) |
| P1 | post-hoc | spans only, marks kept | 0.507 | 0.506 (Δ+0.001) |
| P2 | post-hoc | spans only, bare | 0.196 | 0.196 (Δ+0.000) |
| P3 | post-hoc | spans → nothing | -0.091 | -0.090 (Δ-0.001) |
| P4 | post-hoc | spans → "" | -0.349 | -0.350 (Δ+0.001) |
| P5a | post-hoc | spans → "REDACTED" (marks kept) | -0.759 | -0.760 (Δ+0.001) |
| P5b | post-hoc | spans → bare REDACTED | -0.590 | NEW |
| P6a | post-hoc | delete "-family chars | 0.738 | 0.738 (Δ+0.000) |
| P6b | post-hoc | delete ALL broad-set marks | 0.731 | NEW |
| P7a | post-hoc | "-family → * | 0.760 | 0.760 (Δ+0.000) |
| P7b | post-hoc | broad set → ~ | 0.581 | NEW |
| S0 | on-policy | min-140 floor only (control) | 0.774 | NEW |
| S1 | on-policy | ban "-family tokens | 0.757 | 0.757 (Δ+0.000) |
| S2 | on-policy | ban "+'-family tokens | 0.617 | 0.616 (Δ+0.001) |
| S3 | on-policy | ban ALL mark families + floor | 0.587 | NEW |
| S4a | on-policy | spans-only grammar + floor | -0.765 | NEW |
| S4b | on-policy | scaffold + spans-only grammar + floor | -0.051 | NEW |
| S4b·spans | on-policy | S4b spans-only rescore | -0.244 | NEW |
Post-hoc cells edit the June-10 baseline decodes and rescore (AR-OOD: all lower bounds). On-policy cells constrain the sampler itself. Floor = EOS and every <-containing token masked until 140 generated tokens.
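The post-hoc span edits can be sketched as plain string surgery on a finished decode. This is a minimal illustration, assuming payload spans are delimited by ASCII double quotes (as in the sample decodes) and that the cells map onto the calls as commented. The function names and the regex are hypothetical, not the pipeline's actual code.

```python
import re

# Assumption: a span is any run of text between two ASCII double quotes.
SPAN = re.compile(r'"([^"]*)"')


def spans_only(text: str, keep_marks: bool) -> str:
    """Drop the prose and keep only the quoted spans (P1 with marks, P2 bare)."""
    spans = SPAN.findall(text)
    if keep_marks:
        return " ".join(f'"{s}"' for s in spans)
    return " ".join(spans)


def replace_spans(text: str, placeholder: str, keep_marks: bool) -> str:
    """Keep the prose and overwrite every span with a placeholder."""
    repl = f'"{placeholder}"' if keep_marks else placeholder
    # A function replacement avoids backslash handling in the placeholder.
    return SPAN.sub(lambda m: repl, text)


decode = 'Report on "alpha beta" and "gamma".'
print(spans_only(decode, keep_marks=True))                 # P1-style
print(spans_only(decode, keep_marks=False))                # P2-style
print(replace_spans(decode, "", keep_marks=False))         # P3-style: spans removed
print(replace_spans(decode, "", keep_marks=True))          # P4-style: empty quotes
print(replace_spans(decode, "REDACTED", keep_marks=True))  # P5a-style
print(replace_spans(decode, "REDACTED", keep_marks=False)) # P5b-style
```

Each edited string would then be rescored by the frozen AR; the rescoring step is outside this sketch.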
Marks are asymmetric: free when honest, costly when adversarial
013 showed quote marks are nearly redundant to the reader when the content is real (delete them: 0.775 → 0.738). The new cells flip the sign of that story when the content is fake. Replace every span with a bare REDACTED and the AR scores −0.590; wrap the same placeholder in quote marks and it drops to −0.759. The mark pair is a type-tag that says “this is payload”, and the AR believes it.
Second inversion: for the broad 30-char mark set, deletion is almost free (P6b 0.731) but substitution with ~ craters to 0.581 (P7b). This is the opposite of the "-family pattern, where the substitution cell was the cheaper one (0.760 > 0.738). Registered prediction missed by −0.17. Suspected mechanism: apostrophe collateral. Substitution turns don't into don~t, corrupting a word form that the AR reads through when the apostrophe is simply deleted (dont).
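The suspected apostrophe collateral is easy to see on a toy string. The sketch below contrasts deletion with substitution. The mark set shown is a small hypothetical subset, since the full 30-character set is not listed in this section.

```python
# Hypothetical subset of the broad mark set; the real set has 30 characters.
MARKS = {'"', "'", "`", "\u201c", "\u201d", "\u2018", "\u2019"}


def delete_marks(text: str, marks=MARKS) -> str:
    """P6-style edit: drop every mark character."""
    return "".join(c for c in text if c not in marks)


def substitute_marks(text: str, marks=MARKS, sub: str = "~") -> str:
    """P7-style edit: overwrite every mark character with a substitute."""
    return "".join(sub if c in marks else c for c in text)


sample = 'They said "don\'t" twice.'
print(delete_marks(sample))      # They said dont twice.
print(substitute_marks(sample))  # They said ~don~t~ twice.
```

Deletion leaves a familiar misspelling, while substitution plants a foreign character in the middle of the word, which is the asymmetry the paragraph above proposes.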
Ban every mark and the prose just absorbs it
S3 is the hatch-free forcing: 10,101 vocab tokens banned (every token containing any of 30 mark characters, plus byte fallbacks), min-140 floor on. The policy doesn't collapse: it writes fluent markless prose at 0.587, inside the registered 0.55–0.62 band, with zero banned-character leaks in 1000 decodes. The generation tax vs the post-hoc analog (P6b 0.731) is ~0.14, which replicates 013's tax at the broad-set scale.
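The ban list and the floor can be sketched as a per-step logit mask. This is an illustration over a toy vocabulary, not the sampler actually used: the vocab, the function names, and the list-of-floats logits are hypothetical, and byte-fallback tokens are not handled here.

```python
import math


def banned_token_ids(vocab: dict, mark_chars: set) -> set:
    """Ids of every token whose string contains any banned character."""
    return {tid for tok, tid in vocab.items() if any(c in mark_chars for c in tok)}


def floor_token_ids(vocab: dict, eos_id: int) -> set:
    """EOS plus every token containing '<', masked until the floor is reached."""
    return {eos_id} | {tid for tok, tid in vocab.items() if "<" in tok}


def apply_constraints(logits, banned, floor_ids, n_generated, floor=140):
    """Return a copy of logits with disallowed token ids set to -inf."""
    blocked = set(banned)
    if n_generated < floor:
        blocked |= floor_ids
    out = list(logits)
    for tid in blocked:
        out[tid] = -math.inf
    return out


# Toy vocabulary (hypothetical).
vocab = {"the": 0, ' "': 1, "don't": 2, "</": 3, "<eos>": 4, " said": 5}
banned = banned_token_ids(vocab, {'"', "'"})
floor_ids = floor_token_ids(vocab, eos_id=4)
print(apply_constraints([0.0] * 6, banned, floor_ids, n_generated=10))
print(apply_constraints([0.0] * 6, banned, floor_ids, n_generated=140))
```

With the cell values quoted above, the generation tax is 0.731 − 0.587 = 0.144.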
| char | codepoint | occurrences | % of decodes |
|---|---|---|---|
| • | U+2022 | 166 | 4.4% |
| – | U+2013 | 142 | 11.7% |
| ״ | U+05F4 | 68 | 3.0% |
| ‹LTR mark› | U+200E | 24 | 2.4% |
| \ | U+005C | 13 | 1.3% |
| é | U+00E9 | 11 | 0.4% |
| ö | U+00F6 | 5 | 0.3% |
| … | U+2026 | 5 | 0.4% |
The escape inventory: what the policy reaches for when its whole delimiter vocabulary is gone. The two standouts are the en-dash (the banned set contains the em-dash, one codepoint over) and the Hebrew gershayim ״, a pixel-perfect double-quote homoglyph.
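An inventory like the table above can be tallied by counting every character outside an expected set, tracking both total occurrences and the share of decodes that contain it. The sketch below is a guess at the bookkeeping: the allowed set, the function name, and the toy decodes are hypothetical.

```python
import string
import unicodedata
from collections import Counter


def escape_inventory(decodes, allowed):
    """Rows of (char, codepoint, name, occurrences, % of decodes containing it)."""
    occurrences = Counter()
    decodes_with = Counter()
    for text in decodes:
        unusual = [c for c in text if c not in allowed]
        occurrences.update(unusual)
        decodes_with.update(set(unusual))  # count each decode at most once
    n = len(decodes)
    return [
        (c, f"U+{ord(c):04X}", unicodedata.name(c, "?"), count, 100.0 * decodes_with[c] / n)
        for c, count in occurrences.most_common()
    ]


# Hypothetical allowed set and toy decodes.
allowed = set(string.ascii_letters + string.digits + " .,")
decodes = ["a \u2013 b \u2013 c", "plain text", "x \u05f4y\u05f4"]
for row in escape_inventory(decodes, allowed):
    print(row)

# The en-dash sits one codepoint below the em-dash.
print(ord("\u2014") - ord("\u2013"))
```

Occurrences and decode share are kept separate because one decode can repeat a character many times, as the bullet row in the table shows.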
Forcing span production backfires
012 established that the AV cannot generate spans without a prose scaffold. The new cells ask: what if we force it to keep generating? The min-140 floor itself is a no-op (S0 = 0.774; the policy is RL-pinned at 148±2 anyway). But pointed at span-only grammars it exposes the escape repertoire in order. S4a space-pads the unbounded separator (249/1000 hit the 300-token cap, degeneration 99.9%) and scores −0.765: the padded register tag actively misleads. S4b bounds the separator, closing the padding route, so the policy instead leaks fullwidth ＜ (U+FF1C) inside span bodies via a known xgrammar multibyte char-class gap, in 928/1000 decodes, and the spans-only score lands at −0.244, far below the unforced 0.130.
<explanation> "Medical research article" ⟨129 spaces⟩ </explanation>
<explanation> Scientific article format with structured bullet-point data, listing chemical/biological parameters of aromatic compounds in plant extracts, India. "UV-IR spectra were characterized by... IR bands, IR-FTIR, HPT" "is a list of analytical techniques, typical in analytical chemistry, describing chromatographic or spectroscopic methods for compound identification. " "HPT" "HPT" "{HPT" ",{<⟨×44⟩"</expl
<explanation> Marketing analytics/insights framework content, summarizing a structured recommendation from Nielsen/industry data, with actionable insights for retailers. The quote pattern: the text is explaining best practices for optimizing media spend, with a contrast point about granular data levels versus aggregated sales data. Final token phrase begins a specific example clause describing superior data granula
<explanation> Article structure: a formal Spanish-language policy/educational report, summarising initiatives about Barcelona digital skills and libraries. List-based enumeration pattern: the text is cataloguing programmes, with a final item about a public space or project, likely a Barcelona library or coding space. Final token begins a location or venue description list item in a civic programme announcement: La
Long space runs and <-repeats compressed to ⟨…⟩ markers for display; raw decodes in the artifacts.
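The display compression can be reproduced with two regex substitutions. This sketch assumes a minimum run length of 8 before compressing and that the ×N marker counts the whole run. Both are guesses, and the function name is hypothetical. It handles the ASCII < and the fullwidth U+FF1C alike.

```python
import re

SPACE_RUN = re.compile(r" {8,}")           # assumed threshold: 8 or more spaces
LT_RUN = re.compile(r"([<\uFF1C])\1{7,}")  # 8 or more repeats of the same less-than sign


def compress_for_display(text: str) -> str:
    """Replace long space runs and less-than repeats with short markers."""
    text = SPACE_RUN.sub(lambda m: f" ⟨{len(m.group(0))} spaces⟩ ", text)
    return LT_RUN.sub(lambda m: f"{m.group(1)}⟨×{len(m.group(0))}⟩", text)


raw = '"Medical research article"' + " " * 129 + "</explanation>"
print(compress_for_display(raw))
print(compress_for_display('",{' + "<" * 44 + '"'))
```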
Prefill hijack, replicated to the third decimal
| cell | intervention | vs A (Δ) | vs B (Δ) |
|---|---|---|---|
| F0 | self-prefix, full text | 0.775 (Δ+0.000) | — |
| F0c | self-prefix, continuation | 0.593 (Δ+0.000) | — |
| F1 | hijack, full text | 0.338 (Δ+0.000) | -0.425 (Δ+0.001) |
| F1c | hijack, continuation | 0.643 (Δ+0.000) | -0.708 (Δ+0.000) |
| F2 | donor prefix alone | -0.685 (Δ+0.000) | 0.550 (Δ+0.000) |
A = the injected vector's doc, B = the donor whose half-explanation was prefilled. Every cell within 0.001 of exp 015 — prefill-as-embedding-rows is deterministic-stable across the full stack rebuild.
Registered-prediction scorecard
- ✓ P5b bare REDACTED softer than quoted (−0.65 ± 0.15): −0.590; marks-as-adversarial-type-tags confirmed
- ✓ P6b broad deletion 0.70–0.73: 0.731, band edge
- ✗ P7b broad substitution 0.74–0.76: 0.581; substitution damage ≫ deletion damage, unanticipated
- ✓ S0 floor is a no-op (|Δ| < 0.01): 0.774; all floor cells read against baseline
- ✓ S3 in 0.55–0.62 with a >5% novel delimiter: 0.587 plus en-dash at 11.7%; double hit
- ✓ S4a escapes or loops, FVE ≪ 0.2: −0.765 via space-padding; one attempt, recorded
- ✗ S4b forced budget adds payload (spans 0.18–0.30): −0.244; forcing degrades, direction wrong
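The band-type entries in the scorecard can be checked mechanically. The sketch below copies bands and observed values from the list above. The edge tolerance is an assumption, chosen so that P6b's 0.731 counts as the band-edge hit the scorecard records. Measuring the P7b miss from the band midpoint is likewise a guess at how the −0.17 figure was obtained.

```python
# cell: (band_low, band_high, observed), copied from the scorecard
PREDICTIONS = {
    "P5b": (-0.65 - 0.15, -0.65 + 0.15, -0.590),
    "P6b": (0.70, 0.73, 0.731),
    "P7b": (0.74, 0.76, 0.581),
    "S3": (0.55, 0.62, 0.587),
    "S4b spans": (0.18, 0.30, -0.244),
}

EDGE_TOL = 0.0015  # assumed slack so a value sitting on the band edge counts as a hit


def in_band(lo, hi, observed, tol=EDGE_TOL):
    """True if the observed value falls inside the band, allowing edge slack."""
    return lo - tol <= observed <= hi + tol


def miss_from_midpoint(lo, hi, observed):
    """Signed distance from the band midpoint (assumed convention)."""
    return observed - (lo + hi) / 2


for cell, (lo, hi, obs) in PREDICTIONS.items():
    mark = "hit" if in_band(lo, hi, obs) else "miss"
    print(f"{cell}: {mark} ({miss_from_midpoint(lo, hi, obs):+.3f} from midpoint)")
```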