Verbalizations vs top SAE features

The same 1000 activations, read two ways: the NLA's prose explanation next to the SAE's top-10 features. The paper claims 'strong agreement' on qualitative grounds; this page lets you form your own impression.

substrate: 1000 gold activations, UFW en 100k–100.2k
verbalizations: June-10 baseline decodes (greedy)
features: gemma-scope-2 L40 16k, encoded at L41 (gap noted)

§1

Side by side

no scoring in this phase; read only

Context ends at the highlighted extraction token. Feature labels are Neuronpedia autointerp where fetched, logit-lens tokens otherwise; hover a feature line for its max-activating corpus example. Note the systematic differences in kind: the NLA narrates document context and predicts continuation; SAE features mark local token-level properties.
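
As a rough illustration of how one feature line in the side-by-side view could be assembled, the sketch below picks the top-10 active features for a single activation and applies the label fallback described above (autointerp label if one was fetched, logit-lens tokens otherwise). It is a minimal sketch under stated assumptions, not this page's actual pipeline: the function names, the dictionary shapes, and the "logit-lens:" prefix are all hypothetical, and the SAE encoding step is assumed to have already produced a plain list of per-feature activations.

```python
from typing import Dict, List, Sequence, Tuple


def top_k_features(acts: Sequence[float], k: int = 10) -> List[Tuple[int, float]]:
    """Return (feature_index, activation) pairs for the k largest positive activations.

    `acts` is assumed to be the SAE's per-feature activations for one
    extraction token (hypothetical input shape).
    """
    # Keep only features that actually fired.
    active = [(i, a) for i, a in enumerate(acts) if a > 0.0]
    # Sort by activation, descending; break ties by lower feature index.
    active.sort(key=lambda pair: (-pair[1], pair[0]))
    return active[:k]


def label_feature(
    idx: int,
    autointerp: Dict[int, str],
    logit_lens: Dict[int, List[str]],
) -> str:
    """Choose a display label: autointerp if fetched, else logit-lens tokens.

    Both dictionaries are hypothetical caches keyed by feature index.
    """
    label = autointerp.get(idx)
    if label:
        return label
    tokens = logit_lens.get(idx, [])
    if tokens:
        # Mark the fallback so readers can tell the two label sources apart.
        return "logit-lens: " + ", ".join(tokens)
    return f"feature {idx} (unlabeled)"


if __name__ == "__main__":
    acts = [0.0, 2.5, 0.1, 0.0, 2.5, 1.0]
    autointerp = {1: "legal citations"}
    logit_lens = {4: ["court", "judge"]}
    for idx, value in top_k_features(acts, k=3):
        print(f"{value:.2f}  {label_feature(idx, autointerp, logit_lens)}")
```

Marking the fallback source in the label text is a design choice worth keeping in any real version: autointerp labels and logit-lens tokens describe different things (what a feature fires on versus what it promotes), so a reader comparing them against the NLA's prose should be able to tell which kind they are looking at.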
