Natural Language Autoencoders · follow-up series
S3·d
Verbalizations vs top SAE features
The same 1000 activations, read two ways: the NLA's prose explanation next to the SAE's top-10 features. The paper claims 'strong agreement' on qualitative grounds; this page lets you form your own impression.
- substrate: 1000 gold activations, UFW en 100k–100.2k
- verbalizations: June-10 baseline decodes (greedy)
- features: gemma-scope-2 L40 16k, encoded at L41 (gap noted)
§1
Side by side
Context ends at the highlighted extraction token. Feature labels are Neuronpedia autointerp where fetched, logit-lens tokens otherwise; hover a feature line for its max-activating corpus example. Note the systematic differences in kind: the NLA narrates document context and predicts continuation; SAE features mark local token-level properties.
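As a rough illustration of how each feature column could be assembled, the sketch below picks the ten strongest features from one SAE activation vector and applies the label fallback described above (autointerp label where available, logit-lens tokens otherwise). It is not the page's actual code: the function names, the dictionary-based label stores, and the rule that only strictly positive activations count as active are all assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple


def top_k_features(acts: Sequence[float], k: int = 10) -> List[Tuple[int, float]]:
    """Return (feature_index, activation) pairs for the k largest activations.

    Assumption: a feature counts as active only if its activation is > 0.
    """
    active = [(i, a) for i, a in enumerate(acts) if a > 0.0]
    # Largest activation first; ties broken by the lower feature index.
    active.sort(key=lambda pair: (-pair[1], pair[0]))
    return active[:k]


def label_for(
    idx: int,
    autointerp: Dict[int, str],
    logit_lens: Dict[int, List[str]],
) -> str:
    """Prefer an autointerp label; fall back to logit-lens tokens.

    Both dictionaries are hypothetical stand-ins for whatever store the
    page uses; neither mirrors a real Neuronpedia API.
    """
    if idx in autointerp:
        return autointerp[idx]
    tokens = logit_lens.get(idx, [])
    if tokens:
        return "logit lens: " + ", ".join(tokens)
    return f"feature {idx} (unlabeled)"


if __name__ == "__main__":
    # Toy activation vector and toy labels, invented for this example.
    acts = [0.0, 2.0, 0.5, 3.0]
    autointerp = {3: "end of sentence"}
    logit_lens = {1: ["the", "a"]}
    for idx, act in top_k_features(acts, k=10):
        print(f"{idx:>6}  {act:6.2f}  {label_for(idx, autointerp, logit_lens)}")
```

Sorting only the active features keeps inactive ones out of the list when fewer than ten features fire on a token.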