Publication
Measuring and Guiding Monosemanticity
Ruben Härle; Felix Friedrich; Manuel Brack; Stephan Wäldchen; Björn Deiseroth; Patrick Schramowski; Kristian Kersting
In: Computing Research Repository (CoRR), Vol. abs/2506.19382, pp. 1-37, 2025.
Abstract
There is growing interest in leveraging mechanistic interpretability and controllability to better understand and influence the internal dynamics of large language models (LLMs). However, current methods face fundamental challenges in reliably localizing and manipulating feature representations. Sparse Autoencoders (SAEs) have recently emerged as a promising direction for feature extraction at scale, yet they, too, are limited by incomplete feature isolation and unreliable monosemanticity. To systematically quantify these limitations, we introduce the Feature Monosemanticity Score (FMS), a novel metric for quantifying feature monosemanticity in latent representations. Building on these insights, we propose Guided Sparse Autoencoders (G-SAE), a method that conditions latent representations on labeled concepts during training. We demonstrate that reliable localization and disentanglement of target concepts within the latent space improve interpretability, behavior detection, and control. Specifically, our evaluations on toxicity detection, writing style identification, and privacy attribute recognition show that G-SAE not only enhances monosemanticity but also enables more effective and fine-grained steering with less quality degradation. Our findings provide actionable guidelines for measuring and advancing the mechanistic interpretability and control of LLMs.
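The abstract gives no implementation details, but the core G-SAE idea, conditioning a sparse autoencoder's latent code on labeled concepts during training, can be sketched as follows. This is a minimal illustration assuming a PyTorch setup; the class name GuidedSAE, the choice of a binary cross-entropy guidance loss on the first n_concepts latent units, and all loss coefficients are hypothetical stand-ins, not the authors' implementation.

```python
# Minimal, hypothetical sketch of a guided sparse autoencoder (assumed
# PyTorch setup). The first n_concepts latent units are tied to labeled
# concepts via an auxiliary supervised loss; all names and coefficients
# are illustrative, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedSAE(nn.Module):
    def __init__(self, d_model: int, d_latent: int, n_concepts: int):
        super().__init__()
        self.n_concepts = n_concepts            # guided (concept-aligned) latents
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, h: torch.Tensor):
        pre = self.encoder(h)                   # pre-activation latent logits
        z = F.relu(pre)                         # sparse non-negative latent code
        h_hat = self.decoder(z)                 # reconstructed LLM activation
        return pre, z, h_hat

def gsae_loss(model, h, concept_labels, l1_coef=1e-3, guide_coef=1.0):
    """h: (batch, d_model) LLM activations; concept_labels: (batch, n_concepts) in {0, 1}."""
    pre, z, h_hat = model(h)
    recon = F.mse_loss(h_hat, h)                # standard SAE reconstruction loss
    sparsity = z.abs().mean()                   # L1 penalty encouraging sparse codes
    # Guidance term: the designated latents must predict the concept labels,
    # pushing each of those units toward a single concept (monosemanticity).
    guide = F.binary_cross_entropy_with_logits(
        pre[:, :model.n_concepts], concept_labels.float()
    )
    return recon + l1_coef * sparsity + guide_coef * guide
```

Under these assumptions, steering would amount to raising the designated latent unit for a concept and decoding the modified code back into the model's activation space; the FMS metric itself is defined in the paper and is not reproduced here.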
