Huge thanks to @ekdeepl.bsky.social , @thomasfel.bsky.social
and my advisor Demba Ba for all the assistance and contributions to this project!
In vision, SpaDE learns very interesting concepts! On ImageNette, a 10-class subset of ImageNet, for the English Springer class it learns concepts that highlight the ears, muzzle, eye region, neck, paws, etc.!
Do check out the paper: arxiv.org/abs/2503.01822 for more results!
📊 Results on real model activations: Across vision & language tasks, SpaDE recovers more monosemantic features than ReLU, JumpReLU, or TopK SAEs.
It also tiles concepts beautifully.
SpaDE also captures concept heterogeneity, adaptively allocating sparsity levels to different concepts based on their intrinsic dimension, something TopK struggles with.
SpaDE captures nonlinearly separable features better than ReLU, JumpReLU, or TopK SAEs. It also exhibits very interesting local receptive fields!
It tiles concept space more effectively, avoiding cross-concept correlations.
🛠️ Our Solution: SpaDE
We designed SpaDE, a novel SAE that explicitly accounts for nonlinear separability and heterogeneous dimensionality. SpaDE projects distances onto the probability simplex.
It recovers previously hidden concepts that standard SAEs completely miss!
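To make the "projects distances onto the probability simplex" idea concrete, here is a minimal sketch of a SpaDE-style encoder. This is an illustration under assumptions, not the paper's implementation: I assume the encoder scores each dictionary atom by a (negative, scaled) squared distance to the input and then applies the standard Euclidean projection onto the probability simplex; the names `spade_like_encode` and the temperature `beta` are hypothetical.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex
    {p : p >= 0, sum(p) = 1}, via the standard sort-and-threshold
    algorithm. Entries far below the threshold are zeroed out,
    which is where the sparsity comes from."""
    u = np.sort(v)[::-1]                      # sort descending
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)    # shift that enforces sum = 1
    return np.maximum(v - theta, 0.0)

def spade_like_encode(x, D, beta=1.0):
    """Hypothetical SpaDE-style encoder (a sketch, not the paper's code):
    score each dictionary atom (row of D) by its negative squared
    distance to x, then project the score vector onto the simplex.
    Nearby atoms get positive weight; distant atoms are zeroed, so the
    number of active atoms adapts to the local geometry of x."""
    dists = np.sum((D - x) ** 2, axis=1)      # squared distance to each atom
    return project_simplex(-beta * dists)
```

Because the projection (unlike a softmax) produces exact zeros, the support size varies per input, which is one plausible reading of the thread's claim about adaptive sparsity.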
🔬 Testing the Assumptions: We analyzed SAEs across different settings, from toy models to real-world neural activations.
Result? SAEs fail when concepts are not linearly separable (ReLU, JumpReLU) or are heterogeneous in dimension (TopK).
The Big Idea: SAE encoders impose constraints on the solution to the dictionary learning problem, and those constraints amount to assumptions about concepts.
SAE encoders are linear transformations followed by orthogonal projections onto different sets; the choice of set dictates the encoder's receptive fields, and hence the assumptions it makes about concepts.
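The "linear map + orthogonal projection" view can be sketched for the two standard cases. This is an illustrative reading of that sentence, not code from the paper: the ReLU encoder projects onto the nonnegative orthant, and the TopK encoder projects onto the set of k-sparse vectors (here keeping the k largest pre-activations, one common convention).

```python
import numpy as np

def relu_encode(x, W, b):
    """ReLU SAE encoder: linear map W @ x + b, followed by the
    orthogonal projection onto the nonnegative orthant
    (elementwise max with zero)."""
    return np.maximum(W @ x + b, 0.0)

def topk_encode(x, W, b, k):
    """TopK SAE encoder: same linear map, followed by projection
    onto the set of k-sparse vectors; keep the k largest entries
    and zero out the rest."""
    z = W @ x + b
    out = np.zeros_like(z)
    idx = np.argsort(z)[-k:]   # indices of the k largest pre-activations
    out[idx] = z[idx]
    return out
```

The only difference between the two encoders is the set being projected onto, which is exactly why each architecture bakes in a different assumption about what concept geometries it can represent.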
New preprint alert!
Do Sparse Autoencoders (SAEs) reveal all concepts a model relies on? Or do they impose hidden biases that shape what we can even detect?
We uncover a fundamental duality between SAE architectures and concepts they can recover.
Link: arxiv.org/abs/2503.01822