#SparseAutoencoders

5 months ago

Interpretability vs Utility in Sparse Autoencoders for LLM Steering

Across 90 SAEs on three LLMs, interpretability and steering utility showed only a modest rank correlation (Kendall's tau‑b ≈ 0.298), and Delta Token Confidence boosted performance by ~52.5%. Read more: getnews.me/interpretability-vs-util... #sparseautoencoders #llmsteering
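For readers unfamiliar with the statistic quoted above, here is a minimal sketch of Kendall's tau‑b, the tie‑aware rank correlation. The function name `kendall_tau_b` and the toy scores are illustrative assumptions and are not taken from the study.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length score lists (O(n^2) pair scan)."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both rankings: excluded from every count
            elif dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = sqrt((concordant + discordant + ties_x_only)
                 * (concordant + discordant + ties_y_only))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical scores for four SAEs (made up for illustration only).
interpretability = [1, 2, 3, 4]
steering_utility = [1, 3, 2, 4]
tau = kendall_tau_b(interpretability, steering_utility)
```

A value near 1 would mean the two rankings agree almost perfectly; a value near 0.3, as reported in the post, indicates only weak agreement.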


5 months ago

Sparse Autoencoders Reveal Biases in GPT Model Trained on Austen

Sparse autoencoders applied to a GPT‑style model trained only on Jane Austen’s novels revealed neurons linked to gender, class and duty, highlighting the model's biases. Read more: getnews.me/sparse-autoencoders-reve... #gpt #sparseautoencoders #austen


5 months ago

Sparse Autoencoder Top‑1 Steering Improves Language Model Reasoning

Researchers showed that steering a language model with the top‑1 latent from a sparse autoencoder and signal improves math reasoning, matching MAD baseline. 24 Sep 2025. Read more: getnews.me/sparse-autoencoder-top-1... #sparseautoencoders #reasoning


5 months ago

AbsTopK Improves Sparse Autoencoders for Bidirectional Features

AbsTopK, a sparse autoencoder that keeps both positive and negative activations, lets one unit capture opposite concepts. Posted on arXiv (2510.00404) Oct 2025. Read more: getnews.me/abstopk-improves-sparse-... #abstopk #sparseautoencoders
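A minimal sketch of the idea, assuming AbsTopK means keeping the k latents with the largest magnitude while preserving their sign; the paper's exact formulation may differ. The names `abs_top_k` and `relu_top_k` and the sample values are hypothetical.

```python
def abs_top_k(pre_activations, k):
    """Keep the k largest-magnitude entries (sign preserved); zero the rest."""
    if k <= 0:
        return [0.0] * len(pre_activations)
    # Rank indices by descending absolute value.
    ranked = sorted(range(len(pre_activations)),
                    key=lambda i: abs(pre_activations[i]),
                    reverse=True)
    keep = set(ranked[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre_activations)]


def relu_top_k(pre_activations, k):
    """Non-negative TopK for comparison: negatives are clipped before selection."""
    clipped = [max(v, 0.0) for v in pre_activations]
    return abs_top_k(clipped, k)


latents = [0.2, -3.1, 1.5, -0.4, 2.0]
sparse_signed = abs_top_k(latents, 2)   # the strong negative latent survives
sparse_relu = relu_top_k(latents, 2)    # the strong negative latent is discarded
```

With signed selection, a single unit can fire positively for one concept and negatively for its opposite, which is the behavior the post describes.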


5 months ago

OrtSAE: Orthogonal Sparse Autoencoders Boost Feature Discovery

OrtSAE finds about 9% more distinct features and reduces feature absorption by roughly 65% in tests released September 2025, while adding only minimal computational overhead. Read more: getnews.me/ortsae-orthogonal-sparse... #ortsae #sparseautoencoders


5 months ago

Safe‑SAIL framework maps safety risks in large language models

Researchers introduced Safe‑SAIL, a framework using Sparse Autoencoders to locate safety‑related neurons in large language models, released as a public audit toolkit. Read more: getnews.me/safe-sail-framework-maps... #safesail #sparseautoencoders


5 months ago

Sparse Autoencoders Improve Vision Model Interpretability

Sparse autoencoders applied to vision models produce meaningful features, improve out‑of‑distribution performance and enable semantic steering of diffusion image generation. Read more: getnews.me/sparse-autoencoders-impr... #sparseautoencoders #visionai
