Feature Steering with RL: A Transparent Method for Aligning LLMs
FSRL uses a lightweight adapter with a sparse autoencoder to steer LLM behavior and matches RLHF performance on standard preference benchmarks. Read more: getnews.me/feature-steering-with-rl... #featuresteering #rlhf #llmalignment