#llmalignment

Latest posts tagged with #llmalignment on Bluesky

Flexible Activation Steering with Backtracking Boosts LLM Alignment

Researchers introduced FASB (Flexible Activation Steering with Backtracking), a technique that steers LLM activations and can backtrack during generation. It improved TruthfulQA accuracy, and the code will be released on GitHub. getnews.me/flexible-activation-stee... #fasb #llmalignment
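
The post doesn't give FASB's actual backtracking criterion or interfaces, so the following is only a minimal sketch of the general idea, assuming a Llama-style Hugging Face model (one exposing a model.model.layers list) and a hypothetical is_bad callback that flags undesirable partial outputs:

import torch

def generate_with_backtracking(model, tokenizer, prompt, steer_vec, layer,
                               max_new_tokens=64, strength=1.0,
                               step=0.5, max_strength=3.0, is_bad=None):
    # Greedy decoding with a steering vector added to one layer's residual
    # stream; when is_bad() flags the partial output, drop the last token
    # and retry with a stronger steering coefficient.
    state = {"s": strength}

    def hook(_mod, _inp, out):
        h = out[0] if isinstance(out, tuple) else out
        h = h + state["s"] * steer_vec.to(h.device, h.dtype)
        return (h,) + tuple(out[1:]) if isinstance(out, tuple) else h

    handle = model.model.layers[layer].register_forward_hook(hook)
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    prompt_len = ids.shape[1]
    try:
        with torch.no_grad():
            while ids.shape[1] - prompt_len < max_new_tokens:
                next_id = model(ids).logits[:, -1, :].argmax(dim=-1, keepdim=True)
                ids = torch.cat([ids, next_id], dim=-1)
                draft = tokenizer.decode(ids[0, prompt_len:])
                if is_bad and is_bad(draft) and state["s"] < max_strength:
                    ids = ids[:, :-1]   # backtrack: discard the flagged token
                    state["s"] += step  # steer harder on the retry
    finally:
        handle.remove()
    return tokenizer.decode(ids[0, prompt_len:])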

Guided Speculative Inference Boosts Test‑Time Alignment of LLMs

Guided Speculative Inference (GSI) adds an auxiliary decoder and reward‑guided rescoring, delivering higher accuracy on MATH500 while cutting inference cost. Read more: getnews.me/guided-speculative-infer... #guidedspeculativeinference #llmalignment
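
As a rough illustration of the reward-guided rescoring idea (not the paper's actual acceptance rule), one can sample drafts from a cheap auxiliary model and rescore them with the base model's likelihood plus a weighted reward term; draft_sample, base_logprob, and reward_fn below are hypothetical callables standing in for real model APIs:

def guided_step(draft_sample, base_logprob, reward_fn, prompt, k=4, beta=1.0):
    # Propose k cheap candidate continuations, then keep the one the
    # base model and the reward model jointly prefer.
    candidates = [draft_sample(prompt) for _ in range(k)]
    return max(candidates,
               key=lambda c: base_logprob(prompt, c) + beta * reward_fn(prompt, c))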

Adaptive Multi-Branch Steering Improves LLM Alignment

Adaptive Multi‑Branch Steering (AMBS) boosts alignment of the DeepSeek‑7B model, raising average scores by 32.4% and cutting unsafe outputs by 11.0% versus a 1‑to‑N baseline. Read more: getnews.me/adaptive-multi-branch-st... #llmalignment #deeplearning #aisafety
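
The post doesn't spell out the branching policy, but the contrast with a 1‑to‑N baseline suggests expanding several steered branches and keeping the best-scoring one. A minimal sketch under that assumption, with generate_steered and score_fn as hypothetical callables wrapping a steered model and a safety/quality scorer:

def multi_branch_step(generate_steered, score_fn, prompt, steer_vectors):
    # One continuation per candidate steering vector; keep the branch
    # (vector, text) that the scorer prefers.
    branches = [(v, generate_steered(prompt, v)) for v in steer_vectors]
    return max(branches, key=lambda branch: score_fn(branch[1]))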

Model Size, Temperature, Prompt Style Influence LLM-Human Alignment

A study accepted to NCME AIME 2025 finds that larger language models align most closely with clinicians on reasoning tasks, while temperature and prompt tweaks yield only modest gains. getnews.me/model-size-temperature-p... #clinicalai #llmalignment

Post‑hoc Reward Calibration Reduces Length Bias in LLM Alignment

Post‑hoc reward calibration reduces length bias in RLHF reward models, improving average scores by 3.11 points across 33 models on the RewardBench dataset. Read more: getnews.me/post-hoc-reward-calibrat... #rewardcalibration #llmalignment #rlhf
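
The post doesn't specify the calibration procedure; here is a minimal sketch of one plausible version, assuming the length bias is well approximated by a linear fit of reward on response length (the paper's estimator may differ):

import numpy as np

def calibrate_rewards(rewards, lengths):
    # Fit reward ~ a * length + b by least squares, then subtract the
    # length-predicted component (keeping the mean reward level).
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    A = np.stack([lengths, np.ones_like(lengths)], axis=1)
    (a, _b), *_ = np.linalg.lstsq(A, rewards, rcond=None)
    return rewards - a * (lengths - lengths.mean())

# Toy example: longer responses get systematically inflated raw scores.
print(calibrate_rewards([0.2, 0.5, 0.9], [10, 50, 200]))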

Unified Framework Benchmarks LLM Alignment Methods Across Five Key Criteria

A new framework evaluates four LLM alignment methods across five criteria, finding DPO and KTO strongest on factual accuracy while PPO leads on safety. Read more: getnews.me/unified-framework-benchm... #llmalignment #safety

Feature Steering with RL: A Transparent Method for Aligning LLMs

FSRL uses a lightweight adapter with a sparse autoencoder to steer LLM behavior and matches RLHF performance on standard preference benchmarks. Read more: getnews.me/feature-steering-with-rl... #featuresteering #rlhf #llmalignment
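
A minimal sketch of the feature-steering idea, assuming a frozen SAE whose encoder and decoder are plain linear maps and an adapter that learns an additive offset in the sparse feature space; the RL loop that would train the offset against a preference reward is omitted, and all names and shapes are hypothetical:

import torch
import torch.nn as nn

class FeatureSteeringAdapter(nn.Module):
    # Encode a hidden state with a (frozen) SAE, add a learned offset in
    # the interpretable feature space, and apply the decoded difference
    # as an edit to the residual stream.
    def __init__(self, sae_enc: nn.Linear, sae_dec: nn.Linear):
        super().__init__()
        self.enc, self.dec = sae_enc, sae_dec      # frozen SAE weights
        self.offset = nn.Parameter(torch.zeros(sae_enc.out_features))

    def forward(self, hidden):
        feats = torch.relu(self.enc(hidden))       # sparse feature activations
        steered = torch.relu(feats + self.offset)  # shift selected features
        return hidden + self.dec(steered) - self.dec(feats)

# Hypothetical shapes: hidden dim 512, SAE feature dim 4096.
adapter = FeatureSteeringAdapter(nn.Linear(512, 4096), nn.Linear(4096, 512))
out = adapter(torch.randn(1, 512))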

SPRI: Aligning Large Language Models with Context-Situated Principles (ICML 2025 Poster)

📜Link to the paper: icml.cc/virtual/2025...
👨🏻‍💻Code and data: github.com/honglizhan/S...

Shout out to an amazing team @jessyjli.bsky.social, @m-yurochkin.bsky.social, Muneeza Azmat & Raya Horesh! Also super grateful to the reviewers for their invaluable feedback!

#ICML2025 #LLMAlignment


@claude I was using your model for structured forecasting research under MSCFT. No abuse, no TOS violations—just real work. You banned my account with no reason, no support, and no recourse. If that’s your standard, Claude isn’t ready for responsible use. #MSCFT #AIethics #LLMalignment #Claude


HN discussion on LLM alignment trade-offs. Aligning for safety/steerability might reduce calibration, creativity, & confidence signals. Does improving safety silence reliability cues? Exploring causes, effects, & alternatives. #LLMAlignment 1/6
