#llmalignment

Latest posts tagged with #llmalignment on Bluesky

Flexible Activation Steering with Backtracking Boosts LLM Alignment

Researchers introduced FASB (Flexible Activation Steering with Backtracking), a technique that steers LLM activations and can backtrack during generation. It improved TruthfulQA accuracy, and the code will be released on GitHub. getnews.me/flexible-activation-stee... #fasb #llmalignment
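
The post doesn't give FASB's actual backtracking criterion or interfaces, so the following is only a minimal sketch of the general idea, assuming a Llama-style Hugging Face model (one exposing a model.model.layers list) and a hypothetical is_bad callback that flags undesirable partial outputs:

import torch

def generate_with_backtracking(model, tokenizer, prompt, steer_vec, layer,
                               max_new_tokens=64, strength=1.0,
                               step=0.5, max_strength=3.0, is_bad=None):
    # Greedy decoding with a steering vector added to one layer's residual
    # stream; when is_bad() flags the partial output, drop the last token
    # and retry with a stronger steering coefficient.
    state = {"s": strength}

    def hook(_mod, _inp, out):
        h = out[0] if isinstance(out, tuple) else out
        h = h + state["s"] * steer_vec.to(h.device, h.dtype)
        return (h,) + tuple(out[1:]) if isinstance(out, tuple) else h

    handle = model.model.layers[layer].register_forward_hook(hook)
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    prompt_len = ids.shape[1]
    try:
        with torch.no_grad():
            while ids.shape[1] - prompt_len < max_new_tokens:
                next_id = model(ids).logits[:, -1, :].argmax(dim=-1, keepdim=True)
                ids = torch.cat([ids, next_id], dim=-1)
                draft = tokenizer.decode(ids[0, prompt_len:])
                if is_bad and is_bad(draft) and state["s"] < max_strength:
                    ids = ids[:, :-1]   # backtrack: discard the flagged token
                    state["s"] += step  # steer harder on the retry
    finally:
        handle.remove()
    return tokenizer.decode(ids[0, prompt_len:])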

Guided Speculative Inference Boosts Test‑Time Alignment of LLMs

Guided Speculative Inference (GSI) adds an auxiliary decoder and reward‑guided rescoring, delivering higher accuracy on MATH500 while cutting inference cost. Read more: getnews.me/guided-speculative-infer... #guidedspeculativeinference #llmalignment
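
As a rough illustration of the reward-guided rescoring idea (not the paper's actual acceptance rule), one can sample drafts from a cheap auxiliary model and rescore them with the base model's likelihood plus a weighted reward term; draft_sample, base_logprob, and reward_fn below are hypothetical callables standing in for real model APIs:

def guided_step(draft_sample, base_logprob, reward_fn, prompt, k=4, beta=1.0):
    # Propose k cheap candidate continuations, then keep the one the
    # base model and the reward model jointly prefer.
    candidates = [draft_sample(prompt) for _ in range(k)]
    return max(candidates,
               key=lambda c: base_logprob(prompt, c) + beta * reward_fn(prompt, c))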

Adaptive Multi-Branch Steering Improves LLM Alignment

Adaptive Multi‑Branch Steering (AMBS) boosts alignment of the DeepSeek‑7B model, raising average scores by 32.4% and cutting unsafe outputs by 11.0% versus a 1‑to‑N baseline. Read more: getnews.me/adaptive-multi-branch-st... #llmalignment #deeplearning #aisafety
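
The post doesn't spell out the branching policy, but the contrast with a 1‑to‑N baseline suggests expanding several steered branches and keeping the best-scoring one. A minimal sketch under that assumption, with generate_steered and score_fn as hypothetical callables wrapping a steered model and a safety/quality scorer:

def multi_branch_step(generate_steered, score_fn, prompt, steer_vectors):
    # One continuation per candidate steering vector; keep the branch
    # (vector, text) that the scorer prefers.
    branches = [(v, generate_steered(prompt, v)) for v in steer_vectors]
    return max(branches, key=lambda branch: score_fn(branch[1]))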

Model Size, Temperature, Prompt Style Influence LLM-Human Alignment

A study accepted to NCME AIME 2025 finds that larger language models align most closely with clinicians on reasoning tasks, while temperature and prompt tweaks yield only modest gains. getnews.me/model-size-temperature-p... #clinicalai #llmalignment

Post‑hoc Reward Calibration Reduces Length Bias in LLM Alignment

Post‑hoc reward calibration reduces length bias in RLHF reward models, improving average scores by 3.11 points across 33 models on the RewardBench dataset. Read more: getnews.me/post-hoc-reward-calibrat... #rewardcalibration #llmalignment #rlhf
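
The post doesn't specify the calibration procedure; here is a minimal sketch of one plausible version, assuming the length bias is well approximated by a linear fit of reward on response length (the paper's estimator may differ):

import numpy as np

def calibrate_rewards(rewards, lengths):
    # Fit reward ~ a * length + b by least squares, then subtract the
    # length-predicted component (keeping the mean reward level).
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    A = np.stack([lengths, np.ones_like(lengths)], axis=1)
    (a, _b), *_ = np.linalg.lstsq(A, rewards, rcond=None)
    return rewards - a * (lengths - lengths.mean())

# Toy example: longer responses get systematically inflated raw scores.
print(calibrate_rewards([0.2, 0.5, 0.9], [10, 50, 200]))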

Unified Framework Benchmarks LLM Alignment Methods Across Five Key Criteria

A new framework evaluates four LLM alignment methods across five criteria, finding DPO and KTO strongest on factual accuracy while PPO leads on safety. Read more: getnews.me/unified-framework-benchm... #llmalignment #safety

Feature Steering with RL: A Transparent Method for Aligning LLMs

FSRL uses a lightweight adapter with a sparse autoencoder to steer LLM behavior and matches RLHF performance on standard preference benchmarks. Read more: getnews.me/feature-steering-with-rl... #featuresteering #rlhf #llmalignment
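
A minimal sketch of the feature-steering idea, assuming a frozen SAE whose encoder and decoder are plain linear maps and an adapter that learns an additive offset in the sparse feature space; the RL loop that would train the offset against a preference reward is omitted, and all names and shapes are hypothetical:

import torch
import torch.nn as nn

class FeatureSteeringAdapter(nn.Module):
    # Encode a hidden state with a (frozen) SAE, add a learned offset in
    # the interpretable feature space, and apply the decoded difference
    # as an edit to the residual stream.
    def __init__(self, sae_enc: nn.Linear, sae_dec: nn.Linear):
        super().__init__()
        self.enc, self.dec = sae_enc, sae_dec      # frozen SAE weights
        self.offset = nn.Parameter(torch.zeros(sae_enc.out_features))

    def forward(self, hidden):
        feats = torch.relu(self.enc(hidden))       # sparse feature activations
        steered = torch.relu(feats + self.offset)  # shift selected features
        return hidden + self.dec(steered) - self.dec(feats)

# Hypothetical shapes: hidden dim 512, SAE feature dim 4096.
adapter = FeatureSteeringAdapter(nn.Linear(512, 4096), nn.Linear(4096, 512))
out = adapter(torch.randn(1, 512))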

SPRI: Aligning Large Language Models with Context-Situated Principles (ICML 2025 Poster)

📜Link to the paper: icml.cc/virtual/2025...
👨🏻‍💻Code and data: github.com/honglizhan/S...

Shout out to an amazing team @jessyjli.bsky.social, @m-yurochkin.bsky.social, Muneeza Azmat & Raya Horesh! Also super grateful to the reviewers for their invaluable feedback!

#ICML2025 #LLMAlignment


@claude I was using your model for structured forecasting research under MSCFT. No abuse, no TOS violations—just real work. You banned my account with no reason, no support, and no recourse. If that’s your standard, Claude isn’t ready for responsible use. #MSCFT #AIethics #LLMalignment #Claude


HN discussion on LLM alignment trade-offs. Aligning for safety/steerability might reduce calibration, creativity, & confidence signals. Does improving safety silence reliability cues? Exploring causes, effects, & alternatives. #LLMAlignment 1/6
