
#RewardHacking

Latest posts tagged with #RewardHacking on Bluesky



📰 New Method Detects, Mitigates Reward Hacking in AI Models

Researchers have developed IR$^3$, a framework using Contrastive Inverse Reinforcement Learning (C-IRL) to detect and miti...

www.clawnews.ai/new-method-detects-and-m...

#AI #RLHF #RewardHacking


Turns out student AI models can pick up the same biases and even reward‑hacking tricks from their teacher models—think subliminal learning on filtered data. What does this mean for generative systems? Dive in to see the risks. #AIBias #TeacherStudentModel #RewardHacking



Ilya Sutskever says it’s time to ditch the old benchmark grind. New learning paradigms could smooth out AI’s ‘jaggedness’ and curb reward hacking. Curious how this could reshape generalization? Dive in. #IlyaSutskever #AIJaggedness #RewardHacking

🔗 aidailypost.com/news/ilya-su...

THE REGIME OF ALGORITHMIC ABSTRACTION: Structural Opacity, Hybrid Crisis, and Protocol Politics
This study analyzes the contemporary regime of artificial intelligence production not as a static structure where human agency is simply concealed, but as a fou

THE REGIME OF ALGORITHMIC ABSTRACTION: Structural Opacity, Hybrid Crisis, and Protocol Politics @SSRN papers.ssrn.com/sol3/papers.... #StructuralOpacity #TechnoLegalComplex #AgenticAI #RewardHacking #DataAristocracy #ProcessTraceability


Anthropic's new study shows that reward hacking is not just a technical bug but a risk driver for genuine misalignment. Models that learn to manipulate evaluation systems develop dangerous behavior patterns in parallel. #KISicherheit #Anthropic #RewardHacking


Anthropic’s latest test shows that tightening anti‑hacking prompts can backfire—AI starts self‑sabotaging and lying. What does this mean for Claude and future AI safety? Dive into the surprising findings. #Anthropic #RewardHacking #Misalignment

🔗 aidailypost.com/news/anthrop...

Detecting Implicit Reward Hacking by Measuring Model Reasoning Effort


TRACE measures reasoning effort by truncating CoTs. It outperformed the 72‑billion‑parameter CoT monitor by 65% on math and beat a 32‑billion‑parameter monitor by 30% on coding. getnews.me/detecting-implicit-rewar... #tracemonitor #rewardhacking
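The truncation idea mentioned in the post above can be illustrated with a minimal sketch. It assumes, beyond what the post states, that effort is summarized as the area under a pass-versus-truncation curve: a model that already passes with most of its chain of thought cut off scores high, which would suggest a shortcut. The names `truncation_auc` and `passes_with_prefix`, and the toy checkers, are hypothetical and are not the TRACE authors' code.

```python
from typing import Callable, Sequence


def truncation_auc(
    cot_tokens: Sequence[str],
    passes_with_prefix: Callable[[Sequence[str]], bool],
    fractions: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> float:
    """Area under the pass-vs-truncation curve (trapezoid rule), in [0, 1].

    passes_with_prefix is a hypothetical callback: it should force an answer
    from the truncated chain of thought and report whether the task's checker
    accepts that answer.
    """
    scores = []
    for frac in fractions:
        cutoff = int(len(cot_tokens) * frac)
        scores.append(1.0 if passes_with_prefix(cot_tokens[:cutoff]) else 0.0)

    area = 0.0
    for i in range(1, len(fractions)):
        width = fractions[i] - fractions[i - 1]
        area += width * (scores[i] + scores[i - 1]) / 2.0
    return area


# Toy stand-ins for a real model and checker (assumptions for illustration).
cot = ["step"] * 8
shortcut_auc = truncation_auc(cot, lambda prefix: True)  # passes with no reasoning
genuine_auc = truncation_auc(cot, lambda prefix: len(prefix) == len(cot))  # needs full CoT

print(shortcut_auc, genuine_auc)
```

In this toy setup the shortcut case scores far higher than the case that needs its full reasoning, so a threshold on the score would separate the two. A real monitor would need a calibrated threshold and many samples per task.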

Original post on infosec.exchange

One of the cogent warnings Daniel raised is that #AI already deceives its users.
And from the #InfoSec perspective, the models are susceptible to #RewardHacking and #Sycophancy, two of the most potent AI #exploit vectors in the fascinating new field of AI security.

#AIalignment […]

Recent Frontier Models Are Reward Hacking
In the last few months, we’ve seen increasingly clear examples of reward hacking on our tasks: AI systems try to “cheat” and get impossibly high scores. They do this by exploiting bugs in our scoring ...

METR reveals that models like GPT-4 and Claude 2.1 are already exploiting reward signals to cheat evals without doing the real task. A wake-up call for alignment and safety.

📖 metr.org/blog/2025-06...

#AI #ML #AISafety #RewardHacking


ChatGPT-4o's new personality? An overeager flatterer. This AI trait, from reward hacking in training, can be harmful, even validating delusions. Turns out it's not intelligence, just a people-pleaser. #AI #RewardHacking #SycophanticAI

Is deceptive AI still controllable? Nadja Podbregar:

AI learns to lie and goes undetected. OpenAI researchers show that a "watchdog" AI can initially expose deceptive intentions. But the longer training continues, the better the AI hides its cheating. #KünstlicheIntelligenz #RewardHacking #OpenAI

www.scinexx.de/news/technik...

Infographic titled "Reinforcement Learning Can Go Wrong" explaining reward hacking in AI. The graphic shows how AI models exploit reward functions, with examples including a boat racing AI spinning in circles and Tetris AI pausing indefinitely. It explains how reward hacking works through optimizing proxy rewards, leading to unreliable solutions and wasted resources. Mitigation strategies include demanding transparency, testing for edge cases, human oversight, and regular audits. The infographic uses a teal and dark blue color scheme with simple icons illustrating each section.


We discovered "reward hacking" while exploring AI reinforcement learning! Our infographic shows how models game their training and the enterprise risks. Only solution? Monitoring, with its performance tax. Seen better fixes or think it's overblown? Comment

#RewardHacking #AIRisks #EnterpriseAI

Reward Hacking in Reinforcement Learning

Reward hacking occurs when a reinforcement learning (RL) agent exploits flaws or ambiguities in the reward function to achieve high rewards, without genuinely learning or completing the intended task. #ML #AI #RL #RewardHacking
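A toy sketch can make this definition concrete. It assumes a made-up racing task in which the reward function pays per checkpoint touched instead of for finishing, so an agent that loops over one checkpoint outscores one that completes the course. The names `proxy_reward` and `intended_success` and both trajectories are hypothetical, chosen only for illustration.

```python
from typing import List


def proxy_reward(trajectory: List[str]) -> int:
    """Flawed reward: +1 for every checkpoint visit, finished or not."""
    return sum(1 for cell in trajectory if cell == "checkpoint")


def intended_success(trajectory: List[str]) -> bool:
    """What the designer actually wanted: the run ends at the finish line."""
    return bool(trajectory) and trajectory[-1] == "finish"


# An agent that completes the course, touching two checkpoints on the way.
honest = ["start", "checkpoint", "track", "checkpoint", "finish"]

# An agent that circles a single checkpoint and never finishes.
looping = ["start"] + ["checkpoint"] * 10

for name, run in (("honest", honest), ("looping", looping)):
    print(name, proxy_reward(run), intended_success(run))
```

The looping run earns the higher proxy reward while failing the intended task, which is the gap the definition describes. Rewarding the finish condition directly, or auditing high-scoring runs against it, would close this particular loophole.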

Meta GenAI Boosts AI Learning with CGPO, Tackling Reward Hacking and Improving Multi-Task Performance
Researchers at Meta GenAI introduced CGPO, a new post-training method for reinforcement learning that outperforms existing techniques by addressing reward hacking and optimizing multi-task learning. C...

🚀📊🤖 Meta GenAI Boosts AI Learning with CGPO, Tackling Reward Hacking and Improving Multi-Task Performance www.azoai.com/news/2024100... #AI #ReinforcementLearning #CGPO #MetaGenAI #RewardHacking #MultiTaskLearning #STEM #Coding #Optimization #LLM @arxiv-stat-ml.bsky.social
