#RLVR

1 month ago

RLVR claims it can boost sampling efficiency, but the real win is still the base LLM’s reasoning trajectory. Dive into the NeurIPS 2025 findings on teacher distillation vs. architectural tweaks. Curious? #RLVR #SamplingEfficiency #LLMReasoning

🔗 aidailypost.com/news/rlvr-li...

Meng Li

@mengli512.bsky.social

2 months ago

Karpathy’s 2025 Viral Wrap: AI’s 6 Make-or-Break Moments

Karpathy 2025 AI recap: 6 game-changers from RLVR to vibe coding, and why models got spiky and coding went free.

Karpathy 2025 wrap: RLVR turns LLMs into spiky “ghosts,” Cursor & Claude Code thicken the app layer, Vibe Coding kills syntax, nano-banana GUI next—what’s the first product you’ll toss code at?
#Karpathy #RLVR #Cursor #Claude #VibeCoding
open.substack.com/pub/aidisrup...

Gerrit Eicker

@eicker.bsky.social

2 months ago

2025 saw significant advancements in #LLMs, with #ReinforcementLearning from #VerifiableRewards (#RLVR) emerging as a key stage in training, leading to improved #reasoning capabilities. The industry also began to understand the unique “jagged” intelligence of LLMs, excelling in specific domains but…

AI Daily Post

@aidailypost.com

4 months ago

New Tsinghua study shows reasoning LLMs run faster but don’t out‑perform on tough tasks. Efficiency up, capability flat—what does this mean for RLVR and chain‑of‑thought tricks? Dive in for the data. #LLM #ChainOfThought #RLVR

🔗 aidailypost.com/news/study-f...

GetNews.me

@getnews-me.bsky.social

5 months ago

Chain-of-Thought Strategies Boost Steerable Pluralistic AI Alignment

RLVR outperformed other chain‑of‑thought methods on the Value Kaleidoscope and OpinionQA benchmarks, achieving higher alignment with fewer training examples. getnews.me/chain-of-thought-strateg... #rlvr #chainofthought

GetNews.me

@getnews-me.bsky.social

5 months ago

RLVR Training Shows Shrinkage and Expansion of LLM Reasoning

RLVR training can first tighten, then broaden LLM reasoning via an early exploitation stage and a later exploration stage. The study was submitted on 5 Oct 2025 and classified under cs.LG and cs.AI. getnews.me/rlvr-training-shows-shri... #rlvr #llm

GetNews.me

@getnews-me.bsky.social

5 months ago

RLVR Improves Korean Word‑Chain Game with Curriculum Learning

RLVR combines reinforcement learning with verifiable rewards; adding curriculum learning produced longer Korean word-chain sequences and reduced contradictory feedback, according to a study posted 3 Oct 2025. Read more: getnews.me/rlvr-improves-korean-wor... #rlvr #koreanwordchain

GetNews.me

@getnews-me.bsky.social

5 months ago

Length‑Aware Sampling Boosts Policy Optimization for LLM Reasoning

Length-aware Sampling for Policy Optimization (LSPO) is a meta-RLVR method that uses response length to curb overthinking, cutting token count. The pre-print was submitted on 1 Oct 2025. getnews.me/length-aware-sampling-bo... #lspo #rlvr

GetNews.me

@getnews-me.bsky.social

5 months ago

DeepSearch adds Monte Carlo Tree Search to RL for LLM reasoning

DeepSearch adds Monte Carlo Tree Search to RL with verifiable rewards, raising a 1.5 B LLM to 62.95% accuracy on math benchmarks while using ~5.7× fewer GPU hours. Read more: getnews.me/deepsearch-adds-monte-ca... #deepsearch #mcts #rlvr

GetNews.me

@getnews-me.bsky.social

5 months ago

Hidden-State Method Improves LLM Reasoning in RLVR

Velocity-Exploiting Rank-Learning (VERL) uses three hidden-state metrics (Effective Rank, Velocity, and Acceleration) to guide RL, achieving up to a 21.4% accuracy gain on the Gaokao 2024 benchmark. Read more: getnews.me/hidden-state-method-impr... #rlvr #verl #gaokao2024

GetNews.me

@getnews-me.bsky.social

5 months ago

Down‑Sampling Rollouts Boost Efficiency in LLM Reinforcement Learning

PODS (Policy Optimization with Down‑Sampling) cuts RLVR training time by at least 1.7× while matching vanilla GRPO’s peak test accuracy, by selecting a high‑variance subset of rollouts. Read more: getnews.me/down-sampling-rollouts-b... #pods #rlvr

GetNews.me

@getnews-me.bsky.social

5 months ago

Study Shows RLVR May Not Expand Reasoning Beyond Base Model

A new study shows RLVR fine‑tuning improves pass@1 scores but shrinks the empirical support set, limiting novel correct answers. Token‑level entropy rose while answer‑level entropy fell. Read more: getnews.me/study-shows-rlvr-may-not... #rlvr #llm #finetuning

GetNews.me

@getnews-me.bsky.social

5 months ago

Hidden Costs and Evaluation Gaps in RL with Verifiable Rewards

A study of RL with verifiable rewards (RLVR) finds an implicit “RLVR tax” from stricter rewards, noting evaluation gaps and prompt contamination that can inflate gains. getnews.me/hidden-costs-and-evaluat... #rlvr #machinelearning #ai

GetNews.me

@getnews-me.bsky.social

5 months ago

Zero-Variance Prompts Boost LLM Reinforcement Learning Performance

RL‑ZVP lifted accuracy by 8.61 pp and pass rate by 7.77 pp on six math‑reasoning benchmarks. It uses entropy‑guided advantage shaping to weight uncertainty tokens from zero‑variance prompts. getnews.me/zero-variance-prompts-bo... #rlvr #llmtraining

GetNews.me

@getnews-me.bsky.social

5 months ago

RLVR Boosts SQL Reasoning Model to State‑of‑the‑Art Accuracy

The RLVR reinforcement‑learning framework hit 73.56% accuracy on the BIRD private test set, rising to 75.68% with self‑consistency, per a September 2025 paper. Read more: getnews.me/rlvr-boosts-sql-reasonin... #rlvr #sql #bird

Santi Garcia

@santigarcia.bsky.social

10 months ago

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning capabilities of LLMs, particularly in mathematics and programming tasks. It i...

New study challenges a key belief about Reinforcement Learning with Verifiable Rewards (RLVR) for #LLMs:
#RLVR boosts efficiency but doesn't create new reasoning skills — #AI base models already had them!
arxiv.org/abs/2504.13837

Bäda

@moosbeda.bsky.social

10 months ago

Researchers doubt "reasoning" models: more efficient, yes; more intelligent, no

A new study questions whether reinforcement learning with verifiable rewards (RLVR) actually improves the reasoning abilities of large language models, or merely helps with known...

Reasoning models are apparently not more intelligent, only more efficient. #LLM #GenAI #RLVR
the-decoder.de/forscher-zwe...

Micha the DevOp

@michabbb.bsky.social

11 months ago

• 🧠 Advanced post-training with reinforcement learning with verifiable rewards (#RLVR) using Group Relative Policy Optimization

• 🔮 All models available in 7B, 13B, and 32B sizes, can be fine-tuned on a single H100 GPU

Winbuzzer

@winbuzzer.com

1 year ago

Alibaba’s R1-Omni AI Model Expands the Frontier of Emotion Recognition - WinBuzzer

R1-Omni utilizes Reinforcement Learning with Verifiable Reward (RLVR), enhancing its reasoning, accuracy, and adaptability.

#AI #AlibabaAI #GenAI #R1Omni #EmotionRecognition #China #OpenSourceAI #RLVR #AIModels

Meng Li

@mengli512.bsky.social

1 year ago

Alibaba Releases R1-Omni: First Full-Modality Emotion Recognition with DeepSeek-Style RLVR

Discover R1-Omni: Alibaba's open-source full-modality LLM that integrates DeepSeek-style RLVR for enhanced emotion recognition across video, audio, and visuals.

DeepSeek’s RLVR now powers a full-modal LLM (video, audio)! Ali Tongyi Lab’s Bo Liefeng team in Hangzhou open-sourced R1-Omni, boosting emotion recognition with enhanced reasoning, comprehension & generalization. What do you think? 🤔🚀

#DeepSeek #RLVR #LLM aidisruption.ai/p/alibaba-re...
