[1] arxiv.org/pdf/2402.03300
[2] hijkzzz.notion.site/unraveling-r...
One callout: [2] found similar performance between GRPO, RLOO, and REINFORCE++, but that GRPO was more prone to reward hacking, and that PPO with a pretrained critic outperforms GRPO.
The drawback of GRPO is that it requires generating many responses for the same prompt, so if you were previously generating few responses per prompt, GRPO may increase computation time.
Additionally, it moves the KL penalty into the loss function (RLHF usually adds the KL penalty to the rewards), which simplifies the computation of the advantage.
To do this, it starts by generating several responses for each query. Then when computing the advantage, it replaces the value function with the sample's reward, normalized by the mean and std of rewards across all responses to the same query.
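The group-normalized advantage described above can be sketched in a few lines (a minimal illustration, assuming scalar rewards; the actual GRPO loss in [1] also includes the clipped policy ratio and KL term):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: normalize each sample's reward by the
    mean and std of rewards across all responses to the same query."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# One query, four sampled responses with scalar rewards:
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the advantages are centered within the group, they sum to zero: above-average responses are pushed up and below-average ones pushed down, with no value network needed.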
Overview of GRPO (Group Relative Policy Optimization)
GRPO is an improvement on PPO introduced in the DeepSeekMath paper
The motivation is that PPO requires 4 large models, a policy, value function, reward model, and reference model. GRPO removes the need for the value model.
I want to train a transformer model to be a random number generator
At first it sounds dumb, but you could leverage GPU non-determinism to make it truly random, not just pseudo random
There are better ways to do rng so I still think it's a bad idea, but a cool bad idea
- Performance improvement from RLAIF vs. SFT depends on the base model. E.g., for Llama models, SFT is much more effective than RLAIF (see graph)
- Usually people use SFT data generated by GPT-3.5 and RLAIF feedback from GPT-4. If you use a higher-quality model to generate the SFT data, then RLAIF is usually less effective than SFT
- For RLAIF to be valuable, you need (1) a sufficiently strong pretrained base model and (2) a capability mismatch between the teacher used for SFT data collection and the critic used for collecting AI feedback
A Critical Evaluation of AI Feedback for Aligning Large Language Models
Investigates the paradigm of modifying model behavior by first doing SFT on data from a teacher model, then following with RLAIF training using a teacher reward model
They found:
ha ha nothing that funny, best was "AI/LLM Nerds (derogatory)"
All I did was post paper summaries, I guess people don't like my taste in papers
(most other approaches just do reasoning through prompting, or require the dataset to have reasoning included)
- Improvement varies a lot by dataset. There's a huge improvement on GSM8K, but ARC improvement is at most 1.6%
arxiv.org/abs/2411.04282
- The paper shows how to use a variational lower bound that explicitly includes the probability of the reasoning, then uses RL (REINFORCE Leave-One-Out) to optimize the reasoner. Basically, this gives a good way to train the model on reasoning without specialized training data
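The leave-one-out part is simple to sketch: for k samples from the same prompt, each sample's baseline is the mean reward of the other k-1 samples (a minimal illustration of the RLOO baseline only, not the full LaTRO objective):

```python
import numpy as np

def rloo_advantages(rewards):
    """REINFORCE Leave-One-Out: each sample's baseline is the mean
    reward of the other k-1 samples drawn for the same prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    k = len(r)
    baseline = (r.sum() - r) / (k - 1)  # leave-one-out mean
    return r - baseline

adv = rloo_advantages([1.0, 0.0, 0.0, 1.0])
```

Excluding the sample's own reward from its baseline keeps the gradient estimate unbiased, unlike naively subtracting the full-group mean.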
Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding
This paper introduces LaTent Reasoning Optimization (LaTRO), a training framework
- Improves zero shot accuracy by +12.5% on GSM8K over base models.
- Doesn't use external feedback or reward models
...
Nice to meet you, Mr. Cow
I'm suspicious he's real, the pinned tweet makes me think it's a parody (but mostly I'm suspicious because he followed me lol)
Fingers crossed this means they're reallocating all compute towards a shiny new model
Trying a new ML research area is tough. Every time I think I have a good new idea, I google and find there are already 50,000 papers covering it.
I cannot read 50,000 papers per day
(I'm doing a lit review of reasoning for LLMs. Aiming to post 1-2 paper summaries per day. This is the first in the series)
arxiv.org/pdf/2110.14168
- Found that training for more than 2 epochs on GSM8K caused performance degradation, because the verifiers require high sample diversity, and too many epochs cause diversity to collapse
- Using a verifier and generating too many solutions and picking the best actually hurts performance: it increases the probability of finding adversarial solutions that fool the verifier
- To use verifiers, they train a model to score if a solution is correct or not. They then generate many high temperature solutions, and use the verifier to select the solution most likely to be correct.
- GSM8K has 8.5K high quality grade school math word problems. Problems take 2-8 steps to solve, using +,-,x,/, with a focus on diversity of problems.
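The verifier-based selection described above boils down to best-of-n: sample many high-temperature solutions, score each with the learned verifier, and keep the top-scoring one (a minimal sketch with hypothetical inputs; the actual sampling and verifier models come from the paper's pipeline):

```python
def best_of_n(solutions, verifier_scores):
    """Return the sampled solution the verifier scores as most likely correct."""
    best_idx = max(range(len(solutions)), key=lambda i: verifier_scores[i])
    return solutions[best_idx]

# Hypothetical example: three sampled solutions with verifier scores.
pick = best_of_n(["42", "41", "43"], [0.91, 0.12, 0.55])
```

Note the caveat above: as n grows, so does the chance that one of the candidates is an adversarial solution that fools the verifier.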
Training Verifiers to Solve Math Word Problems (2021)
- This paper introduced GSM8K, and showed how using verifiers can significantly improve performance (up to 20+ percentage points compared to finetuning, see graph below)
Got it, will try out PSGD, thanks!
Untuned SOAP beats tuned AdamW at every single step
What's up with PSGD at the beginning? Does it need a while to warm up?
Adding reminder to never switch between bf16 and fp16 (especially watch out if you're using old GPUs)