#ICML2025
Is standard RLHF optimal in view of test-time scaling? Unsurprisingly, no.
We show that a simple change to the standard RLHF framework, involving 𝐫𝐞𝐰𝐚𝐫𝐝 𝐜𝐚𝐥𝐢𝐛𝐫𝐚𝐭𝐢𝐨𝐧 and 𝐫𝐞𝐰𝐚𝐫𝐝 𝐭𝐫𝐚𝐧𝐬𝐟𝐨𝐫𝐦𝐚𝐭𝐢𝐨𝐧 (suited to the test-time procedure), is optimal!
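A minimal sketch of the idea in the post above. The function names, the quantile-based calibration, and the exponential shape of the transformation are all illustrative assumptions here, not the paper's implementation; the paper derives the transformation actually suited to the test-time procedure (e.g. best-of-n sampling).

```python
import numpy as np

def calibrate_reward(raw_reward, baseline_rewards):
    # Calibrated reward: the quantile of raw_reward among rewards of
    # samples drawn from the base policy for the same prompt
    # (i.e. the empirical CDF evaluated at raw_reward).
    baseline = np.asarray(baseline_rewards, dtype=float)
    return float(np.mean(baseline <= raw_reward))

def transform_reward(calibrated, t=4.0):
    # Hypothetical monotone transformation of the calibrated reward;
    # the shape would be chosen to match the test-time procedure.
    return float(np.exp(t * calibrated))

# RLHF would then optimize transform_reward(calibrate_reward(...))
# in place of the raw reward-model score.
```

Calibration makes rewards comparable across prompts (every calibrated reward lives in [0, 1]); the transformation then reshapes incentives to favor what the test-time procedure will actually exploit.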
How do language models generalize from information they learn in-context vs. via finetuning? In arxiv.org/abs/2505.00661 we show that in-context learning can generalize more flexibly, illustrating key differences in the inductive biases of these modes of learning — and ways to improve finetuning. 1/
Gemini is now Pareto-optimal across most price points... Amazon Nova still seems to do a great job at the very lowest.
www.linkedin.com/posts/jonas-...
It's great to see fundamental research contributions from Indian institutions that aren't IIT/IISc/TIFR: telanganatoday.com/hyderabad-ba...
We have always been at war with Eastasia.
Paging @cosmicskeptic.bsky.social. The wine glass is now full! www.reddit.com/r/singularit...
Context: www.youtube.com/watch?v=160F...
Many more cool examples that people are finding in this reddit thread (including visual story generation, funky editing and style changes, recontextualization and more!):
www.reddit.com/r/singularit...
An example of reasoning in 'pixel space':
Notably, we pass the 'room without an elephant' test (medium.com/@avanib28264...)
Image quality is not quite as high as our SOTA Imagen 3 model (see previous post :) ), but the ability to do reasoning in a combination of text and pixel space unlocks some amazing new capabilities, like interleaved generation of text and images, and just jamming on crazy creative ideas with Gemini.
Gemini 2.0 Image output is live on aistudio.google.com. This was an amazing effort by so many people in the Gemini team and partners at GDM + the rest of Google, and I'm so honoured and privileged to have been part of it. 🧵->
New paper: Simulating Time With Square-Root Space
people.csail.mit.edu/rrw/time-vs-...
It's still hard for me to believe it myself, but I seem to have shown that TIME[t] is contained in SPACE[sqrt{t log t}].
To appear in STOC. Comments are very welcome!
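Restating the containment above in standard complexity-class notation:

```latex
\[
  \mathsf{TIME}[t] \subseteq \mathsf{SPACE}\!\left[\sqrt{t \log t}\,\right]
\]
```

That is, any computation running in time t can be simulated using on the order of sqrt(t log t) space.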
The American Mathematical Society has also started a page to coordinate support for professional mathematics, so far focusing on executive orders impacting the National Science Foundation: www.ams.org/government/g...
A letter of support for the NIH funding of biomedical research, and the damage wrought by imposing severe caps on indirect costs: docs.google.com/forms/d/1Agz...
A hard problem I found that LLMs struggle to get right: 'Which of Quine's two dogmas is about the analytic-synthetic distinction?' It's a common misconception that it's the first, but it's actually *both* (deducible by reading en.m.wikipedia.org/wiki/Two_Dog... carefully).
Imagen 3 (deepmind.google/technologies...) is now the top-ranking model on the lmsys image generation arena, by a significant amount. Proud to have been part of the team that built it (and there's even more to come soon!).
This is such amazing work by @abeirami.bsky.social and collaborators. A deep investigation of a simple and practically important idea. Highly relevant to our own work and anywhere else RL is used for Gen AI.
Lol... u'll always be my #1 Nathan. But your post did remind me that some collaborators of mine at Google did some research on content ecosystems: https://arxiv.org/abs/2309.06375
For example, I've been really enjoying following this guy's math channel: youtube.com/@oceansofmat... Not as slick as 3b1b, but still quite informative.
There are many content creators who have made it huge by putting in a lot of work. Good for them, but I am confused by how unbalanced the content economy is. There are so many small creators out there producing unique content who deserve far more love and support.
We even show you can do this without a specialized heatmap model, if you have a good classifier for the badness you want to eliminate by fine-tuning: simply use a pixel-attribution technique like Grad-CAM to generate the heatmap!
Surprisingly effective: the problematic parts are changed, but everything else in the fine-tuned model's output stays the same. This is different from an editing model, where two rounds of inference are needed to fix the problematic parts.
Then you fine-tune using a combination of DRAFT (arxiv.org/html/2309.17...) and our custom region-aware fine-tuning objective.
You generate a heatmap highlighting the problematic region (e.g. using our previous work on Rich Human Feedback for T2I): arxiv.org/pdf/2312.10240
The idea is simple. If the image from the base model has a region that's (say) NSFW:
Check out our new paper on Focus-N-Fix, a simple and effective approach to fine-tuning text-to-image generation models by only fixing regions that were problematic in the base model's image.
arxiv.org/abs/2501.06481
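The fine-tuning objective described in this thread might be sketched roughly as follows. The function name and the exact per-pixel weighting are illustrative assumptions for exposition, not the paper's actual loss:

```python
import numpy as np

def region_aware_loss(pred, base_img, target, heatmap):
    # Illustrative region-aware objective: inside the problematic region
    # (heatmap ~ 1), pull the model output toward a corrected target;
    # outside it (heatmap ~ 0), penalize any deviation from the base
    # model's image, so everything else stays unchanged.
    pred = np.asarray(pred, dtype=float)
    base_img = np.asarray(base_img, dtype=float)
    target = np.asarray(target, dtype=float)
    heatmap = np.asarray(heatmap, dtype=float)
    fix_term = heatmap * (pred - target) ** 2
    preserve_term = (1.0 - heatmap) * (pred - base_img) ** 2
    return float(np.mean(fix_term + preserve_term))
```

The preserve term is what distinguishes this from plain fine-tuning: it anchors the unproblematic pixels to the base model's output in a single pass, rather than requiring a second editing round at inference time.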
...Stevenson flatly rejected the Soviet offer, telling Menshikov that he "considered the offer of such assistance highly improper, indiscreet and dangerous to all concerned". Stevenson then reported the incident directly to President Eisenhower.
How far we have fallen.
An interesting tidbit about the late, great Adlai Stevenson: he was approached by Soviet ambassador Menshikov, who offered Soviet financial and public-relations help to assist him in getting elected if he decided to run...
I am so tired of waiting,
Aren't you,
For the world to become good
And beautiful and kind?
Let's take a knife
And cut the world in two -
And see what worms are eating
At the rind.

Langston Hughes
Super proud to have been part of the Imagen 3 work, and a huge shout-out to the Veo 2 team!