Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages
https://arxiv.org/abs/2512.08777
Our post-training method uses on-policy RL, in which the model trains exclusively on its own generated responses, guided by reward signals from a "judge" LLM (that doesn't […]
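
To make the on-policy idea concrete, here is a minimal toy sketch, not the paper's implementation. It assumes a three-response "policy" in place of an LLM, a hypothetical `judge_reward` function in place of the judge model, and plain REINFORCE with a running baseline as a stand-in for whatever RL algorithm the authors use (the excerpt above does not say). The point it illustrates is that every training sample is drawn from the current policy and scored by the judge; no reference responses are involved.

```python
import math
import random

# Hypothetical candidate responses standing in for an LLM's output space.
RESPONSES = ["fluent answer", "stilted answer", "off-topic answer"]


def judge_reward(response: str) -> float:
    """Hypothetical stand-in for the judge LLM: maps a response to a scalar score."""
    scores = {"fluent answer": 1.0, "stilted answer": 0.3, "off-topic answer": 0.0}
    return scores[response]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=2000, lr=0.1, seed=0):
    """Run REINFORCE on the toy policy and return its final response probabilities."""
    rng = random.Random(seed)
    logits = [0.0] * len(RESPONSES)  # policy parameters (assumed toy form)
    baseline = 0.0                   # running mean reward, used to reduce variance
    for _ in range(steps):
        probs = softmax(logits)
        # On-policy step: the training sample comes from the current policy itself.
        idx = rng.choices(range(len(RESPONSES)), weights=probs)[0]
        reward = judge_reward(RESPONSES[idx])
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Gradient of log pi(idx) w.r.t. the logits is one_hot(idx) - probs.
        for j in range(len(logits)):
            grad = (1.0 if j == idx else 0.0) - probs[j]
            logits[j] += lr * advantage * grad
    return softmax(logits)


if __name__ == "__main__":
    for response, p in zip(RESPONSES, train()):
        print(f"{response}: {p:.3f}")
```

In this sketch the policy shifts probability mass toward whichever response the judge scores highest, so the quality of the result depends entirely on the judge's scores.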