Joint work with Xingxing Zhang, @vamvas.bsky.social, @ricosennrich.bsky.social, and Furu Wei.
Overall, QueST opens new possibilities:
Scalable reasoning data generation
Training specialized generators for hard problems
Reducing dependence on human-labeled data
Future: Real-time difficulty estimation for RL
See more details in our paper.
Thanks for reading!
🧵 5/5
RESULTS: State-of-the-art performance at the 8B scale. Qwen3-8B-Base trained on our 212K synthetic problems matches DeepSeek-R1-671B on LiveCodeBench (LCB)!
🧵 4/5
🎯 OUR SOLUTION: QueST
Two key innovations:
1. Difficulty-aware graph sampling: selects concept combinations that lead to harder problems.
2. Rejection fine-tuning: trains generators to produce increasingly difficult problems.
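As a rough illustration, the two ideas above could be sketched like this (the function names, difficulty scorer, and threshold are hypothetical stand-ins, not the paper's implementation):

```python
import random

def sample_hard_combo(concepts, difficulty, k=3, n_candidates=50, rng=random):
    """Difficulty-aware sampling (toy version): draw candidate concept
    combinations and keep the one with the highest estimated difficulty."""
    candidates = [rng.sample(list(concepts), k) for _ in range(n_candidates)]
    return max(candidates, key=difficulty)

def rejection_finetune_data(generator, combos, difficulty, threshold):
    """Rejection fine-tuning (toy version): keep only generated problems
    judged hard enough; these become training data for the next round."""
    kept = []
    for combo in combos:
        problem = generator(combo)
        if difficulty(problem) >= threshold:
            kept.append((combo, problem))
    return kept
```

In the real system the difficulty estimate would come from a learned or proxy scorer rather than a toy heuristic; the loop structure (sample hard combos, generate, reject easy outputs, retrain) is the point.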
🧵 3/5
THE PROBLEM
Current reasoning data hits a wall:
- Competitive coding datasets: only 10-30K problems
- Creating hard problems requires PhD-level experts
- Existing synthetic methods don't specialize in difficulty
🧵 2/5
🔥 Introducing our new paper, QueST: training specialized generators to create challenging coding problems. arxiv.org/pdf/2510.17715
From Qwen3-8B-Base:
✅ 100K synthetic problems: better than Qwen3-8B
✅ Combined with human-written problems: matches DeepSeek-R1-671B
🧵 1/5
📢 Announcing the First Workshop on Multilingual and Multicultural Evaluation (MME) at #EACL2026 🇲🇦
MME focuses on resources, metrics & methodologies for evaluating multilingual systems! multilingual-multicultural-evaluation.github.io
📅 Workshop: Mar 24–29, 2026
🗓️ Submission deadline: Dec 19, 2025
We further propose a source-primed multi-turn variant, which lets the LLM first read the entire source document and then translate it through multi-turn chat. It achieves the best performance of all settings with GPT-4o-mini, Qwen-2.5-Instruct, and Llama-3.1-Instruct.
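For illustration, here is a minimal sketch of how a source-primed multi-turn setup could be wired; the message format, prompts, and `translate_fn` hook are my assumptions, not the paper's exact implementation:

```python
def source_primed_multiturn(segments, translate_fn):
    """Show the model the full source document first, then translate
    segment by segment in a growing multi-turn chat. Earlier turns stay
    in context, so at inference time the model can reuse its KV cache."""
    full_source = "\n".join(segments)
    messages = [
        {"role": "system", "content": "You are a document translator."},
        {"role": "user", "content": f"Here is the full source document:\n{full_source}"},
        {"role": "assistant", "content": "Understood. Send segments to translate."},
    ]
    translations = []
    for seg in segments:
        messages.append({"role": "user", "content": f"Translate: {seg}"})
        out = translate_fn(messages)  # e.g. a chat-completion API call
        messages.append({"role": "assistant", "content": out})
        translations.append(out)
    return translations
```

The design point is that each segment is translated with both the whole source (primed up front) and all previous translations visible, while the shared prefix is computed only once thanks to KV caching.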
We found that multi-turn translation achieves clearly better performance: it can access all previous context, while KV caching keeps the extra inference cost small.
We started with a comparison of previous baseline settings: inputting the whole source document at once (single-turn), segment-level translation, and multi-turn translation, where segments are translated progressively with previous turns cached.
I'm thrilled to share my first PhD project, joint work with
@vamvas.bsky.social and @ricosennrich.bsky.social
Paper link:
arxiv.org/pdf/2503.10494
Long context LLMs have paved the way for document translation, but is simply inputting the whole content the optimal way?
Here's the thread 🧵 [1/n]