For more details, please check out our
Blog: yinglunz.com/blogs/ttm.html
Paper: arxiv.org/pdf/2510.07632
Code: github.com/yinglunz/tes...
Joint work with Jiancheng Zhang and Fuzhi Tang. Feedback and thoughts are very welcome!
Two takeaways:
1. Eval lies at the heart of AI progress.
2. Iterative, matching-based self-improvement works -- and should be explored beyond compositional reasoning!
TTM can also be extended to datasets without local groups -- by treating the entire dataset as a global assignment problem between all images and captions (solved in polynomial time).
The global TTM variant achieves up to 33.3% relative error reduction.
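One way to sketch the global variant (illustrative, not the paper's exact implementation): given a similarity matrix between all images and all captions, a standard assignment solver such as SciPy's linear_sum_assignment finds the maximum-weight one-to-one matching in polynomial time.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def global_ttm_matching(sim: np.ndarray) -> np.ndarray:
    """Treat the whole test set as one bipartite assignment problem:
    find the one-to-one matching of captions to images that maximizes
    total similarity. linear_sum_assignment runs in polynomial time."""
    row_idx, col_idx = linear_sum_assignment(sim, maximize=True)
    return col_idx  # col_idx[i] = caption index assigned to image i
```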
TTM isn't limited to benchmarks with k-by-k groups.
For 1-by-k groups, GroupMatch = GroupScore, so metric change brings no benefit. Yet, TTM still delivers substantial improvements -- up to 85.7% -- on datasets such as SugarCrepe and WhatsUp.
TTM provides substantial improvements on top of SimpleMatch, without external supervision.
Remarkably, TTM enables SigLIP-B16 (~ 0.2B params) to surpass GPT-4.1 on MMVP-VLM.
Shout out to the awesome authors behind SigLIP! @giffmana.ai @xzhai.bsky.social @kolesnikov.ch and Basil Mustafa
To push further, we develop Test-Time Matching (TTM), an iterative, self-improving algorithm with two key components:
(i) GroupMatch-based pseudo-labels for stronger supervision.
(ii) A progressively decaying selection threshold schedule to gradually expand coverage across the test set.
SimpleMatch reveals substantial hidden capability -- it enables SigLIP-B16 to surpass all prior results and GPT-4.1 to achieve the first result surpassing human performance on Winoground.
Because a correct GroupMatch also guarantees a perfect GroupScore, this creates an arbitrage opportunity via a two-step SimpleMatch procedure:
1. Select the most likely matching under GroupMatch.
2. Overfit to that matching at test time.
We introduce a new GroupMatch metric that evaluates the best overall matching instead of isolated pairwise comparisons.
This increases the random-guessing success rate to 1/k! (from 1/6 to 1/2 when k = 2).
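The two metrics can be sketched for a single group, assuming a k-by-k similarity matrix sim[i][j] between image i and caption j (function names are illustrative). Under random guessing, GroupMatch succeeds when the identity permutation happens to be the best of the k! matchings, hence the 1/k! rate.

```python
import itertools

def group_score(sim):
    """GroupScore-style criterion: every correct pair must beat all of
    its pairwise alternatives (rows = images, cols = captions)."""
    k = len(sim)
    return all(
        sim[i][i] > sim[i][j] and sim[i][i] > sim[j][i]
        for i in range(k) for j in range(k) if i != j
    )

def group_match(sim):
    """GroupMatch: success iff the identity matching is the single
    highest-scoring one-to-one assignment of captions to images."""
    k = len(sim)
    best = max(itertools.permutations(range(k)),
               key=lambda p: sum(sim[i][p[i]] for i in range(k)))
    return best == tuple(range(k))
```

In the toy case below the model gets the overall matching right even though one pairwise comparison collides, so GroupMatch succeeds where GroupScore fails.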
The widely used GroupScore metric requires one-to-one alignment between k images and k captions without enforcing consistency -- a single collision means failure.
Under random guessing, the success rate is (k-1)! / (2k-1)! -- only 1/6 when k = 2.
Multimodal models, even frontier ones, have long been reported to perform at or below random guessing on compositional reasoning benchmarks.
Why does this happen?
We find that part of the difficulty lies in the evaluation metric itself.
Super excited to share Test-Time Matching (TTM), an iterative, self-improving algorithm that unlocks substantial compositional reasoning capabilities in multimodal models.
TTM enables SigLIP-B16 (~0.2B params) to outperform GPT-4.1 on MMVP-VLM, establishing a new SOTA.
Paper: yinglunz.com/pdfs/dtrl.pdf
Joint work with my student Junkai Luo.
Feedback welcome!
Our algorithm achieves SOTA performance across multiple benchmarks.
We hope these ideas also inspire improvements to GRPO for LLMs -- especially in credit assignment.
Building on this insight, we adapt GRPO to online finetuning of DTs, introducing:
• Sub-trajectory optimization -- better credit assignment
• Sequence-level likelihood objectives (concurrent w/ GSPO) -- stability & efficiency
• Active sampling -- improved exploration in uncertain regions
We identify hindsight return relabeling as the key obstacle: while useful for supervised objectives, it destabilizes importance weights for RL methods like PPO and GRPO.
Excited to share our new paper:
Online Finetuning Decision Transformers with Pure RL Gradients
RL drives reasoning in LLMs -- but remains underexplored for online finetuning of Decision Transformers (DTs), where most methods still rely mainly on supervised objectives.
Why?
Paper: arxiv.org/pdf/2510.03247
Joint work with my student Jiancheng Zhang.
Feedback welcome!
3/3
Our algorithm combines uncertainty and diversity principles in a modality-aware design, achieves linear-time acquisition, and applies seamlessly to both pool-based and streaming-based settings. It delivers consistent gains over baselines across multiple benchmarks, including COCO and DataComp.
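The paper's modality-aware design is more involved, but the uncertainty + diversity combination with linear-time acquisition can be illustrated generically (all names and the bucketing rule below are illustrative, not the paper's method):

```python
def acquire(embeddings, uncertainty, k):
    """Illustrative uncertainty + diversity acquisition in linear time:
    bucket points by their dominant embedding dimension (a crude
    diversity proxy), then take the most uncertain point from each
    bucket, round-robin, until k points are selected."""
    buckets = {}
    for idx, e in enumerate(embeddings):
        dom = max(range(len(e)), key=lambda d: e[d])
        buckets.setdefault(dom, []).append(idx)
    for b in buckets.values():
        b.sort(key=lambda i: uncertainty[i], reverse=True)
    selected, depth = [], 0
    while len(selected) < k:
        progressed = False
        for b in buckets.values():
            if depth < len(b) and len(selected) < k:
                selected.append(b[depth])
                progressed = True
        if not progressed:
            break
        depth += 1
    return selected
```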
2/3
Sharing new paper: Towards Multimodal Active Learning: Efficient Learning with Limited Paired Data
We extend classical unimodal active learning to the multimodal setting with unaligned data, enabling data-efficient finetuning and pretraining of vision-language models such as CLIP and SigLIP.
1/3
We hope this work inspires more research on adaptive, efficient deployment of LLMs -- where compute is used strategically rather than blindly.
Joint work with my student Bowen Zuo.
Feedback welcome!
Most methods allocate compute uniformly, ignoring variation in query difficulty.
We propose adaptive algorithms that estimate query difficulty on the fly and allocate compute strategically -- just enough for easy queries and more for hard ones.
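A toy sketch of the allocation idea, using a disagreement-based difficulty proxy (illustrative only; the paper's bandit estimator is more principled):

```python
from collections import Counter

def adaptive_allocate(queries, sample, budget, warmup=2):
    """Difficulty-aware compute allocation: draw a few warm-up samples
    per query, then repeatedly spend the remaining budget on the query
    whose answers disagree most (a proxy for difficulty). Final answer
    per query is a majority vote over its samples."""
    answers = {q: [sample(q) for _ in range(warmup)] for q in queries}
    spent = warmup * len(queries)
    while spent < budget:
        def disagreement(q):
            # fraction of samples not matching the current majority answer
            counts = Counter(answers[q])
            return 1 - counts.most_common(1)[0][1] / len(answers[q])
        hardest = max(queries, key=disagreement)
        answers[hardest].append(sample(hardest))
        spent += 1
    return {q: Counter(a).most_common(1)[0][0] for q, a in answers.items()}
```

In the toy trace below, the query with consistent answers keeps its two warm-up samples while the inconsistent one absorbs the rest of the budget.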
Example (avg. budget = 32):
(2/3)
Excited to share our new paper: Strategic Scaling of Test-Time Compute: A Bandit Learning Approach.
We turn test-time compute allocation into a bandit learning problem, achieving:
✅ +11.10% on MATH-500
✅ +7.41% on LiveCodeBench
Paper: arxiv.org/pdf/2506.12721
(1/3)
There is a ton of interest in the question of whether AI can be funny: www.bbc.com/future/artic.... Our paper at NeurIPS investigates the humor generation capabilities of the latest and greatest AI models using one of the world's largest humor datasets! arxiv.org/pdf/2406.10522
Iβm recruiting multiple PhD students for Fall 2025 at UCR! If youβre interested in working on efficient ML, RL, and LLMs, please apply to the UCR CS/EE PhD program.
Please visit yinglunz.com for detailed information on research directions and contact instructions.