For more details, please check out our
Blog: yinglunz.com/blogs/ttm.html
Paper: arxiv.org/pdf/2510.07632
Code: github.com/yinglunz/tes...
Joint work with Jiancheng Zhang and Fuzhi Tang. Feedback and thoughts are very welcome!
Two takeaways:
1. Eval lies at the heart of AI progress.
2. Iterative, matching-based self-improvement works -- and should be explored beyond compositional reasoning!
TTM can also be extended to datasets without local groups -- by treating the entire dataset as a global assignment problem between all images and captions (solved in polynomial time).
The global TTM variant achieves up to 33.3% relative error reduction.
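One way to sketch the global variant (illustrative, not the paper's exact implementation): given a similarity matrix between all images and all captions, a standard assignment solver such as SciPy's linear_sum_assignment finds the maximum-weight one-to-one matching in polynomial time.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def global_ttm_matching(sim: np.ndarray) -> np.ndarray:
    """Treat the whole test set as one bipartite assignment problem:
    find the one-to-one matching of captions to images that maximizes
    total similarity. linear_sum_assignment runs in polynomial time."""
    row_idx, col_idx = linear_sum_assignment(sim, maximize=True)
    return col_idx  # col_idx[i] = caption index assigned to image i
```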
TTM isn't limited to benchmarks with k-by-k groups.
For 1-by-k groups, GroupMatch = GroupScore, so metric change brings no benefit. Yet, TTM still delivers substantial improvements -- up to 85.7% -- on datasets such as SugarCrepe and WhatsUp.
TTM provides substantial improvements on top of SimpleMatch, without external supervision.
Remarkably, TTM enables SigLIP-B16 (~ 0.2B params) to surpass GPT-4.1 on MMVP-VLM.
Shout out to the awesome authors behind SigLIP! @giffmana.ai @xzhai.bsky.social @kolesnikov.ch and Basil Mustafa
To push further, we develop Test-Time Matching (TTM), an iterative, self-improving algorithm with two key components:
(i) GroupMatch-based pseudo-labels for stronger supervision.
(ii) A progressively decaying selection threshold schedule to gradually expand coverage across the test set.
SimpleMatch reveals substantial hidden capability -- it enables SigLIP-B16 to surpass all prior results and GPT-4.1 to achieve the first result surpassing human performance on Winoground.
Because a correct GroupMatch also guarantees a perfect GroupScore, this creates an arbitrage opportunity via a two-step SimpleMatch procedure:
1. Select the most likely matching under GroupMatch.
2. Overfit to that matching at test time.
We introduce a new GroupMatch metric that evaluates the best overall matching instead of isolated pairwise comparisons.
This increases the random-guessing success rate to 1/k! (from 1/6 to 1/2 when k = 2).
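The two metrics can be sketched for a single group, assuming a k-by-k similarity matrix sim[i][j] between image i and caption j (function names are illustrative). Under random guessing, GroupMatch succeeds when the identity permutation happens to be the best of the k! matchings, hence the 1/k! rate.

```python
import itertools

def group_score(sim):
    """GroupScore-style criterion: every correct pair must beat all of
    its pairwise alternatives (rows = images, cols = captions)."""
    k = len(sim)
    return all(
        sim[i][i] > sim[i][j] and sim[i][i] > sim[j][i]
        for i in range(k) for j in range(k) if i != j
    )

def group_match(sim):
    """GroupMatch: success iff the identity matching is the single
    highest-scoring one-to-one assignment of captions to images."""
    k = len(sim)
    best = max(itertools.permutations(range(k)),
               key=lambda p: sum(sim[i][p[i]] for i in range(k)))
    return best == tuple(range(k))
```

In the toy case below the model gets the overall matching right even though one pairwise comparison collides, so GroupMatch succeeds where GroupScore fails.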
The widely used GroupScore metric requires one-to-one alignment between k images and k captions without enforcing consistency -- a single collision means failure.
Under random guessing, the success rate is (k-1)! / (2k-1)! -- only 1/6 when k = 2.
Multimodal models, even frontier ones, have long been reported to perform at or below random guessing on compositional reasoning benchmarks.
Why does this happen?
We find that part of the difficulty lies in the evaluation metric itself.
Super excited to share Test-Time Matching (TTM), an iterative, self-improving algorithm that unlocks substantial compositional reasoning capabilities in multimodal models.
TTM enables SigLIP-B16 (~0.2B params) to outperform GPT-4.1 on MMVP-VLM, establishing a new SOTA.
Paper: yinglunz.com/pdfs/dtrl.pdf
Joint work with my student Junkai Luo.
Feedback welcome!
Our algorithm achieves SOTA performance across multiple benchmarks.
We hope these ideas also inspire improvements to GRPO for LLMs -- especially in credit assignment.
Building on this insight, we adapt GRPO to online finetuning of DTs, introducing:
• Sub-trajectory optimization -- better credit assignment
• Sequence-level likelihood objectives (concurrent w/ GSPO) -- stability & efficiency
• Active sampling -- improved exploration in uncertain regions
We identify hindsight return relabeling as the key obstacle: while useful for supervised objectives, it destabilizes importance weights for RL methods like PPO and GRPO.
Excited to share our new paper:
Online Finetuning Decision Transformers with Pure RL Gradients
RL drives reasoning in LLMs -- but remains underexplored for online finetuning of Decision Transformers (DTs), where most methods still rely mainly on supervised objectives.
Why?
Paper: arxiv.org/pdf/2510.03247
Joint work with my student Jiancheng Zhang.
Feedback welcome!
3/3
Our algorithm combines uncertainty and diversity principles in a modality-aware design, achieves linear-time acquisition, and applies seamlessly to both pool-based and streaming-based settings. It delivers consistent gains over baselines across multiple benchmarks, including COCO and DataComp.
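The paper's modality-aware design is more involved, but the uncertainty + diversity combination with linear-time acquisition can be illustrated generically (all names and the bucketing rule below are illustrative, not the paper's method):

```python
def acquire(embeddings, uncertainty, k):
    """Illustrative uncertainty + diversity acquisition in linear time:
    bucket points by their dominant embedding dimension (a crude
    diversity proxy), then take the most uncertain point from each
    bucket, round-robin, until k points are selected."""
    buckets = {}
    for idx, e in enumerate(embeddings):
        dom = max(range(len(e)), key=lambda d: e[d])
        buckets.setdefault(dom, []).append(idx)
    for b in buckets.values():
        b.sort(key=lambda i: uncertainty[i], reverse=True)
    selected, depth = [], 0
    while len(selected) < k:
        progressed = False
        for b in buckets.values():
            if depth < len(b) and len(selected) < k:
                selected.append(b[depth])
                progressed = True
        if not progressed:
            break
        depth += 1
    return selected
```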
2/3
Sharing new paper: Towards Multimodal Active Learning: Efficient Learning with Limited Paired Data
We extend classical unimodal active learning to the multimodal setting with unaligned data, enabling data-efficient finetuning and pretraining of vision-language models such as CLIP and SigLIP.
1/3
We hope this work inspires more research on adaptive, efficient deployment of LLMs -- where compute is used strategically rather than blindly.
Joint work with my student Bowen Zuo.
Feedback welcome!
Most methods allocate compute uniformly, ignoring variation in query difficulty.
We propose adaptive algorithms that estimate query difficulty on the fly and allocate compute strategically -- just enough for easy queries and more for hard ones.
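A toy sketch of the allocation idea, using a disagreement-based difficulty proxy (illustrative only; the paper's bandit estimator is more principled):

```python
from collections import Counter

def adaptive_allocate(queries, sample, budget, warmup=2):
    """Difficulty-aware compute allocation: draw a few warm-up samples
    per query, then repeatedly spend the remaining budget on the query
    whose answers disagree most (a proxy for difficulty). Final answer
    per query is a majority vote over its samples."""
    answers = {q: [sample(q) for _ in range(warmup)] for q in queries}
    spent = warmup * len(queries)
    while spent < budget:
        def disagreement(q):
            # fraction of samples not matching the current majority answer
            counts = Counter(answers[q])
            return 1 - counts.most_common(1)[0][1] / len(answers[q])
        hardest = max(queries, key=disagreement)
        answers[hardest].append(sample(hardest))
        spent += 1
    return {q: Counter(a).most_common(1)[0][0] for q, a in answers.items()}
```

In the toy trace below, the query with consistent answers keeps its two warm-up samples while the inconsistent one absorbs the rest of the budget.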
Example (avg. budget = 32):
(2/3)
Excited to share our new paper: Strategic Scaling of Test-Time Compute: A Bandit Learning Approach.
We turn test-time compute allocation into a bandit learning problem, achieving:
✅ +11.10% on MATH-500
✅ +7.41% on LiveCodeBench
Paper: arxiv.org/pdf/2506.12721
(1/3)
There is a ton of interest in the question of whether AI can be funny: www.bbc.com/future/artic.... Our paper at NeurIPS investigates the humor generation capabilities of the latest and greatest AI models using one of the world's largest humor datasets! arxiv.org/pdf/2406.10522
Iβm recruiting multiple PhD students for Fall 2025 at UCR! If youβre interested in working on efficient ML, RL, and LLMs, please apply to the UCR CS/EE PhD program.
Please visit yinglunz.com for detailed information on research directions and contact instructions.