3/ The eval “needle” should move to harder + broader benchmarks: SWE‑Bench Pro, Multi‑SWE‑Bench, SWE‑PolyBench.
iSWE‑Agent is top-ranked on Java for Multi‑SWE‑Bench and SWE‑PolyBench - research.ibm.com/blog/ibm-sof... , bsky.app/profile/did:...
2/ I argued the same last year, through a different lens: Verified is becoming non‑discriminative as the leaderboard saturates; measure the frontier slice instead.
jatinganhotra.dev/blog/swe-age...
1/ OpenAI: SWE‑Bench Verified is no longer a good frontier eval — test/spec mismatch + contamination.
openai.com/index/why-we...
🚀 IBM Research's iSWE-Agent is now #1 on the SWE-PolyBench (full) Java leaderboard 🎉
On the Verified subset, iSWE-Agent scores 46.38% on Java — matching Atlassian Rovo Dev and significantly outperforming Prometheus (33.33%).
More details: jatinganhotra.dev/news/
#AI #Java #SWEPolyBench
(repost welcome) The Generative Model Alignment team at IBM Research is looking for interns for next summer! Two openings, two topics:
🍰Reinforcement Learning environments for LLMs
🐎Speculative and non-auto regressive generation for LLMs
interested/curious? DM or email ramon.astudillo@ibm.com
4/4 Ready to see how AI really stacks up against human developers?
Join researchers and developers already evaluating patches → swebencharena.com
#AI #SoftwareEngineering #CodeQuality #AIEvaluation #SWEBenchArena
3/4 Unlike other platforms:
🚫 PR Arena: Tracks merge rates, not code quality
🚫 Yupp AI: Known models, not blind
🚫 SWE Arena: General coding, not SWE tasks
✅ SWE-Bench-Arena: Blind quality evaluation of real bug fixes
2/4 SWE-Bench-Arena fills this gap with blind evaluation across 5 dimensions:
• Simplicity
• Readability
• Performance
• Maintainability
• Correctness
No bias. Just quality assessment.
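A rough sketch of what one blind rating record could look like (hypothetical schema and scale, not the platform's actual one):

```python
from dataclasses import dataclass

@dataclass
class BlindRating:
    """One reviewer's blind judgment of a single patch (hypothetical schema)."""
    issue_id: str          # the GitHub issue the patch addresses (illustrative id)
    patch_label: str       # "A" or "B"; whether it is AI or human is hidden from the rater
    simplicity: int        # 1-5
    readability: int       # 1-5
    performance: int       # 1-5
    maintainability: int   # 1-5
    correctness: int       # 1-5

# Example record for one rating session
rating = BlindRating("example-issue-001", "A", 4, 5, 3, 4, 5)
```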
🧵 1/4 Current AI coding benchmarks miss the mark.
Claude 4 Sonnet hits 72.7% on SWE-Bench, but industry data shows code clones rose 48% (8.3% to 12.3%) and refactoring rates dropped from 25% to 10% since AI adoption.
(GitClear: gitclear.com/ai_assistant_code_quality_2025_research)
Try evaluating patches → swebencharena.com
What quality issues have you noticed with AI-generated code?
#AIEvaluation #SWEBenchArena #CodeQuality #AI #SoftwareEngineering
We need diverse perspectives from:
🎓 AI researchers
👩‍💻 Professional developers
📚 Academic teams
🚀 Startup engineers
Your input shapes the future of AI code evaluation standards.
How it works:
• Real GitHub issues from actual projects
• Side-by-side patch comparison
• Blind evaluation (you don't know which is AI vs human)
• Multi-dimensional quality assessment
Early results are fascinating - some AI solutions are surprisingly elegant, others create hidden technical debt 📊
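Roughly how the blinding could work, as a sketch (the real pipeline may differ): randomize which patch is shown as "A" vs "B", and keep the answer key server-side.

```python
import random

def blind_pair(ai_patch: str, human_patch: str, seed=None) -> dict:
    """Randomly assign two patches to labels 'A'/'B'; keep provenance separate."""
    rng = random.Random(seed)
    pair = [("ai", ai_patch), ("human", human_patch)]
    rng.shuffle(pair)
    # The rater only ever sees 'shown'; 'key' stays on the server until after voting.
    return {
        "shown": {"A": pair[0][1], "B": pair[1][1]},
        "key": {"A": pair[0][0], "B": pair[1][0]},
    }

blinded = blind_pair("diff --git a/fix1 ...", "diff --git a/fix2 ...", seed=42)
print(blinded["shown"].keys())  # rater sees only A and B
```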
That's why we built SWE-Bench-Arena - the first blind evaluation platform for AI code quality.
Instead of just "does it work?", we ask:
✅ Is it maintainable?
✅ Will teams understand it?
✅ Does it follow best practices?
✅ Is it unnecessarily complex?
🔍 AI models hit 72%+ on coding benchmarks, but there's a hidden problem...
Recent data shows concerning trends since AI adoption:
• 48% increase in code cloning
• Refactoring dropped from 25% to 10%
• Developers report "missing context" as #1 issue
Are we optimizing for the wrong metrics? 🧵
5. I call it the Visual Complexity Penalty — and I break it down in detail in my latest post:
🔗 jatinganhotra.dev/blog/swe-age...
📊 Includes full leaderboard analysis, complexity breakdown, and takeaways.
RT if you're building SWE agents — or trying to understand their real limits.
4. This isn't a benchmark artifact.
It's a wake-up call.
🧠 Current AI systems cannot effectively combine visual + structural code understanding.
And that's a serious problem for real-world software workflows.
3. It's not just the images.
Multimodal tasks often require multi-file edits and focus on JavaScript-based, user-facing applications rather than Python backends.
The combination of visual reasoning + frontend complexity is devastating.
2. Why the collapse?
📸 90.6% of instances in SWE-bench Multimodal contain visual content.
When images are present, solve rates drop from ~100% to ~25% across all top-performing agents.
1. SWE agents are getting better. Some achieve 70-75% accuracy on code-only benchmarks like SWE-bench Verified.
But when the same models are tested on SWE-bench Multimodal, scores fall to ~30%.
🚨 New Blog Post:
AI agents collapse under visual complexity.
A 73.2% performance drop when images are introduced in SWE-bench Multimodal.
Here's why this matters — and what it tells us about the future of AI in software engineering:
🧵👇
Like the tale of the Emperor's new clothes, sometimes we need fresh eyes on familiar benchmarks.
SWE-Bench Verified shows 73% success rates, but focusing on discriminative subsets reveals a different story: 11%
What really challenges AI agents? Analysis: jatinganhotra.dev/blog/swe-age...
Fascinating finding: When you remove the 156 problems that 61+ agents solve, performance drops dramatically
Top agents: 73% → 11%
This isn't about making things harder - it's about measuring what matters 🎯
jatinganhotra.dev/blog/swe-age...
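The arithmetic behind the drop, as a toy sketch (ids and numbers below are made up, purely illustrative):

```python
def solve_rate(resolved: set, subset_ids: set) -> float:
    """Fraction of a benchmark subset that an agent resolved."""
    return len(resolved & subset_ids) / len(subset_ids) if subset_ids else 0.0

# Toy ids, not real SWE-bench instance ids:
verified = {f"task-{i}" for i in range(500)}
easy = {f"task-{i}" for i in range(156)}         # stand-in for the 156 widely solved problems
resolved = {f"task-{i}" for i in range(170)}     # what one hypothetical agent resolved

print(f"full subset: {solve_rate(resolved, verified):.1%}")
print(f"hard subset: {solve_rate(resolved, verified - easy):.1%}")
```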
6/ Ready to benchmark YOUR agent properly?
Dataset available now:
🤗 huggingface.co/datasets/jatinganhotra/SWE-bench_Verified-discriminative
Stop optimizing for saturated benchmarks. Start measuring real progress.
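Minimal loading sketch with the Hugging Face `datasets` library (config/split and field names are my assumption; check the dataset card for the exact ones):

```python
from datasets import load_dataset  # pip install datasets

# Config/split names are assumed; see the dataset card on Hugging Face.
ds = load_dataset("jatinganhotra/SWE-bench_Verified-discriminative")
print(ds)  # lists the available splits and their sizes

# SWE-bench-style rows usually carry instance_id and problem_statement
# (field names assumed to mirror SWE-bench Verified).
split = next(iter(ds.values()))
row = split[0]
print(row.get("instance_id"), str(row.get("problem_statement", ""))[:80])
```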
5/ The results are eye-opening:
Claude 4 Opus on full benchmark: 73.2% ✅
Claude 4 Opus on Frontier subset: 11.6% 😬
This isn't just harder - it's revealing what agents ACTUALLY can't do
4/ Solution: 4 targeted subsets that reveal true agent capabilities
Each subset targets different evaluation needs - from maximum sensitivity (Frontier) to real-world complexity (MultiFile)
Performance drops from 73% to as low as 10%!
3/ I analyzed all 500 problems against 83 different SWE-agents
The distribution is shocking:
- 52 problems: ZERO agents can solve
- 26 problems: Only 1-2 agents succeed
- 156 problems: 61+ agents solve easily
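Roughly how that distribution can be computed from per-agent results (input shapes are hypothetical; the real counts come from published leaderboard logs):

```python
from collections import Counter

def solve_counts(all_ids: set, resolved_by_agent: dict) -> Counter:
    """How many agents resolved each instance, including 0 for never-solved ones."""
    counts = Counter({iid: 0 for iid in all_ids})
    for resolved in resolved_by_agent.values():
        for iid in resolved & all_ids:
            counts[iid] += 1
    return counts

def bucketize(counts: Counter) -> dict:
    """The buckets from the thread: unsolved, barely solved, widely solved."""
    values = list(counts.values())
    return {
        "0 agents": sum(c == 0 for c in values),
        "1-2 agents": sum(1 <= c <= 2 for c in values),
        "61+ agents": sum(c >= 61 for c in values),
    }
```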
2/ The problem: 156/500 problems are solved by 61+ agents
When everyone gets the same questions right, you can't tell who's actually better @anthropic.com
It's like ranking students when everyone scores 95%+ on the easy questions
1/ "What gets measured gets improved" - but are we measuring the right things?
SWE-Bench Verified has driven amazing progress, but with most agents solving the same 350+ problems, we need new targets @ofirpress.bsky.social
Enter: discriminative subsets that highlight genuine challenges 🧵