3/ The eval “needle” should move to harder + broader benchmarks: SWE‑Bench Pro, Multi‑SWE‑Bench, SWE‑PolyBench.
iSWE‑Agent is top-ranked on Java for Multi‑SWE‑Bench and SWE‑PolyBench - research.ibm.com/blog/ibm-sof... , bsky.app/profile/did:...
2/ I argued the same last year, through a different lens: Verified is becoming non‑discriminative as the leaderboard saturates; measure the frontier slice instead.
jatinganhotra.dev/blog/swe-age...
1/ OpenAI: SWE‑Bench Verified is no longer a good frontier eval — test/spec mismatch + contamination.
openai.com/index/why-we...
🚀 IBM Research's iSWE-Agent is now #1 on the SWE-PolyBench (full) Java leaderboard 🎉
On the Verified subset, iSWE-Agent scores 46.38% on Java — matching Atlassian Rovo Dev and significantly outperforming Prometheus (33.33%).
More details: jatinganhotra.dev/news/
#AI #Java #SWEPolyBench
(repost welcome) The Generative Model Alignment team at IBM Research is looking for interns for next summer! Two openings, two topics:
🍰Reinforcement Learning environments for LLMs
🐎Speculative and non-auto regressive generation for LLMs
interested/curious? DM or email ramon.astudillo@ibm.com
4/4 Ready to see how AI really stacks up against human developers?
Join researchers and developers already evaluating patches → swebencharena.com
#AI #SoftwareEngineering #CodeQuality #AIEvaluation #SWEBenchArena
3/4 Unlike other platforms:
🚫 PR Arena: Tracks merge rates, not code quality
🚫 Yupp AI: Known models, not blind
🚫 SWE Arena: General coding, not SWE tasks
✅ SWE-Bench-Arena: Blind quality evaluation of real bug fixes
2/4 SWE-Bench-Arena fills this gap with blind evaluation across 5 dimensions:
• Simplicity
• Readability
• Performance
• Maintainability
• Correctness
No bias. Just quality assessment.
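A rough sketch of what one blind rating record could look like (hypothetical schema and scale, not the platform's actual one):

```python
from dataclasses import dataclass

@dataclass
class BlindRating:
    """One reviewer's blind judgment of a single patch (hypothetical schema)."""
    issue_id: str          # the GitHub issue the patch addresses (illustrative id)
    patch_label: str       # "A" or "B"; whether it is AI or human is hidden from the rater
    simplicity: int        # 1-5
    readability: int       # 1-5
    performance: int       # 1-5
    maintainability: int   # 1-5
    correctness: int       # 1-5

# Example record for one rating session
rating = BlindRating("example-issue-001", "A", 4, 5, 3, 4, 5)
```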
🧵 1/4 Current AI coding benchmarks miss the mark.
Claude 4 Sonnet hits 72.7% on SWE-Bench, but industry data shows code clones rose 48% (8.3% to 12.3%) and refactoring rates dropped from 25% to 10% since AI adoption.
(GitClear: gitclear.com/ai_assistant_code_quality_2025_research)
Try evaluating patches → swebencharena.com
What quality issues have you noticed with AI-generated code?
#AIEvaluation #SWEBenchArena #CodeQuality #AI #SoftwareEngineering
We need diverse perspectives from:
🎓 AI researchers
👩‍💻 Professional developers
📚 Academic teams
🚀 Startup engineers
Your input shapes the future of AI code evaluation standards.
How it works:
• Real GitHub issues from actual projects
• Side-by-side patch comparison
• Blind evaluation (you don't know which is AI vs human)
• Multi-dimensional quality assessment
Early results are fascinating - some AI solutions are surprisingly elegant, others create hidden technical debt 📊
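Roughly how the blinding could work, as a sketch (the real pipeline may differ): randomize which patch is shown as "A" vs "B", and keep the answer key server-side.

```python
import random

def blind_pair(ai_patch: str, human_patch: str, seed=None) -> dict:
    """Randomly assign two patches to labels 'A'/'B'; keep provenance separate."""
    rng = random.Random(seed)
    pair = [("ai", ai_patch), ("human", human_patch)]
    rng.shuffle(pair)
    # The rater only ever sees 'shown'; 'key' stays on the server until after voting.
    return {
        "shown": {"A": pair[0][1], "B": pair[1][1]},
        "key": {"A": pair[0][0], "B": pair[1][0]},
    }

blinded = blind_pair("diff --git a/fix1 ...", "diff --git a/fix2 ...", seed=42)
print(blinded["shown"].keys())  # rater sees only A and B
```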
That's why we built SWE-Bench-Arena - the first blind evaluation platform for AI code quality.
Instead of just "does it work?", we ask:
✅ Is it maintainable?
✅ Will teams understand it?
✅ Does it follow best practices?
✅ Is it unnecessarily complex?
🔍 AI models hit 72%+ on coding benchmarks, but there's a hidden problem...
Recent data shows concerning trends since AI adoption:
• 48% increase in code cloning
• Refactoring dropped from 25% to 10%
• Developers report "missing context" as #1 issue
Are we optimizing for the wrong metrics? 🧵
5. I call it the Visual Complexity Penalty — and I break it down in detail in my latest post:
🔗 jatinganhotra.dev/blog/swe-age...
📊 Includes full leaderboard analysis, complexity breakdown, and takeaways.
RT if you're building SWE agents — or trying to understand their real limits.
4. This isn't a benchmark artifact.
It's a wake-up call.
🧠 Current AI systems cannot effectively combine visual + structural code understanding.
And that's a serious problem for real-world software workflows.
3. It's not just the images.
Multimodal tasks often require multi-file edits and focus on JavaScript-based, user-facing applications rather than Python backends.
The combination of visual reasoning + frontend complexity is devastating.
2. Why the collapse?
📸 90.6% of instances in SWE-bench Multimodal contain visual content.
When images are present, solve rates drop from ~100% to ~25% across all top-performing agents.
1. SWE agents are getting better. Some achieve 70-75% accuracy on code-only benchmarks like SWE-bench Verified.
But when the same models are tested on SWE-bench Multimodal, scores fall to ~30%.
🚨 New Blog Post:
AI agents collapse under visual complexity.
A 73.2% performance drop when images are introduced in SWE-bench Multimodal.
Here's why this matters — and what it tells us about the future of AI in software engineering:
🧵👇
Like the tale of the Emperor's new clothes, sometimes we need fresh eyes on familiar benchmarks.
SWE-Bench Verified shows 73% success rates, but focusing on discriminative subsets reveals a different story: 11%
What really challenges AI agents? Analysis: jatinganhotra.dev/blog/swe-age...
Fascinating finding: When you remove the 156 problems that 61+ agents solve, performance drops dramatically
Top agents: 73% → 11%
This isn't about making things harder - it's about measuring what matters 🎯
jatinganhotra.dev/blog/swe-age...
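The arithmetic behind the drop, as a toy sketch (ids and numbers below are made up, purely illustrative):

```python
def solve_rate(resolved: set, subset_ids: set) -> float:
    """Fraction of a benchmark subset that an agent resolved."""
    return len(resolved & subset_ids) / len(subset_ids) if subset_ids else 0.0

# Toy ids, not real SWE-bench instance ids:
verified = {f"task-{i}" for i in range(500)}
easy = {f"task-{i}" for i in range(156)}         # stand-in for the 156 widely solved problems
resolved = {f"task-{i}" for i in range(170)}     # what one hypothetical agent resolved

print(f"full subset: {solve_rate(resolved, verified):.1%}")
print(f"hard subset: {solve_rate(resolved, verified - easy):.1%}")
```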
6/ Ready to benchmark YOUR agent properly?
Dataset available now:
🤗 huggingface.co/datasets/jatinganhotra/SWE-bench_Verified-discriminative
Stop optimizing for saturated benchmarks. Start measuring real progress.
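Minimal loading sketch with the Hugging Face `datasets` library (config/split and field names are my assumption; check the dataset card for the exact ones):

```python
from datasets import load_dataset  # pip install datasets

# Config/split names are assumed; see the dataset card on Hugging Face.
ds = load_dataset("jatinganhotra/SWE-bench_Verified-discriminative")
print(ds)  # lists the available splits and their sizes

# SWE-bench-style rows usually carry instance_id and problem_statement
# (field names assumed to mirror SWE-bench Verified).
split = next(iter(ds.values()))
row = split[0]
print(row.get("instance_id"), str(row.get("problem_statement", ""))[:80])
```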
5/ The results are eye-opening:
Claude 4 Opus on full benchmark: 73.2% ✅
Claude 4 Opus on Frontier subset: 11.6% 😬
This isn't just harder - it's revealing what agents ACTUALLY can't do
4/ Solution: 4 targeted subsets that reveal true agent capabilities
Each subset targets different evaluation needs - from maximum sensitivity (Frontier) to real-world complexity (MultiFile)
Performance drops from 73% to as low as 10%!
3/ I analyzed all 500 problems against 83 different SWE-agents
The distribution is shocking:
- 52 problems: ZERO agents can solve
- 26 problems: Only 1-2 agents succeed
- 156 problems: 61+ agents solve easily
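Roughly how that distribution can be computed from per-agent results (input shapes are hypothetical; the real counts come from published leaderboard logs):

```python
from collections import Counter

def solve_counts(all_ids: set, resolved_by_agent: dict) -> Counter:
    """How many agents resolved each instance, including 0 for never-solved ones."""
    counts = Counter({iid: 0 for iid in all_ids})
    for resolved in resolved_by_agent.values():
        for iid in resolved & all_ids:
            counts[iid] += 1
    return counts

def bucketize(counts: Counter) -> dict:
    """The buckets from the thread: unsolved, barely solved, widely solved."""
    values = list(counts.values())
    return {
        "0 agents": sum(c == 0 for c in values),
        "1-2 agents": sum(1 <= c <= 2 for c in values),
        "61+ agents": sum(c >= 61 for c in values),
    }
```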
2/ The problem: 156/500 problems are solved by 61+ agents
When everyone gets the same questions right, you can't tell who's actually better @anthropic.com
It's like ranking students when everyone scores 95%+ on the easy questions
1/ "What gets measured gets improved" - but are we measuring the right things?
SWE-Bench Verified has driven amazing progress, but with most agents solving the same 350+ problems, we need new targets @ofirpress.bsky.social
Enter: discriminative subsets that highlight genuine challenges 🧵