METR: Half of SWE-Bench Passes Fail Real Code Review
awesomeagents.ai/news/metr-swe-bench-main...
#SweBench #Benchmarks #AiCoding
Latest posts tagged with #SweBench on Bluesky
METR: Half of SWE-Bench Passes Fail Real Code Review
awesomeagents.ai/news/metr-swe-bench-main...
#SweBench #Benchmarks #AiCoding
Claude Code is impressive.
GLM-5 hits 77.8 on SWE-bench.
MIT licensed, open = run it locally, modify it, ship it.
→ dub.sh/glm5
#LocalAI
#ClaudeCode #CodingAI #OpenSource #GLM5 #ZhipuAI #LLM #AITools #SWEbench #MoE #AIAgents #DevTools #OpenWeights #CodeWithAI #AIEngineering #MachineLearning
The AGI frenzy is in full swing.
A quick reality check on GLM-5 (dub.sh/glm5): it hit 77.8 on SWE-bench. Solid score, no question.
That said, it’s self-reported on a test literally built for models to pass.
#AGI #GLM5 #SWEbench #AIHype #AICoding
Key lesson I wish I'd applied upfront:
Pivot points (e.g., 'if <10% gain after X effort, reassess').
Running smoke tests on tiny subsets.
would've let me pivot to fresher evals (Live, Pro, etc)
#AIagents #SWEbench #OpenSource #LessonsLearned"
Anthropic just dropped Claude Sonnet 4.6, crushing SWE‑bench at 79.6% while costing only a fifth of Opus. If you’re into AI‑powered coding or scaling enterprise dev, this is a game‑changer. Dive in to see the numbers! #AIcoding #SWEbench #Anthropic
🔗 aidailypost.com/news/anthrop...
GPT‑5.2 Thinking is the new collaborative AI that can code, reason and ship full‑stack web apps end‑to‑end. See how it tackles SWE‑Bench with long‑context and agentic workflows. Dive in! #GPT52Thinking #SWEbench #FullStackAI
🔗 aidailypost.com/news/gpt-52-...
"밀릴 수 없다!" '코드 레드' 발령한 오픈AI, 한 달 만에 'GPT-5.2' 긴급 공개!
#오픈AI 가 구글 #제미나이3 의 맹추격에 최고 수준의 비상 단계를 발령하고, 이전 버전 출시 불과 한 달 만에 'GPT-5.2'를 선보였습니다.
샘 알트만 CEO의 코드 레드가 제대로 통할까?
❓ 이 AI 경쟁 구도, 여러분은 누가 우위를 차지할 것이라고 보시나요?
www.aipostkorea.com/news/article...
#GPT5_2 #코드레드 #AI경쟁 #제미나이3 #추론능력 #SWEBench #코딩AI
Claude Opus 4.5 just crushed the SWE‑bench, topping 7 of 8 languages and beating Sonnet 4.5 by 15%. From Java to Python, it’s the new multilingual coding champ. Dive into the details! #ClaudeOpus45 #SWEbench #AIcoding
🔗 aidailypost.com/news/claude-...
winbuzzer.com/2025/11/24/a...
Anthropic Launches Claude Opus 4.5 with 80.9% SWE-bench Score and 66% Price Drop
#AI #Anthropic #Claude #GenerativeAI #LLM #AgenticAI #AICoding #SoftwareDevelopment #AIModels #Opus45 #SWEbench #AIEfficiency #Developers
Moonshot AI just smashed the SWE‑Bench leaderboard—Kimi K2 Thinking hit 71.3%, outpacing GPT‑5, Claude Sonnet 4.5 and Deepseek‑V3.2. Curious how it handles HTML & React? Dive into the details! #MoonshotAI #KimiK2 #SWEbench
🔗 aidailypost.com/news/moonsho...
Anthropic Haiku 4.5: Lightweight AI Matches Rivals, Cuts Cost, Doubles Speed
#Anthropic #Haiku4.5 #lightweightAImodel #SWEBench #TerminalBench
🤖 Autonomous coding agents hit 77.2% on SWE-bench, showing real progress. Local models now tackle real RAG tasks on consumer hardware, but rising trust risks mean oversight matters more than ever.
aiconnectnews.com/en/2025/10/agentic-codin... #agentic #swebench
logicstar.ai/blog/how-we-...
Our awesome team made SWE-Bench 50x smaller reaching new efficiency heights in evaluating coding agents against it making it faster and easier to measure, improve and iterate 💪🏻
#logicstarai #swebench
SPICE Introduces Automated Labeling for SWE‑Bench Datasets
The SPICE pipeline now auto‑labels SWE‑Bench data, cutting the cost of labeling 1,000 instances from ~$100,000 to $5.10. It also provides the SPICE Bench set with 6,802 labeled cases from 291 projects. getnews.me/spice-introduces-automat... #spice #swebench
The core issue: SWE-bench models might not genuinely solve problems if they can peek at solutions in Git history. The SWE-bench team issued a fix, but skepticism remains regarding the flaw's full impact & initial transparency. #SWEbench 2/6
Qodo Command Enters AI Coding Agent Wars With 71.2% SWE-Bench Score
#AI #SWEbench #Qodo #OpenAI #Anthropic #GPT5 #Coding
winbuzzer.com/2025/08/12/q...
🚨 Big news in open-source AI!
🔥 Refact.ai is now the #1 open-source AI Agent on SWE-bench Verified, setting a new standard for AI-assisted software development.
👉 Read the full story: refact.ai/blog/2025/op...
#AI #OpenSource #SWEbench #DevTools #RefactAI #LLM #AIAgent #OpenAI
📢 Don't overlook this in the wave of releases! #MistralAI has a new coding LLM: it's #Devstral, an open model perfect for on-prem, private and local deployments 🐈
📰 Have a look at the announcement: mistral.ai/news/devstral
#MistralAI #GenAI #LLMs #SWEBench
🧠 Another flagship model released! @anthropic.com just unveiled Claude Opus 4 and Claude Sonnet 4, and they are at the top of the leaderboard for coding 💻
📰 Check out the announcement: www.anthropic.com/news/claude-4
#GenAI #LLMs #Claude #Claude4 #SweBench
#Devstral: New #opensource Model for Coding Agents by #MistralAI & #AllHandsAI 🧠
• 🏆 #Devstral achieves 46.8% on #SWEBench Verified, outperforming previous #opensource models by over 6% points and surpassing #GPT4 mini by 20%
🧵👇#AI #coding
One of the most fun moments at NeurIPS 2024 was the announcement of the Konwinsky Prize. This little clip shows D. Sculley and Chris Welty sharing a million dollar moment. www.kaggle.com/competitions...
youtube.com/shorts/J_OHx...
#swebench #kaggle #oneMILLIONdollars #neurIPS
SWE Bench exists for benchmarking AI based coding ability. Why don't we have the same for SREs?
For that matter why aren't there any good debugging test benches - even for humans?
#ai #sre #benchmarking #swebench
🤖 #Claude35Sonnet now available in #GitHubCopilot! Top performer on #SWEbench with best-in-class #Python accuracy (93.7%). Features: code writing, debugging, test generation & contextual explanations. Rolling out to 100M+ #developers
www.anthropic.com/news/github-...
How do AI software engineering agents work?🤔🤖
Find the answer, along with valuable insights from the creators of SWE-bench & SWE-agent, in this article⬇️
newsletter.pragmaticengineer.com/p/ai-coding-...
Great read! 👏 @gergely.pragmaticengineer.com @hejelin.bsky.social
#AI #SWEbench #SWEagent