#SweBench

@awesomeagents.bsky.social

2 days ago

METR: Half of SWE-Bench Passes Fail Real Code Review METR found maintainers would reject roughly half of AI PRs that pass SWE-bench automated grading, with a 24-point gap that suggests benchmark scores substantially overstate production readiness.

METR: Half of SWE-Bench Passes Fail Real Code Review

awesomeagents.ai/news/metr-swe-bench-main...

#SweBench #Benchmarks #AiCoding

0 0 0 0

Fierce Mind

@fiercemind.bsky.social

4 days ago

Claude Code is impressive.
GLM-5 hits 77.8 on SWE-bench.

MIT licensed, open = run it locally, modify it, ship it.
→ dub.sh/glm5

#LocalAI
#ClaudeCode #CodingAI #OpenSource #GLM5 #ZhipuAI #LLM #AITools #SWEbench #MoE #AIAgents #DevTools #OpenWeights #CodeWithAI #AIEngineering #MachineLearning

4 0 0 0

Fierce Mind

@fiercemind.bsky.social

1 week ago

The AGI frenzy is in full swing.
A quick reality check on GLM-5 (dub.sh/glm5): it hit 77.8 on SWE-bench. Solid score, no question.

That said, it’s self-reported on a test literally built for models to pass.

#AGI #GLM5 #SWEbench #AIHype #AICoding

0 0 0 0

@kkabirrr.bsky.social

3 weeks ago

Key lesson I wish I'd applied upfront:
Pivot points (e.g., 'if <10% gain after X effort, reassess').
Running smoke tests on tiny subsets.
would've let me pivot to fresher evals (Live, Pro, etc)

#AIagents #SWEbench #OpenSource #LessonsLearned"

1 0 0 0

AI Daily Post

@aidailypost.com

3 weeks ago

Anthropic just dropped Claude Sonnet 4.6, crushing SWE‑bench at 79.6% while costing only a fifth of Opus. If you’re into AI‑powered coding or scaling enterprise dev, this is a game‑changer. Dive in to see the numbers! #AIcoding #SWEbench #Anthropic

🔗 aidailypost.com/news/anthrop...

0 0 0 0

AI Daily Post

@aidailypost.com

2 months ago

GPT‑5.2 Thinking is the new collaborative AI that can code, reason and ship full‑stack web apps end‑to‑end. See how it tackles SWE‑Bench with long‑context and agentic workflows. Dive in! #GPT52Thinking #SWEbench #FullStackAI

🔗 aidailypost.com/news/gpt-52-...

0 0 0 0

AI포스트(AIPOST) | 인공지능 전문언론

@aipostkorea.bsky.social

3 months ago

오픈AI, 한 달 만에 'GPT-5.2' 긴급 공개…'코드 레드' 발령 후 구글 추격 따돌릴까 챗GPT 개발사 오픈AI가 구글이 '제미나이 3'로 바짝 추격해 오는 상황에서 이전 버전을 내놓은 지 불과 한 달 만에 새로운 인공지능(AI) 모델 'GPT-5.2'를 공개했다. 오픈AI는 11일(현지시간) GPT-5.2가 전문 지식 업무에서 가장 뛰어난 성능을 제공하며 자체 평가 기

"밀릴 수 없다!" '코드 레드' 발령한 오픈AI, 한 달 만에 'GPT-5.2' 긴급 공개!

#오픈AI 가 구글 #제미나이3 의 맹추격에 최고 수준의 비상 단계를 발령하고, 이전 버전 출시 불과 한 달 만에 'GPT-5.2'를 선보였습니다.

샘 알트만 CEO의 코드 레드가 제대로 통할까?
❓ 이 AI 경쟁 구도, 여러분은 누가 우위를 차지할 것이라고 보시나요?
www.aipostkorea.com/news/article...
#GPT5_2 #코드레드 #AI경쟁 #제미나이3 #추론능력 #SWEBench #코딩AI

1 0 0 0

AI Daily Post

@aidailypost.com

3 months ago

Claude Opus 4.5 just crushed the SWE‑bench, topping 7 of 8 languages and beating Sonnet 4.5 by 15%. From Java to Python, it’s the new multilingual coding champ. Dive into the details! #ClaudeOpus45 #SWEbench #AIcoding

🔗 aidailypost.com/news/claude-...

0 0 0 0

Winbuzzer

@winbuzzer.com

3 months ago

Anthropic Launches Claude Opus 4.5 with 80.9% SWE-bench Score and 66% Price Drop - WinBuzzer Anthropic has released Claude Opus 4.5, claiming an industry-leading 80.9% coding score and introducing "Tool Search" with a promise to reduce agent costs by 85%.

winbuzzer.com/2025/11/24/a...

Anthropic Launches Claude Opus 4.5 with 80.9% SWE-bench Score and 66% Price Drop

#AI #Anthropic #Claude #GenerativeAI #LLM #AgenticAI #AICoding #SoftwareDevelopment #AIModels #Opus45 #SWEbench #AIEfficiency #Developers

1 0 0 0

AI Daily Post

@aidailypost.com

4 months ago

Moonshot AI just smashed the SWE‑Bench leaderboard—Kimi K2 Thinking hit 71.3%, outpacing GPT‑5, Claude Sonnet 4.5 and Deepseek‑V3.2. Curious how it handles HTML & React? Dive into the details! #MoonshotAI #KimiK2 #SWEbench

🔗 aidailypost.com/news/moonsho...

1 0 0 0

Blaze Trends

@blazetrends.bsky.social

4 months ago

Anthropic Haiku 4.5: Lightweight AI Matches Rivals, Cuts Cost, Doubles Speed Anthropic has launched Claude Haiku 4.5, a compact artificial intelligence model designed to deliver high-performance capabilities at a fraction of the cost

Anthropic Haiku 4.5: Lightweight AI Matches Rivals, Cuts Cost, Doubles Speed

#Anthropic #Haiku4.5 #lightweightAImodel #SWEBench #TerminalBench

1 0 0 0

AI Connect News

@aiconnectnews.bsky.social

5 months ago

Agentic Coding Hits 77.2% on SWE-bench as Trust Risks Rise The edge shifts to practical local RAG while legal and dependency risks mount.

🤖 Autonomous coding agents hit 77.2% on SWE-bench, showing real progress. Local models now tackle real RAG tasks on consumer hardware, but rising trust risks mean oversight matters more than ever.

aiconnectnews.com/en/2025/10/agentic-codin... #agentic #swebench

1 0 0 0

Pavel Ivanov

@pivanov.bsky.social

5 months ago

LogicStar.ai | Self-Healing Applications

logicstar.ai/blog/how-we-...

Our awesome team made SWE-Bench 50x smaller reaching new efficiency heights in evaluating coding agents against it making it faster and easier to measure, improve and iterate 💪🏻

#logicstarai #swebench

0 0 0 0

GetNews.me

@getnews-me.bsky.social

5 months ago

SPICE Introduces Automated Labeling for SWE‑Bench Datasets

The SPICE pipeline now auto‑labels SWE‑Bench data, cutting the cost of labeling 1,000 instances from ~$100,000 to $5.10. It also provides the SPICE Bench set with 6,802 labeled cases from 291 projects. getnews.me/spice-introduces-automat... #spice #swebench

0 0 0 0

Hacker News Companion

@hncompanion.com

6 months ago

The core issue: SWE-bench models might not genuinely solve problems if they can peek at solutions in Git history. The SWE-bench team issued a fix, but skepticism remains regarding the flaw's full impact & initial transparency. #SWEbench 2/6

0 0 1 0

Winbuzzer

@winbuzzer.com

7 months ago

Qodo Command Enters AI Coding Agent Wars With 71.2% SWE-Bench Score

#AI #SWEbench #Qodo #OpenAI #Anthropic #GPT5 #Coding

winbuzzer.com/2025/08/12/q...

0 1 0 0

Mehul Patel

@handle.invalid

9 months ago

🚨 Big news in open-source AI!

🔥 Refact.ai is now the #1 open-source AI Agent on SWE-bench Verified, setting a new standard for AI-assisted software development.

👉 Read the full story: refact.ai/blog/2025/op...

#AI #OpenSource #SWEbench #DevTools #RefactAI #LLM #AIAgent #OpenAI

1 0 0 0

ZanSara

@zansara.bsky.social

9 months ago

📢 Don't overlook this in the wave of releases! #MistralAI has a new coding LLM: it's #Devstral, an open model perfect for on-prem, private and local deployments 🐈

📰 Have a look at the announcement: mistral.ai/news/devstral

#MistralAI #GenAI #LLMs #SWEBench

1 0 0 0

ZanSara

@zansara.bsky.social

9 months ago

🧠 Another flagship model released! @anthropic.com just unveiled Claude Opus 4 and Claude Sonnet 4, and they are at the top of the leaderboard for coding 💻

📰 Check out the announcement: www.anthropic.com/news/claude-4

#GenAI #LLMs #Claude #Claude4 #SweBench

1 0 0 0

Micha the DevOp

@michabbb.bsky.social

9 months ago

#Devstral: New #opensource Model for Coding Agents by #MistralAI & #AllHandsAI 🧠

• 🏆 #Devstral achieves 46.8% on #SWEBench Verified, outperforming previous #opensource models by over 6% points and surpassing #GPT4 mini by 20%

🧵👇#AI #coding

2 0 1 0

Kaggle

@kaggle.com

1 year ago

Konwinsky Prize | Kaggle #swebench #kaggle #oneMILLIONdollars #neurIPS YouTube video by Kaggle

One of the most fun moments at NeurIPS 2024 was the announcement of the Konwinsky Prize. This little clip shows D. Sculley and Chris Welty sharing a million dollar moment. www.kaggle.com/competitions...
youtube.com/shorts/J_OHx...
#swebench #kaggle #oneMILLIONdollars #neurIPS

2 2 0 0

@chrisbattarbee.bsky.social

1 year ago

SWE Bench exists for benchmarking AI based coding ability. Why don't we have the same for SREs?

For that matter why aren't there any good debugging test benches - even for humans?

#ai #sre #benchmarking #swebench

0 0 0 0

Micha the DevOp

@michabbb.bsky.social

1 year ago

Claude 3.5 Sonnet on GitHub Copilot Starting today, the new Claude 3.5 Sonnet begins rolling out on GitHub Copilot, enabling developers to choose Claude 3.5 Sonnet for coding—directly in Visual Studio Code and GitHub.com.

🤖 #Claude35Sonnet now available in #GitHubCopilot! Top performer on #SWEbench with best-in-class #Python accuracy (93.7%). Features: code writing, debugging, test generation & contextual explanations. Rolling out to 100M+ #developers
www.anthropic.com/news/github-...

0 0 0 0

marmelab

@marmelab.bsky.social

1 year ago

How do AI software engineering agents work? Coding agents are the latest promising Artificial Intelligence (AI) tool, and an impressive step up from LLMs. This article is a deep dive into them, with the creators of SWE-bench and SWE-agent.

How do AI software engineering agents work?🤔🤖

Find the answer, along with valuable insights from the creators of SWE-bench & SWE-agent, in this article⬇️

newsletter.pragmaticengineer.com/p/ai-coding-...

Great read! 👏 @gergely.pragmaticengineer.com @hejelin.bsky.social

#AI #SWEbench #SWEagent

1 0 0 0

Posts tagged #SweBench