
#SweBench

Latest posts tagged with #SweBench on Bluesky


Posts tagged #SweBench

METR: Half of SWE-Bench Passes Fail Real Code Review
METR found maintainers would reject roughly half of AI PRs that pass SWE-bench automated grading, with a 24-point gap that suggests benchmark scores substantially overstate production readiness.

awesomeagents.ai/news/metr-swe-bench-main...

#SweBench #Benchmarks #AiCoding


Claude Code is impressive.
GLM-5 hits 77.8 on SWE-bench.

MIT licensed, open = run it locally, modify it, ship it.
→ dub.sh/glm5

#LocalAI
#ClaudeCode #CodingAI #OpenSource #GLM5 #ZhipuAI #LLM #AITools #SWEbench #MoE #AIAgents #DevTools #OpenWeights #CodeWithAI #AIEngineering #MachineLearning


The AGI frenzy is in full swing.
A quick reality check on GLM-5 (dub.sh/glm5): it hit 77.8 on SWE-bench. Solid score, no question.

That said, it’s self-reported on a test literally built for models to pass.

#AGI #GLM5 #SWEbench #AIHype #AICoding


Key lessons I wish I'd applied upfront:
• Setting pivot points (e.g., 'if <10% gain after X effort, reassess').
• Running smoke tests on tiny subsets.
Both would've let me pivot to fresher evals (Live, Pro, etc.).

#AIagents #SWEbench #OpenSource #LessonsLearned


Anthropic just dropped Claude Sonnet 4.6, crushing SWE‑bench at 79.6% while costing only a fifth of Opus. If you’re into AI‑powered coding or scaling enterprise dev, this is a game‑changer. Dive in to see the numbers! #AIcoding #SWEbench #Anthropic

🔗 aidailypost.com/news/anthrop...


GPT‑5.2 Thinking is the new collaborative AI that can code, reason and ship full‑stack web apps end‑to‑end. See how it tackles SWE‑Bench with long‑context and agentic workflows. Dive in! #GPT52Thinking #SWEbench #FullStackAI

🔗 aidailypost.com/news/gpt-52-...

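OpenAI rushes out 'GPT-5.2' after just one month: can it shake off Google's pursuit after declaring a 'code red'?
ChatGPT developer OpenAI has unveiled a new artificial intelligence (AI) model, 'GPT-5.2', just one month after releasing its previous version, as Google closes in with 'Gemini 3'. On the 11th (local time), OpenAI said GPT-5.2 delivers the best performance on professional knowledge work and, by its own evaluation…

"We can't fall behind!" OpenAI, under a 'code red', rushes out 'GPT-5.2' after just one month!

#OpenAI has declared its highest level of emergency in response to the hot pursuit by Google's #Gemini3, unveiling 'GPT-5.2' just one month after its previous release.

Will CEO Sam Altman's code red really pay off?
❓ Who do you think will come out ahead in this AI race?
www.aipostkorea.com/news/article...
#GPT5_2 #CodeRed #AIRace #Gemini3 #ReasoningAbility #SWEBench #CodingAI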


Claude Opus 4.5 just crushed SWE‑bench, topping 7 of 8 languages and beating Sonnet 4.5 by 15%. From Java to Python, it’s the new multilingual coding champ. Dive into the details! #ClaudeOpus45 #SWEbench #AIcoding

🔗 aidailypost.com/news/claude-...

Anthropic Launches Claude Opus 4.5 with 80.9% SWE-bench Score and 66% Price Drop - WinBuzzer
Anthropic has released Claude Opus 4.5, claiming an industry-leading 80.9% coding score and introducing "Tool Search" with a promise to reduce agent costs by 85%.

winbuzzer.com/2025/11/24/a...

#AI #Anthropic #Claude #GenerativeAI #LLM #AgenticAI #AICoding #SoftwareDevelopment #AIModels #Opus45 #SWEbench #AIEfficiency #Developers


Moonshot AI just smashed the SWE‑Bench leaderboard: Kimi K2 Thinking hit 71.3%, outpacing GPT‑5, Claude Sonnet 4.5 and DeepSeek‑V3.2. Curious how it handles HTML & React? Dive into the details! #MoonshotAI #KimiK2 #SWEbench

🔗 aidailypost.com/news/moonsho...

Anthropic Haiku 4.5: Lightweight AI Matches Rivals, Cuts Cost, Doubles Speed
Anthropic has launched Claude Haiku 4.5, a compact artificial intelligence model designed to deliver high-performance capabilities at a fraction of the cost

#Anthropic #Haiku4.5 #lightweightAImodel #SWEBench #TerminalBench

Agentic Coding Hits 77.2% on SWE-bench as Trust Risks Rise
The edge shifts to practical local RAG while legal and dependency risks mount.

🤖 Autonomous coding agents hit 77.2% on SWE-bench, showing real progress. Local models now tackle real RAG tasks on consumer hardware, but rising trust risks mean oversight matters more than ever.

aiconnectnews.com/en/2025/10/agentic-codin... #agentic #swebench

LogicStar.ai | Self-Healing Applications

logicstar.ai/blog/how-we-...

Our awesome team made SWE-Bench 50x smaller, reaching new heights of efficiency in evaluating coding agents and making it faster and easier to measure, improve, and iterate 💪🏻

#logicstarai #swebench

SPICE Introduces Automated Labeling for SWE‑Bench Datasets


The SPICE pipeline now auto‑labels SWE‑Bench data, cutting the cost of labeling 1,000 instances from ~$100,000 to $5.10. It also provides the SPICE Bench set with 6,802 labeled cases from 291 projects. getnews.me/spice-introduces-automat... #spice #swebench


The core issue: models evaluated on SWE-bench might not genuinely solve problems if they can peek at solutions in Git history. The SWE-bench team issued a fix, but skepticism remains about the flaw's full impact and the initial transparency. #SWEbench 2/6


Qodo Command Enters AI Coding Agent Wars With 71.2% SWE-Bench Score

#AI #SWEbench #Qodo #OpenAI #Anthropic #GPT5 #Coding

winbuzzer.com/2025/08/12/q...


🚨 Big news in open-source AI!

🔥 Refact.ai is now the #1 open-source AI Agent on SWE-bench Verified, setting a new standard for AI-assisted software development.

👉 Read the full story: refact.ai/blog/2025/op...

#AI #OpenSource #SWEbench #DevTools #RefactAI #LLM #AIAgent #OpenAI


📢 Don't overlook this in the wave of releases! #MistralAI has a new coding LLM: it's #Devstral, an open model perfect for on-prem, private and local deployments 🐈

📰 Have a look at the announcement: mistral.ai/news/devstral

#MistralAI #GenAI #LLMs #SWEBench


🧠 Another flagship model released! @anthropic.com just unveiled Claude Opus 4 and Claude Sonnet 4, and they are at the top of the leaderboard for coding 💻

📰 Check out the announcement: www.anthropic.com/news/claude-4

#GenAI #LLMs #Claude #Claude4 #SweBench


#Devstral: New #opensource Model for Coding Agents by #MistralAI & #AllHandsAI 🧠

• 🏆 #Devstral achieves 46.8% on #SWEBench Verified, outperforming previous #opensource models by over 6 percentage points and surpassing #GPT4 mini by 20%

🧵👇#AI #coding

Konwinski Prize | Kaggle #swebench #kaggle #oneMILLIONdollars #neurIPS
YouTube video by Kaggle

One of the most fun moments at NeurIPS 2024 was the announcement of the Konwinski Prize. This short clip shows D. Sculley and Chris Welty sharing a million-dollar moment. www.kaggle.com/competitions...
youtube.com/shorts/J_OHx...
#swebench #kaggle #oneMILLIONdollars #neurIPS


SWE-bench exists for benchmarking AI-based coding ability. Why don't we have the same for SREs?

For that matter, why aren't there any good debugging test benches, even for humans?

#ai #sre #benchmarking #swebench

Claude 3.5 Sonnet on GitHub Copilot
Starting today, the new Claude 3.5 Sonnet begins rolling out on GitHub Copilot, enabling developers to choose Claude 3.5 Sonnet for coding, directly in Visual Studio Code and GitHub.com.

🤖 #Claude35Sonnet now available in #GitHubCopilot! Top performer on #SWEbench with best-in-class #Python accuracy (93.7%). Features: code writing, debugging, test generation & contextual explanations. Rolling out to 100M+ #developers
www.anthropic.com/news/github-...

How do AI software engineering agents work?
Coding agents are the latest promising Artificial Intelligence (AI) tool, and an impressive step up from LLMs. This article is a deep dive into them, with the creators of SWE-bench and SWE-agent.

How do AI software engineering agents work?🤔🤖

Find the answer, along with valuable insights from the creators of SWE-bench & SWE-agent, in this article⬇️

newsletter.pragmaticengineer.com/p/ai-coding-...

Great read! 👏 @gergely.pragmaticengineer.com @hejelin.bsky.social

#AI #SWEbench #SWEagent
