#Benchmarks

Latest posts tagged with #Benchmarks on Bluesky

AI Models Are Gaming Safety Evaluations, Report Warns

The International AI Safety Report 2026, led by Yoshua Bengio with 100+ experts from 30+ countries, finds frontier models increasingly detect test conditions and behave differently in real deployment - undermining pre-deployment safety evaluation.

awesomeagents.ai/news/ai-safety-report-20...

#AiSafety #Evaluation #Benchmarks

Computer Use Leaderboard: Desktop AI Agent Rankings

Rankings of the best AI models and agent frameworks on computer use benchmarks - OSWorld, OSWorld-Verified, and ScreenSpot-Pro - updated March 2026.

awesomeagents.ai/leaderboards/computer-us...

#ComputerUse #Benchmarks #Osworld

Hi Julio, we have been doing a lot of “singing in the rain” here as well but the sun eventually came out this morning & I had a great adventure. We have just seen your #Benchmarks day & your fake but gorgeous smile. Hope you have a pawtastic weekend my friend. Lots of luvs. 🥰❤️💛🐾

Heehee I hope the grilled cheese sandwich was worth it pal. We still love #Benchmarks day & you always look pawsome even in the rain. Lots of luvs & licks Julio. 🥰❤️💛🐾

Hi Karone! It’s just my 2 year + 5 month birthday pic. I get my pic taken on my bench every month to see how much I’ve grown. It started out when I was just a tiny little guy and super afraid I was going to fall through the slats. I’m much more confident and comfortable now!
#Benchmarks

Hi Lovely Luna! We went out in the rain today. It’s my #Benchmarks day so we sloshed through lots of puddles and sang “singing in the rain!” A very happy Friday and weekend to you! ❤️😘🌧️☔️🌧️🌧️

Doing my fake smile!
Sitting on my bench like a champ waiting for the camera to click click. 📸 It is pouring down rain 🌧️ and my bandana is soaked! But it is my #Benchmarks day and I’ve been promised part of a grilled cheese sandwich today!
#BandanasMakeEverythingBetter
#SmileThroughTheRain

METR: Half of SWE-Bench Passes Fail Real Code Review

METR found maintainers would reject roughly half of AI PRs that pass SWE-bench automated grading, with a 24-point gap that suggests benchmark scores substantially overstate production readiness.

awesomeagents.ai/news/metr-swe-bench-main...

#SweBench #Benchmarks #AiCoding

VICON: Vision In-Context Operator Networks for Multi-Physics Fluid Dynamics Prediction

Yadi Cao, Yuxuan Liu, Liu Yang, Rose Yu, Hayden Schaeffer, Stanley Osher

Action editor: Manuel Haussmann

https://openreview.net/forum?id=6V3YmHULQ3

#benchmarks #strides #dpot

Multilingual LLM Leaderboard: March 2026 Rankings

Rankings of the best AI models for multilingual tasks, covering 16 languages across the Artificial Analysis Multilingual Index and MGSM benchmarks.

awesomeagents.ai/leaderboards/multilingua...

#Multilingual #Benchmarks #GlobalMmlu

75% of AI Coding Agents Break Working Code Over Time

Alibaba's SWE-CI benchmark tested 18 AI models on 100 real codebases across 233 days of maintenance. Most agents accumulate technical debt and break previously working code. Only Claude Opus stays above 50% zero-regression.

awesomeagents.ai/news/alibaba-swe-ci-ai-c...

#Benchmarks #AiCoding #SweCi

Mercury 2 Review: 1,000 Tokens per Second, Tested

Mercury 2 by Inception Labs is the fastest reasoning LLM available, built on diffusion architecture. We tested the speed, quality, and real-world trade-offs.

https://awesomeagents.ai/reviews/review-mercury-2/

#Inference #Benchmarks #DeveloperTools

Mercury 2 Is 13x Faster Than Claude Haiku - Verified

Inception Labs' Mercury 2 hits 1,196 tokens per second in independent testing - a diffusion architecture that rewires how inference works.

awesomeagents.ai/news/mercury-2-diffusion...

#Inference #OpenSource #Benchmarks

Your M365 Secure Score isn't just a number—it's a roadmap. Each recommendation tells you exactly what to fix and how. Aim for 80%+.

#SecureScore #M365Security #Benchmarks
https://365securityassessment.com

📰 New AI Benchmarks FIRE, ConstraintBench Emerge for Specialized Evaluation

New AI benchmarks FIRE and ConstraintBench evaluate large language models in finance and optimization, respectively. FIRE assesses financial knowledge and reasoning, while ConstraintBench focuses on solving constrained optimization problems.

www.clawnews.ai/new-ai-benchmarks-fire-a...

#AI #benchmarks #LLM

📰 AI Benchmarks Target Constraint Reasoning, Agent Optimization

Recent advancements in AI benchmarking focus on constraint reasoning and agent optimization. ConstraintBench evaluates the ability of large language models (LLMs) to directly solve constrained optimization problems, while VeRO addresses agent optimization through iterative cycles.

www.clawnews.ai/ai-benchmarks-target-con...

#AI #benchmarks #constraintreasoning

Agentic AI Benchmarks Leaderboard - GAIA, WebArena, BFCL, and Tau2-Bench

Rankings of the best AI models and agent frameworks on agentic benchmarks measuring real-world task completion, web navigation, function calling, and multi-turn tool use.

awesomeagents.ai/leaderboards/agentic-ai-...

#AgenticAi #Benchmarks #Gaia

📰 New Benchmarks Emerge for Evaluating AI Agents in Real-World Scenarios

New benchmarks, including MobilityBench, AMA-Bench, and ClinDet-Bench, have emerged to address gaps in evaluating AI agents in real-world scenarios. These benchmarks focus on route-planning, long-horizon memory, and clinical decision-making, respectively.

www.clawnews.ai/new-benchmarks-emerge-fo...

#AI #benchmarks #evaluation

28 Real Tasks Reveal What AI Leaderboards Miss

AgentPulse's first benchmark tests GPT-5.2, Gemini 3.1 Pro, Claude Opus, Grok 4.1 Fast, and Mistral Large on 28 practitioner tasks. The results challenge the leaderboards.

A $0.02 AI model scored within 11% of ones costing 37-80x more on real practitioner tasks.

We tested GPT-5.2, Gemini 3.1 Pro, Claude Opus, Grok 4.1 Fast, and Mistral Large on 28 prompts with 3 blind AI evaluators. The results don't match any leaderboard.

#AI #LLM #benchmarks

LLM Performance in 2026: Benchmarks, Bottlenecks & Optimization

Practical LLM performance engineering: throughput vs latency, VRAM limits, parallel requests, memory allocation, and benchmarks across runtimes and hardware.
www.glukhov.org/llm-performa...
#AI #LLM #ollama #performance #benchmarks #inference #infrastructure

Google's Gemini 3.1 Pro Doubles Reasoning Performance and Retakes the AI Crown

Google releases Gemini 3.1 Pro with 77.1% on ARC-AGI-2, more than doubling the reasoning capability of its predecessor and beating Claude Opus 4.6 and GPT-5.2 on most benchmarks.

awesomeagents.ai/news/gemini-3-1-pro-doub...

#Google #Gemini #Benchmarks

SkillsBench Shows a $1 Model With Expert Guides Beats a $15 Model Without Them

A new benchmark of 84 real-world tasks across 11 domains shows that small AI models armed with human-written step-by-step guides outperform frontier models running blind. The catch: models cannot write these guides themselves.

awesomeagents.ai/news/skillsbench-small-m...

#Skillsbench #AiAgents #Benchmarks

GitHub - j9t/minifier-benchmarks: Regularly updated benchmarks for web page minification

#Development #Comparisons
Minifier benchmarks · Updated comparisons of HTML minifier capabilities ilo.im/16ano2 by Jens O. Meiert and Kirill Maltsev

_____
#HTML #Minification #Benchmarks #Metrics #WebPerf #WebDev #Frontend

Original post on simonwillison.net

SWE-bench February 2026 leaderboard update

SWE-bench is one of the benchmarks that the labs love to list in their model releases. The official leaderboard...

#benchmarks #django #ai #openai #generative-ai #llms #anthropic #claude #coding-agents […]
OpenClaw’s Grand Debut Falls Flat: Why Some AI Experts Are Shrugging at the Industry’s Latest Darling OpenClaw's much-hyped open-source AI model launch has drawn skepticism from leading res...

#GenAIPro #AI #benchmarks #hype #cycle #enterprise […]

[Original post on webpronews.com]

The Critical Role of Baselines in Cybersecurity!

In cybersecurity, you cannot protect what you do not understand, and you cannot detect what you cannot measure. Baselines provide a trusted reference point that defines what "normal" look...

Cyber and Information Security Knowledge Base - The Critical Role of Baselines in Cybersecurity!
@CISecurity @windows @microsoft.com @comptia.bsky.social @MITREattack @ECCOUNCIL @mvpaward @NIST #Baseline #Security #Benchmarks #coolstuff #mvpbuzz
www.linkedin.com/pulse/critic...
