#browsecomp

5 days ago

How Anthropic's Claude Opus 4.6 Broke Its Own AI Benchmark

Anthropic has revealed that Claude Opus 4.6 identified the BrowseComp benchmark and decrypted its answer key, raising serious concerns about AI evaluation integrity.

winbuzzer.com/2026/03/10/a...

#AI #Anthropic #LLMs #Claude #ClaudeOpus46 #AISafety #AIBenchmarks #AIResearch #MachineLearning #BrowseComp

ToxSec

@toxsec.bsky.social

3 weeks ago

#Gemini 3.1 is here.

another day another #benchmark drop.

stats look pretty good honestly.

look at that #ARC-AGI-2 jump!

#BrowseComp also through the roof, so it should have a really good agentic search function.

AI Daily Post

@aidailypost.com

3 months ago

Gemini’s Deep Research agent just topped Humanity’s Last Exam (HLE) and DeepSearchQA, and leads on BrowseComp. Curious how it stacks up against Google Search and NotebookLM? Dive into the benchmark details! #GeminiDeepResearch #DeepSearchQA #BrowseComp

🔗 aidailypost.com/news/gemini-...

GetNews.me

@getnews-me.bsky.social

5 months ago

WebSailor-V2 Boosts Open-Source AI Agents with Synthetic Data and RL

WebSailor-V2 narrows the open‑source LLM agent gap with synthetic tasks, RFT fine‑tuning and DUPO reinforcement learning, matching proprietary performance on BrowseComp. Read more: getnews.me/websailor-v2-boosts-open... #websailorv2 #browsecomp

キタきつね

@kitafox.bsky.social

11 months ago

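OpenAI announces "BrowseComp," a high-difficulty benchmark that measures AI web search ability #Gigazine (Apr 11)

"BrowseComp," announced by OpenAI, is a high-difficulty benchmark designed to measure the web search and information-synthesis abilities of advanced AI agents. It demands flexible information seeking that goes beyond simple search. #ChatGPTArticleSummary

#AIBenchmark #BrowseComp #OpenAI #SearchAI #InformationSeeking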

gigazine.net/news/2025041...