What is your guess? Why is GPT-5 shining so much on WorkArena in contrast to other benchmarks?
Trust me, this is the last time we're making a benchmark without a hidden test set.
Is #WorkArena on the verge of being solved? Or did GPT-5 just get trained on it?
While some benchmarks show modest gains, GPT-5 is crushing WorkArena L2:
* 69.4% avg success vs. ~40% for the next best
* Complex tasks, up to 100 steps, 5–20 min for humans
Huge thanks to the team:
Muhammad Sohail Danish, Muhammad Akhtar Munir, Syed Roshaan Ali Shah, Kartik Kuckreja, Fahad Khan, Paolo Fraccaro, Alexandre Lacoste, Salman Khan
Follow for updates!
#ICCV2025 #VLMs #AI4EO #RemoteSensing #GeospatialAI #MachineLearning #Benchmarking
Resources:
* Paper: arxiv.org/pdf/2411.19325
* Website: the-ai-alliance.github.io/GEO-Bench-VLM
* Code: github.com/The-AI-Allia...
* Dataset: huggingface.co/datasets/aia...
What we found:
* GPT-4o crushes it on classification
* LLaVA-OneVision is best at counting
* EarthDial leads in event detection
BUT…
Most VLMs fail on:
* Temporal reasoning
* Non-optical imagery
* Dense object scenes
We built GEOBench-VLM:
* A task-diverse benchmark for geospatial VLM performance
* 31 fine-grained tasks
* 8 categories: scene understanding, classification, localization, counting, events, captions, segmentation, and more!
Why GEOBench-VLM?
VLMs like GPT-4o & LLaVA have wowed us on general vision tasks.
But how do they perform on geospatial challenges like satellite imagery, temporal reasoning, or dense object scenes?
Turns out… we didn't really know. Until now.
New benchmark drop!
[#ICCV2025] Our paper "GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks" is accepted at ICCV 2025 in Honolulu, Hawaii!
Let's dive into what makes it exciting:
Interested in learning more about LLM agents and in contributing to this topic?
We're thrilled to announce REALM: the first Workshop for Research on Agent Language Models, at #ACL2025NLP in Vienna!
We have an exciting lineup of speakers
Submit your work by *March 1st*
@aclmeeting.bsky.social
Got ideas to share and want to learn about the latest progress?
Consider submitting your work! https://realm-workshop.github.io
Organizers:
@shikharmurty.bsky.social @ehsk0.bsky.social @xhluca.bsky.social @alex-lacoste.bsky.social @hanna-nlp.bsky.social @gneubig.bsky.social
Just found this cool blogpost discussing #AgentLab, #BrowserGym and #TapeAgent
medium.com/@carolynduby...
Notable findings:
* Claude-3.5-Sonnet is insanely good on WorkArena L2
* WorkArena L3 is insanely hard
* o1-mini is quite good across many benchmarks
* o1 is very expensive :)
See the leaderboard:
huggingface.co/spaces/Servi...
Read our paper:
https://arxiv.org/abs/2412.05467
Or check out our open-source tools:
https://github.com/ServiceNow/AgentLab
https://github.com/ServiceNow/BrowserGym
https://github.com/ServiceNow/WorkArena
We're really excited to release this large collaborative work unifying web agent benchmarks under the same roof.
In this TMLR paper, we take an in-depth look at #BrowserGym and #AgentLab. We also report some unexpected results from Claude 3.5-Sonnet.
Join us for a co-hosted Happy Hour at NeurIPS 2024 with ServiceNow and IMean.ai as we explore the cutting edge of WebAgent development!
Date: Dec 13th, 6:00pm PST
Location: 15-min walk from NeurIPS (see details after RSVP)
RSVP here: lu.ma/rw9x9vc6
Very excited to see this work coming out from @servicenowresearch.bsky.social. Can't wait to test a trained model in #AgentLab
Excited to introduce BigDocs!
An open, transparent multimodal dataset designed for:
* Documents
* Web content
* GUI understanding
* Code generation from images
We're also launching BigDocs-Bench:
* Document, web, and GUI visual reasoning
* Converting images into JSON, Markdown, LaTeX, SVG, and more!
Awesome Starter Pack. Thanks @xhluca.bsky.social
Analyse your agent's behavior using AgentLab-XRay, a custom UI allowing you to navigate all your experiments.
Seamless integration with 10 different web agent benchmarks provided by BrowserGym
github.com/ServiceNow/B...
AgentLab: github.com/ServiceNow/AgentLab/
* Easy large-scale parallel agent experiments (rough sketch below)
* Building blocks for crafting agents on top of BrowserGym
* Unified LLM API for seamless integration
* Reproducibility features for consistent results
* Unified leaderboard across multiple benchmarks
AgentLab diagram: the image shows AgentLab, a framework for efficient parallel experiments with agents. It highlights core agent features (dynamic prompting and a unified LLM API for interacting with large language models), the BrowserGym platform for testing agents on benchmarks like WebArena, WorkArena, and MiniWoB, and key features such as reproducibility, a unified leaderboard, the XRay analysis tool, and a dataset for sharing agent traces. Blue elements represent AgentLab components.
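To make the parallel-experiments point concrete, here is a rough sketch of what launching a study can look like. The module paths, the `make_study` helper, the `AGENT_4o_MINI` preset, and the benchmark string are recalled from the AgentLab README and should be treated as assumptions; check the repo for the current API.

```python
# Hedged sketch of launching a parallel AgentLab study.
# The names below (make_study, AGENT_4o_MINI, "miniwob") are assumptions
# based on the README at github.com/ServiceNow/AgentLab -- verify before use.
from agentlab.agents.generic_agent import AGENT_4o_MINI  # a pre-configured baseline agent
from agentlab.experiments.study import make_study

# Pick a BrowserGym benchmark and one or more agent configurations to compare.
study = make_study(
    benchmark="miniwob",        # e.g. miniwob, workarena_l1, webarena, ...
    agent_args=[AGENT_4o_MINI],
    comment="baseline run",
)

# Run the tasks in parallel; results are written to a local experiment
# directory that the AgentLab-XRay UI (the `agentlab-xray` command) can browse.
study.run(n_jobs=5)
```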
1/ We are thrilled to release #AgentLab, a new open-source package for developing and evaluating web agents. It builds on the new #BrowserGym package, which supports 10 different benchmarks, including #WebArena.
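Since BrowserGym exposes each benchmark through the standard Gymnasium interface, driving a task looks roughly like the sketch below. The environment id, the benchmark-specific subpackage import, and the `noop()` placeholder action are assumptions from my reading of the BrowserGym README, so double-check them against the repo.

```python
# Hedged sketch of stepping through a BrowserGym task via the Gymnasium API.
# The env id ("browsergym/miniwob.click-test"), the subpackage import, and the
# noop() action string are assumptions -- see github.com/ServiceNow/BrowserGym.
import gymnasium as gym
import browsergym.miniwob  # registers the MiniWoB tasks as gym environments
                           # (MiniWoB itself needs local setup; see the repo docs)

env = gym.make("browsergym/miniwob.click-test")
obs, info = env.reset()

done = False
while not done:
    # A real agent would map `obs` (DOM, AXTree, screenshot, ...) to an action
    # string from BrowserGym's action set; here we just send a placeholder.
    action = "noop()"
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

env.close()
```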