What is your guess? Why is GPT-5 shining so much on WorkArena in contrast to other benchmarks?
Trust me, this is the last time we're making a benchmark without a hidden test set.
Is #WorkArena on the verge of being solved? Or did GPT-5 just get trained on it?
While some benchmarks show modest gains, GPT-5 is crushing WorkArena L2:
* 69.4% avg success vs. ~40% for the next best
* Complex tasks, up to 100 steps, 5–20 min for humans
Huge thanks to the team:
Muhammad Sohail Danish, Muhammad Akhtar Munir, Syed Roshaan Ali Shah, Kartik Kuckreja, Fahad Khan, Paolo Fraccaro, Alexandre Lacoste, Salman Khan
Follow for updates!
#ICCV2025 #VLMs #AI4EO #RemoteSensing #GeospatialAI #MachineLearning #Benchmarking
Resources:
* Paper: arxiv.org/pdf/2411.19325
* Website: the-ai-alliance.github.io/GEO-Bench-VLM
* Code: github.com/The-AI-Allia...
* Dataset: huggingface.co/datasets/aia...
What we found:
* GPT-4o crushes it on classification
* LLaVA-OneVision is best at counting
* EarthDial leads in event detection
BUT…
Most VLMs fail on:
* Temporal reasoning
* Non-optical imagery
* Dense object scenes
We built GEOBench-VLM:
* A task-diverse benchmark for geospatial VLM performance
* 31 fine-grained tasks
* 8 categories: scene understanding, classification, localization, counting, events, captions, segmentation, and more!
Why GEOBench-VLM?
VLMs like GPT-4o & LLaVA have wowed us on general vision tasks.
But how do they perform on geospatial challenges like satellite imagery, temporal reasoning, or dense object scenes?
Turns out… we didn't really know. Until now.
New benchmark drop!
[#ICCV2025] Our paper "GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks" is accepted at ICCV 2025 in Honolulu, Hawaii!
Let's dive into what makes it exciting:
Interested in learning more about LLM agents and in contributing to this topic?
We're thrilled to announce REALM: the first Workshop for Research on Agent Language Models, at #ACL2025NLP in Vienna!
We have an exciting lineup of speakers
Submit your work by *March 1st*
@aclmeeting.bsky.social
Got ideas to share and want to learn about the latest progress?
Consider submitting your work! https://realm-workshop.github.io
Organizers:
@shikharmurty.bsky.social @ehsk0.bsky.social @xhluca.bsky.social @alex-lacoste.bsky.social @hanna-nlp.bsky.social @gneubig.bsky.social
Just found this cool blogpost discussing #AgentLab, #BrowserGym and #TapeAgent
medium.com/@carolynduby...
Notable findings:
* Claude-3.5-Sonnet is insanely good on WorkArena L2
* WorkArena L3 is insanely hard
* o1-mini is quite good across many benchmarks
* o1 is very expensive :)
See the leaderboard:
huggingface.co/spaces/Servi...
Read our paper:
https://arxiv.org/abs/2412.05467
Or check out our open-source tools:
https://github.com/ServiceNow/AgentLab
https://github.com/ServiceNow/BrowserGym
https://github.com/ServiceNow/WorkArena
We're really excited to release this large collaborative work unifying web agent benchmarks under the same roof.
In this TMLR paper, we take an in-depth look at #BrowserGym and #AgentLab. We also report some unexpected results from Claude 3.5-Sonnet.
Join us for a co-hosted Happy Hour at NeurIPS 2024 with ServiceNow and IMean.ai as we explore the cutting edge of WebAgent development!
Date: Dec 13th, 6:00pm PST
Location: 15-min walk from NeurIPS (see details after RSVP)
RSVP here: lu.ma/rw9x9vc6
Very excited to see this work coming out from @servicenowresearch.bsky.social. Can't wait to test a trained model in #AgentLab
Excited to introduce BigDocs!
An open, transparent multimodal dataset designed for:
* Documents
* Web content
* GUI understanding
* Code generation from images
We're also launching BigDocs-Bench:
* Document, web, and GUI visual reasoning
* Converting images into JSON, Markdown, LaTeX, SVG, and more!
Awesome Starter Pack. Thanks @xhluca.bsky.social
Analyse your agent's behavior using AgentLab-XRay, a custom UI allowing you to navigate all your experiments.
Seamless integration with 10 different web agent benchmarks provided by BrowserGym
github.com/ServiceNow/B...
AgentLab: github.com/ServiceNow/AgentLab/
* Easy large-scale parallel agent experiments (rough sketch below)
* Building blocks for crafting agents on top of BrowserGym
* Unified LLM API for seamless integration
* Reproducibility features for consistent results
* Unified leaderboard across multiple benchmarks
AgentLab diagram: the image shows AgentLab, a framework for efficient parallel experiments with agents. It highlights core agent features (dynamic prompting and a unified LLM API for interacting with large language models), the BrowserGym platform for testing agents on benchmarks like WebArena, WorkArena, and MiniWoB, and key features such as reproducibility, a unified leaderboard, the XRay analysis tool, and a dataset for sharing agent traces. Blue elements represent AgentLab components.
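To make the parallel-experiments point concrete, here is a rough sketch of what launching a study can look like. The module paths, the `make_study` helper, the `AGENT_4o_MINI` preset, and the benchmark string are recalled from the AgentLab README and should be treated as assumptions; check the repo for the current API.

```python
# Hedged sketch of launching a parallel AgentLab study.
# The names below (make_study, AGENT_4o_MINI, "miniwob") are assumptions
# based on the README at github.com/ServiceNow/AgentLab -- verify before use.
from agentlab.agents.generic_agent import AGENT_4o_MINI  # a pre-configured baseline agent
from agentlab.experiments.study import make_study

# Pick a BrowserGym benchmark and one or more agent configurations to compare.
study = make_study(
    benchmark="miniwob",        # e.g. miniwob, workarena_l1, webarena, ...
    agent_args=[AGENT_4o_MINI],
    comment="baseline run",
)

# Run the tasks in parallel; results are written to a local experiment
# directory that the AgentLab-XRay UI (the `agentlab-xray` command) can browse.
study.run(n_jobs=5)
```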
1/ We are thrilled to release #AgentLab, a new open-source package for developing and evaluating web agents. It builds on the new #BrowserGym package, which supports 10 different benchmarks, including #WebArena.
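Since BrowserGym exposes each benchmark through the standard Gymnasium interface, driving a task looks roughly like the sketch below. The environment id, the benchmark-specific subpackage import, and the `noop()` placeholder action are assumptions from my reading of the BrowserGym README, so double-check them against the repo.

```python
# Hedged sketch of stepping through a BrowserGym task via the Gymnasium API.
# The env id ("browsergym/miniwob.click-test"), the subpackage import, and the
# noop() action string are assumptions -- see github.com/ServiceNow/BrowserGym.
import gymnasium as gym
import browsergym.miniwob  # registers the MiniWoB tasks as gym environments
                           # (MiniWoB itself needs local setup; see the repo docs)

env = gym.make("browsergym/miniwob.click-test")
obs, info = env.reset()

done = False
while not done:
    # A real agent would map `obs` (DOM, AXTree, screenshot, ...) to an action
    # string from BrowserGym's action set; here we just send a placeholder.
    action = "noop()"
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

env.close()
```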