Alexandre Lacoste

@alex-lacoste

MegaSenior Research Scientist at ServiceNow Research, Former Google. WebAgents, Remote Sensing, Climate Change, Opinions are my own

130 Followers
175 Following
20 Posts
Joined 03.12.2024

Latest posts by Alexandre Lacoste @alex-lacoste

What is your guess? Why is GPT-5 shining so much on WorkArena in contrast to other benchmarks?

Trust me, this is the last time we're making a benchmark without a hidden test set.

21.08.2025 18:23 👍 0 🔁 0 💬 0 📌 0

🚨 Is #WorkArena on the verge of being solved? Or did GPT-5 just get trained on it?

🔥 While some benchmarks show modest gains, GPT-5 is crushing WorkArena L2 🔥
➡️ 69.4% avg success vs. ~40% for next best 🤯
➡️ Complex tasks, up to 100 steps, 5–20 min for humans

21.08.2025 18:23 👍 0 🔁 0 💬 1 📌 0

🙌 Huge thanks to the team:
Muhammad Sohail Danish, Muhammad Akhtar Munir, Syed Roshaan Ali Shah, Kartik Kuckreja, Fahad Khan, Paolo Fraccaro, Alexandre Lacoste, Salman Khan

Follow for updates!
#ICCV2025 #VLMs #AI4EO #RemoteSensing #GeospatialAI #MachineLearning #Benchmarking

02.07.2025 12:47 👍 1 🔁 0 💬 0 📌 0

📎 Resources

📄 Paper: arxiv.org/pdf/2411.19325

🌐 Website: the-ai-alliance.github.io/GEO-Bench-VLM

💻 Code: github.com/The-AI-Allia...

📦 Dataset: huggingface.co/datasets/aia...

02.07.2025 12:47 👍 0 🔁 0 💬 1 📌 0

🔍 What we found:
* GPT-4o crushes it on classification
* LLaVA-OneVision is best at counting
* EarthDial leads in event detection

BUT…
❌ Most VLMs fail on:
* Temporal reasoning
* Non-optical imagery
* Dense object scenes

02.07.2025 12:47 👍 0 🔁 0 💬 1 📌 0

🧪 We built GEOBench-VLM
→ A task-diverse benchmark for geospatial VLM performance
→ 31 fine-grained tasks
→ 8 categories:
scene understanding, classification, localization, counting, events, captions, segmentation, and more!

02.07.2025 12:47 👍 0 🔁 0 💬 1 📌 0

🌍 Why GEOBench-VLM?
VLMs like GPT-4o & LLaVA have wowed us on general vision tasks.
But how do they perform on geospatial challenges like satellite imagery, temporal reasoning, or dense object scenes?
Turns out… we didn’t really know. Until now.

02.07.2025 12:47 👍 0 🔁 0 💬 1 📌 0

🚨 New benchmark drop!
[#ICCV2025] Our paper "GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks" is accepted at ICCV 2025 in Honolulu, Hawaii! 🌺
Let's dive into what makes it exciting: 🧵

02.07.2025 12:47 👍 1 🔁 0 💬 1 📌 0
Post image
26.03.2025 18:50 👍 1 🔁 0 💬 0 📌 0

Interested in knowing more about LLM agents and in contributing to this topic? 🚀

📢 We're thrilled to announce REALM: The first Workshop for Research on Agent Language Models 🤖 #ACL2025NLP in Vienna 🎻
We have an exciting lineup of speakers

🗓️ Submit your work by *March 1st*
@aclmeeting.bsky.social

23.01.2025 14:29 👍 13 🔁 4 💬 1 📌 1

Got ideas to share and want to learn about the latest progress?

Consider submitting your work! 🔗 https://realm-workshop.github.io

Organizers:
@shikharmurty.bsky.social @ehsk0.bsky.social @xhluca.bsky.social @alex-lacoste.bsky.social @hanna-nlp.bsky.social @gneubig.bsky.social

23.01.2025 14:29 👍 1 🔁 1 💬 1 📌 0
How ServiceNow Delivers Production Grade AI Agents
Large Language Model (LLM) assistants such as ChatGPT have taken the world by storm and revolutionized many everyday tasks but Generative AI…

Just found this cool blogpost discussing #AgentLab, #BrowserGym and #TapeAgent

medium.com/@carolynduby...

13.12.2024 16:13 👍 1 🔁 0 💬 0 📌 0

Notable findings:
🏆 Claude-3.5-Sonnet is insanely good on WorkArena L2
🪨 WorkArena L3 is insanely hard
🤖 o1-mini is quite good across many benchmarks
💲 o1 is very expensive :)

See the leaderboard:
huggingface.co/spaces/Servi...

12.12.2024 17:55 👍 4 🔁 1 💬 0 📌 0

Visit our paper
📃 https://arxiv.org/abs/2412.05467
Or our open-source tools:
🤖 https://github.com/ServiceNow/AgentLab
💪 https://github.com/ServiceNow/BrowserGym
🎯 https://github.com/ServiceNow/WorkArena

12.12.2024 17:55 👍 7 🔁 1 💬 1 📌 0

We’re really excited to release this large collaborative work for unifying web agent benchmarks under the same roof.

In this TMLR paper, we dive in-depth into #BrowserGym and #AgentLab. We also present some unexpected performances from Claude 3.5-Sonnet

12.12.2024 17:55 👍 20 🔁 11 💬 1 📌 2

Join us for a co-hosted Happy Hour
NeurIPS 2024
with ServiceNow and IMean.ai
as we explore the cutting edge of WebAgent development!

📅 Date: Dec 13th 6:00pm PST
📍 Location: 15 min walk from NeurIPS, see details after RSVP
🎉 RSVP Here: lu.ma/rw9x9vc6

12.12.2024 16:24 👍 1 🔁 0 💬 0 📌 1

Very excited to see this work coming out from @servicenowresearch.bsky.social. Can't wait to test a trained model in #AgentLab

10.12.2024 22:55 👍 0 🔁 0 💬 0 📌 0

🎉 Excited to introduce BigDocs!
An open, transparent multimodal dataset designed for:
📄 Documents
🌐 Web content
🖥️ GUI understanding
👨‍💻 Code generation from images
We’re also launching BigDocs-Bench:
➡️ Document, Web, GUI Visual reasoning
➡️ Converting images into JSON, Markdown, LaTeX, SVG, and more!

10.12.2024 18:34 👍 16 🔁 8 💬 1 📌 2

Awesome Starter Pack. Thanks @xhluca.bsky.social

06.12.2024 00:41 👍 3 🔁 1 💬 0 📌 0

🔍 Analyse your agent's behavior using AgentLab-XRay, a custom UI allowing you to navigate all your experiments.

03.12.2024 21:02 👍 3 🔁 0 💬 0 📌 0

Seamless integration with 10 different web agent benchmarks provided by BrowserGym
github.com/ServiceNow/B...

03.12.2024 21:02 👍 4 🔁 0 💬 1 📌 0

AgentLab: github.com/ServiceNow/AgentLab/
🚀 Easy large-scale parallel agent experiments
🔧 Building blocks for crafting agents over BrowserGym
🤖 Unified LLM API for seamless integration
🔁 Reproducibility features for consistent results
🏆 Unified Leaderboard across multiple benchmarks

03.12.2024 21:02 👍 4 🔁 0 💬 1 📌 0
AgentLab diagram.

The image describes AgentLab, a framework for efficient parallel experiments with agents. It highlights:

Core Agent Features: Dynamic Prompting and a Unified LLM API for interacting with large language models.
BrowserGym Platform: A tool for testing agents on benchmarks like WebArena, WorkArena, MiniWoB, and others.
Key Features: Reproducibility, a Unified Leaderboard, an analysis tool called Xray, and a Dataset for sharing agent traces.

Blue elements represent AgentLab components.

🧡-1
We are thrilled to release #AgentLab, a new open-source package for developing and evaluating web agents. This builds on the new #BrowserGym package which supports 10 different benchmarks, including #WebArena.

03.12.2024 21:02 👍 18 🔁 15 💬 2 📌 0