🚨 New Paper! 🚨 How do reasoning LLMs handle inferences that have no deterministic answer? We find that they diverge from humans in some significant ways, and fail to reflect human uncertainty… 🧵 (1/10)
Our team is hiring an intern to work on discrete diffusion for text and/or code. Please apply!
📄 New paper from our team at @servicenowresearch.bsky.social!
StarFlow: Generating Structured Workflow Outputs From Sketch Images
We use VLMs to turn hand-drawn sketches and diagrams into executable workflows 🏋️‍♂️
📄 arxiv.org/abs/2503.218...
🔗 tinyurl.com/3utdbn97%E2%...
#Sketch2Flow #AI #VLM
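For context, an "executable workflow" here means structured output that a runtime can step through. A toy sketch of the idea follows; the JSON schema, the `OPS` registry, and `run_workflow` are illustrative assumptions for this post, not StarFlow's actual output format:

```python
import json

# Hypothetical workflow spec: an ordered list of named steps with arguments.
workflow_json = """
{
  "steps": [
    {"op": "extract_fields", "args": {"source": "form"}},
    {"op": "route_approval", "args": {"to": "manager"}}
  ]
}
"""

# Toy registry mapping op names to functions (illustration only).
OPS = {
    "extract_fields": lambda args, state: state + [f"extracted from {args['source']}"],
    "route_approval": lambda args, state: state + [f"routed to {args['to']}"],
}

def run_workflow(spec):
    """Execute each step in order, threading a simple state list through."""
    state = []
    for step in json.loads(spec)["steps"]:
        state = OPS[step["op"]](step["args"], state)
    return state

print(run_workflow(workflow_json))
# ['extracted from form', 'routed to manager']
```

The point of emitting structured JSON rather than free text is exactly this: a downstream executor can dispatch on `op` names without any further parsing.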
🎉 Excited to share that UI-Vision has been accepted at ICML 2025! 🎉
We have also released the UI-Vision grounding datasets. Test your agents on it now! 🚀
🤗 Dataset: huggingface.co/datasets/Ser...
#ICML2025 #AI #DatasetRelease #Agents
Very excited to announce our GUI benchmarking dataset UI-Vision: uivision.github.io
Our evals reveal that current GUI models struggle to ground small elements and dense UIs, and have limited domain, spatial, and motion understanding.
Watch this space for more exciting stuff from us!
Web agents powered by LLMs can solve complex tasks, but our analysis shows that they can also be easily misused to automate harmful tasks.
See the thread below for more details on our new web agent safety benchmark: SafeArena and Agent Risk Assessment framework (ARIA).
📢 New Paper Alert! 📢
Human alignment balances social expectations, economic incentives, and legal frameworks. What if LLM alignment worked the same way? 🤔
Our latest work explores how social, economic, and contractual alignment can address incomplete contracts in LLM alignment 🧵
🚨 Excited to introduce PairBench! 🚨
💡 TL;DR: VLM judges can fail at data comparison!
PairBench helps you pick the right judge by testing alignment, symmetry, smoothness & controllability, ensuring reliable auto-evaluation.
📄 Paper: arxiv.org/abs/2502.15210
🧵 Thread: 👇
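One of the properties named above, symmetry, is easy to picture: a reliable judge should score a pair the same regardless of argument order. A minimal sketch of such a check follows; the `judge` interface and `toy_judge` stand-in are hypothetical illustrations, not the PairBench API:

```python
def symmetry_gap(judge, pairs):
    """Average |judge(a, b) - judge(b, a)| over sample pairs.

    A large gap signals order bias in the judge: it prefers
    whichever item appears in a particular argument slot.
    """
    gaps = [abs(judge(a, b) - judge(b, a)) for a, b in pairs]
    return sum(gaps) / len(gaps)

# Toy stand-in judge: word-overlap similarity (illustration only).
def toy_judge(a, b):
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

pairs = [("a red cat", "a red dog"), ("blue sky", "sky blue")]
print(symmetry_gap(toy_judge, pairs))  # 0.0: this toy judge is symmetric by construction
```

A real VLM judge has no such guarantee, which is why measuring the gap empirically, as PairBench does for its judges, matters.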
We're really excited to release this large collaborative work unifying web agent benchmarks under the same roof.
In this TMLR paper, we dive in depth into #BrowserGym and #AgentLab. We also present some unexpected performance results from Claude 3.5 Sonnet.
If you want to know all about the exciting stuff we do with web agents at @servicenowresearch.bsky.social, register here and interact with our team, including the amazing @alex-lacoste.bsky.social and @adrouinenv.bsky.social.
We would be delighted to come and see you ;)
Me :)
Thrilled to launch BigDocs, an open multimodal dataset set to transform document understanding! It is our contribution to the VLM community, supporting transparency in multimodal document reasoning. Proud to work with the most passionate and amazing team @servicenowresearch.bsky.social!
AgentLab diagram. The image describes AgentLab, a framework for efficient parallel experiments with agents. It highlights: Core Agent Features: Dynamic Prompting and a Unified LLM API for interacting with large language models. BrowserGym Platform: A tool for testing agents on benchmarks like WebArena, WorkArena, MiniWoB, and others. Key Features: Reproducibility, a Unified Leaderboard, an analysis tool called Xray, and a Dataset for sharing agent traces. Blue elements represent AgentLab components.
🧵 (1)
We are thrilled to release #AgentLab, a new open-source package for developing and evaluating web agents. This builds on the new #BrowserGym package, which supports 10 different benchmarks, including #WebArena.
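Evaluating an agent in an environment like the ones BrowserGym wraps boils down to a gym-style observe/act loop. A self-contained sketch of that pattern with stub classes is below; the class names and method signatures are illustrative assumptions for this post, not the actual BrowserGym or AgentLab API:

```python
class StubEnv:
    """Tiny stand-in for a web environment: the task finishes after 3 steps."""
    def reset(self):
        self.step_count = 0
        return {"step": 0}  # initial observation
    def step(self, action):
        self.step_count += 1
        obs = {"step": self.step_count}
        done = self.step_count >= 3
        reward = 1.0 if done else 0.0  # reward only on task completion
        return obs, reward, done

class StubAgent:
    """Stand-in for an LLM-backed agent: maps an observation to an action."""
    def act(self, obs):
        return f"click(next)  # after step {obs['step']}"

def run_episode(env, agent, max_steps=10):
    """Standard evaluation loop: observe, act, accumulate reward until done."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        obs, reward, done = env.step(agent.act(obs))
        total += reward
        if done:
            break
    return total

print(run_episode(StubEnv(), StubAgent()))  # 1.0: the stub task was completed
```

Benchmarks then differ mainly in what the environment returns as observations (DOM, screenshot, accessibility tree) and how reward is computed; the loop itself stays the same, which is what lets one harness cover many benchmarks.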