🚨 New Paper! 🚨 How do reasoning LLMs handle inferences that have no deterministic answer? We find that they diverge from humans in some significant ways, and fail to reflect human uncertainty… 🧵 (1/10)
Our team is hiring an intern to work on discrete diffusion for text and/or code. Please apply!
📄 New paper from our team at @servicenowresearch.bsky.social!
StarFlow: Generating Structured Workflow Outputs From Sketch Images
We use VLMs to turn hand-drawn sketches and diagrams into executable workflows 🏋️‍♂️
📄 arxiv.org/abs/2503.218...
🔗 tinyurl.com/3utdbn97%E2%...
#Sketch2Flow #AI #VLM
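For context, an "executable workflow" here means structured output that a runtime can step through. A toy sketch of the idea follows; the JSON schema, the `OPS` registry, and `run_workflow` are illustrative assumptions for this post, not StarFlow's actual output format:

```python
import json

# Hypothetical workflow spec: an ordered list of named steps with arguments.
workflow_json = """
{
  "steps": [
    {"op": "extract_fields", "args": {"source": "form"}},
    {"op": "route_approval", "args": {"to": "manager"}}
  ]
}
"""

# Toy registry mapping op names to functions (illustration only).
OPS = {
    "extract_fields": lambda args, state: state + [f"extracted from {args['source']}"],
    "route_approval": lambda args, state: state + [f"routed to {args['to']}"],
}

def run_workflow(spec):
    """Execute each step in order, threading a simple state list through."""
    state = []
    for step in json.loads(spec)["steps"]:
        state = OPS[step["op"]](step["args"], state)
    return state

print(run_workflow(workflow_json))
# ['extracted from form', 'routed to manager']
```

The point of emitting structured JSON rather than free text is exactly this: a downstream executor can dispatch on `op` names without any further parsing.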
🎉 Excited to share that UI-Vision has been accepted at ICML 2025! 🎉
We have also released the UI-Vision grounding datasets. Test your agents on it now! 🚀
🤗 Dataset: huggingface.co/datasets/Ser...
#ICML2025 #AI #DatasetRelease #Agents
Very excited to announce our GUI benchmarking dataset UI-Vision: uivision.github.io
Our evals reveal that current GUI models struggle to ground small elements and dense UIs, and have limited domain, spatial, and motion understanding.
Watch this space for more exciting stuff from us!
Web agents powered by LLMs can solve complex tasks, but our analysis shows that they can also be easily misused to automate harmful tasks.
See the thread below for more details on our new web agent safety benchmark: SafeArena and Agent Risk Assessment framework (ARIA).
📢 New Paper Alert! 📢
Human alignment balances social expectations, economic incentives, and legal frameworks. What if LLM alignment worked the same way? 🤔
Our latest work explores how social, economic, and contractual alignment can address incomplete contracts in LLM alignment 🧵
🚨 Excited to introduce PairBench! 🚨
💡 TL;DR: VLM judges can fail at data comparison!
PairBench helps you pick the right judge by testing alignment, symmetry, smoothness & controllability, ensuring reliable auto-evaluation.
📄 Paper: arxiv.org/abs/2502.15210
🧵 Thread: 👇
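One of the properties named above, symmetry, is easy to picture: a reliable judge should score a pair the same regardless of argument order. A minimal sketch of such a check follows; the `judge` interface and `toy_judge` stand-in are hypothetical illustrations, not the PairBench API:

```python
def symmetry_gap(judge, pairs):
    """Average |judge(a, b) - judge(b, a)| over sample pairs.

    A large gap signals order bias in the judge: it prefers
    whichever item appears in a particular argument slot.
    """
    gaps = [abs(judge(a, b) - judge(b, a)) for a, b in pairs]
    return sum(gaps) / len(gaps)

# Toy stand-in judge: word-overlap similarity (illustration only).
def toy_judge(a, b):
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

pairs = [("a red cat", "a red dog"), ("blue sky", "sky blue")]
print(symmetry_gap(toy_judge, pairs))  # 0.0: this toy judge is symmetric by construction
```

A real VLM judge has no such guarantee, which is why measuring the gap empirically, as PairBench does for its judges, matters.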
We're really excited to release this large collaborative work unifying web agent benchmarks under the same roof.
In this TMLR paper, we dive in depth into #BrowserGym and #AgentLab. We also present some unexpected performance results from Claude 3.5 Sonnet.
If you want to know all about the exciting stuff we do with web agents at @servicenowresearch.bsky.social, register here and interact with our team, including the amazing @alex-lacoste.bsky.social and @adrouinenv.bsky.social.
We would be delighted to come and see you ;)
Me :)
Thrilled to launch BigDocs, an open multimodal dataset set to transform document understanding! It is our contribution to the VLM community, supporting transparency in multimodal document reasoning. Proud to work with the most passionate and amazing team @servicenowresearch.bsky.social!
AgentLab diagram. The image describes AgentLab, a framework for efficient parallel experiments with agents. It highlights: Core Agent Features: Dynamic Prompting and a Unified LLM API for interacting with large language models. BrowserGym Platform: A tool for testing agents on benchmarks like WebArena, WorkArena, MiniWoB, and others. Key Features: Reproducibility, a Unified Leaderboard, an analysis tool called Xray, and a Dataset for sharing agent traces. Blue elements represent AgentLab components.
🧵 (1)
We are thrilled to release #AgentLab, a new open-source package for developing and evaluating web agents. This builds on the new #BrowserGym package, which supports 10 different benchmarks, including #WebArena.
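Evaluating an agent in an environment like the ones BrowserGym wraps boils down to a gym-style observe/act loop. A self-contained sketch of that pattern with stub classes is below; the class names and method signatures are illustrative assumptions for this post, not the actual BrowserGym or AgentLab API:

```python
class StubEnv:
    """Tiny stand-in for a web environment: the task finishes after 3 steps."""
    def reset(self):
        self.step_count = 0
        return {"step": 0}  # initial observation
    def step(self, action):
        self.step_count += 1
        obs = {"step": self.step_count}
        done = self.step_count >= 3
        reward = 1.0 if done else 0.0  # reward only on task completion
        return obs, reward, done

class StubAgent:
    """Stand-in for an LLM-backed agent: maps an observation to an action."""
    def act(self, obs):
        return f"click(next)  # after step {obs['step']}"

def run_episode(env, agent, max_steps=10):
    """Standard evaluation loop: observe, act, accumulate reward until done."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        obs, reward, done = env.step(agent.act(obs))
        total += reward
        if done:
            break
    return total

print(run_episode(StubEnv(), StubAgent()))  # 1.0: the stub task was completed
```

Benchmarks then differ mainly in what the environment returns as observations (DOM, screenshot, accessibility tree) and how reward is computed; the loop itself stays the same, which is what lets one harness cover many benchmarks.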