Arkil Patel (@arkil)

Our new paper in #PNAS (bit.ly/4fcWfma) presents a surprising finding—when words change meaning, older speakers rapidly adopt the new usage; inter-generational differences are often minor.

w/ Michelle Yang, ‪@sivareddyg.bsky.social‬ , @msonderegger.bsky.social‬ and @dallascard.bsky.social‬👇(1/12)

29.07.2025 12:05 👍 34 🔁 17 💬 3 📌 2

AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories

We are releasing the first benchmark to evaluate how well automatic evaluators, such as LLM judges, can evaluate web agent trajectories.

15.04.2025 19:10 👍 7 🔁 4 💬 1 📌 1

Thoughtology paper is out!! 🔥🐳

We study the reasoning chains of DeepSeek-R1 across a variety of tasks and find several surprising and interesting phenomena!

Incredible effort by the entire team!

🌐: mcgill-nlp.github.io/thoughtology/

02.04.2025 07:10 👍 4 🔁 0 💬 0 📌 0

Exploiting Instruction-Following Retrievers for Malicious Information Retrieval Parishad BehnamGhader, Nicholas Meade, Siva Reddy

Instruction-following retrievers can efficiently and accurately search for harmful and sensitive information on the internet! 🌐💣

Retrievers need to be aligned too! 🚨🚨🚨

Work done with the wonderful Nick and @sivareddyg.bsky.social

🔗 mcgill-nlp.github.io/malicious-ir/
Thread: 🧵👇

12.03.2025 16:15 👍 12 🔁 8 💬 1 📌 0

Llamas browsing the web look cute, but they are capable of causing a lot of harm!

Check out our new Web Agents ∩ Safety benchmark: SafeArena!

Paper: arxiv.org/abs/2503.04957

10.03.2025 17:50 👍 9 🔁 3 💬 0 📌 0

Paper: arxiv.org/pdf/2502.14678

Data: tinyurl.com/chase-data

Code: github.com/McGill-NLP/C...

21.02.2025 16:28 👍 2 🔁 1 💬 0 📌 0

𝐍𝐨𝐭𝐞: Our work is a preliminary exploration into attempting to automatically generate high quality challenging benchmarks for LLMs. We discuss concrete limitations and huge scope for future work in the paper.

21.02.2025 16:28 👍 2 🔁 0 💬 1 📌 0

Results:

- SOTA LLMs achieve 40-60% performance
- 𝐂𝐇𝐀𝐒𝐄 distinguishes between models well (as opposed to similar performances on standard benchmarks like GSM8k)
- While LLMs today have 128k-1M context sizes, 𝐂𝐇𝐀𝐒𝐄 shows they struggle to reason even at ~50k context size

21.02.2025 16:28 👍 2 🔁 0 💬 1 📌 0

𝐂𝐇𝐀𝐒𝐄 uses 2 simple ideas:

1. Bottom-up creation of complex context by “hiding” components of reasoning process
2. Decomposing generation pipeline into simpler, "soft-verifiable" sub-tasks

21.02.2025 16:28 👍 2 🔁 0 💬 1 📌 0

𝐂𝐇𝐀𝐒𝐄 automatically generates challenging evaluation problems across 3 domains:

1. 𝐂𝐇𝐀𝐒𝐄-𝐐𝐀: Long-context question answering
2. 𝐂𝐇𝐀𝐒𝐄-𝐂𝐨𝐝𝐞: Repo-level code generation
3. 𝐂𝐇𝐀𝐒𝐄-𝐌𝐚𝐭𝐡: Math reasoning

21.02.2025 16:28 👍 2 🔁 0 💬 1 📌 0

Why synthetic data for evaluation?

- Creating “hard” problems using humans is expensive (and may hit a limit soon!)
- Impractical for humans to annotate long-context data
- Other benefits: scalable, renewable, mitigate contamination concerns

21.02.2025 16:28 👍 3 🔁 0 💬 1 📌 0

Presenting ✨ 𝐂𝐇𝐀𝐒𝐄: 𝐆𝐞𝐧𝐞𝐫𝐚𝐭𝐢𝐧𝐠 𝐜𝐡𝐚𝐥𝐥𝐞𝐧𝐠𝐢𝐧𝐠 𝐬𝐲𝐧𝐭𝐡𝐞𝐭𝐢𝐜 𝐝𝐚𝐭𝐚 𝐟𝐨𝐫 𝐞𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧 ✨

Work w/ fantastic advisors Dima Bahdanau and @sivareddyg.bsky.social

Thread 🧵:

21.02.2025 16:28 👍 17 🔁 8 💬 1 📌 1

Arkil Patel

Latest posts by Arkil Patel @arkil