Our new paper in #PNAS (bit.ly/4fcWfma) presents a surprising finding—when words change meaning, older speakers rapidly adopt the new usage; inter-generational differences are often minor.
w/ Michelle Yang, @sivareddyg.bsky.social, @msonderegger.bsky.social, and @dallascard.bsky.social 👇 (1/12)
29.07.2025 12:05
👍 34
🔁 17
💬 3
📌 2
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories
We are releasing the first benchmark for measuring how well automatic evaluators, such as LLM judges, assess web agent trajectories.
15.04.2025 19:10
👍 7
🔁 4
💬 1
📌 1
Thoughtology paper is out!! 🔥🐳
We study the reasoning chains of DeepSeek-R1 across a variety of tasks and find several surprising and interesting phenomena!
Incredible effort by the entire team!
🌐: mcgill-nlp.github.io/thoughtology/
02.04.2025 07:10
👍 4
🔁 0
💬 0
📌 0
Exploiting Instruction-Following Retrievers for Malicious Information Retrieval
Parishad BehnamGhader, Nicholas Meade, Siva Reddy
Instruction-following retrievers can efficiently and accurately search for harmful and sensitive information on the internet! 🌐💣
Retrievers need to be aligned too! 🚨🚨🚨
Work done with the wonderful Nick and @sivareddyg.bsky.social
🔗 mcgill-nlp.github.io/malicious-ir/
Thread: 🧵👇
12.03.2025 16:15
👍 12
🔁 8
💬 1
📌 0
Llamas browsing the web look cute, but they are capable of causing a lot of harm!
Check out our new Web Agents ∩ Safety benchmark: SafeArena!
Paper: arxiv.org/abs/2503.04957
10.03.2025 17:50
👍 9
🔁 3
💬 0
📌 0
Paper: arxiv.org/pdf/2502.14678
Data: tinyurl.com/chase-data
Code: github.com/McGill-NLP/C...
21.02.2025 16:28
👍 2
🔁 1
💬 0
📌 0
𝐍𝐨𝐭𝐞: Our work is a preliminary exploration of automatically generating high-quality, challenging benchmarks for LLMs. We discuss concrete limitations and the large scope for future work in the paper.
21.02.2025 16:28
👍 2
🔁 0
💬 1
📌 0
Results:
- SOTA LLMs achieve 40-60% performance
- 𝐂𝐇𝐀𝐒𝐄 distinguishes between models well (unlike standard benchmarks such as GSM8K, where models score similarly)
- While LLMs today have 128k-1M context sizes, 𝐂𝐇𝐀𝐒𝐄 shows they struggle to reason even at ~50k context size
21.02.2025 16:28
👍 2
🔁 0
💬 1
📌 0
𝐂𝐇𝐀𝐒𝐄 uses 2 simple ideas:
1. Bottom-up creation of complex context by “hiding” components of the reasoning process
2. Decomposing the generation pipeline into simpler, “soft-verifiable” sub-tasks
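
A toy sketch of how these two ideas might fit together. This is not the authors' pipeline. The function names `build_hidden_context` and `soft_verify` are hypothetical, and a plain membership check stands in for whatever verification the paper actually uses:

```python
import random


def build_hidden_context(reasoning_steps, distractors, seed=0):
    """Idea 1 (assumed form): start from the facts an answer needs and
    'hide' them by shuffling them in among unrelated distractor passages."""
    rng = random.Random(seed)
    passages = list(reasoning_steps) + list(distractors)
    rng.shuffle(passages)
    return passages


def soft_verify(passages, reasoning_steps):
    """Idea 2 (assumed form): check one small sub-task on its own.
    Here the check is only that every needed fact survived in the context."""
    return all(step in passages for step in reasoning_steps)


steps = ["A costs 3 dollars", "B costs twice as much as A"]
noise = ["C is blue", "D was built in 1990"]
context = build_hidden_context(steps, noise, seed=1)
print(len(context), soft_verify(context, steps))
```

Checking each generation step separately like this keeps every check simple, even when the final problem is hard.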
21.02.2025 16:28
👍 2
🔁 0
💬 1
📌 0
𝐂𝐇𝐀𝐒𝐄 automatically generates challenging evaluation problems across 3 domains:
1. 𝐂𝐇𝐀𝐒𝐄-𝐐𝐀: Long-context question answering
2. 𝐂𝐇𝐀𝐒𝐄-𝐂𝐨𝐝𝐞: Repo-level code generation
3. 𝐂𝐇𝐀𝐒𝐄-𝐌𝐚𝐭𝐡: Math reasoning
21.02.2025 16:28
👍 2
🔁 0
💬 1
📌 0
Why synthetic data for evaluation?
- Creating “hard” problems using humans is expensive (and may hit a limit soon!)
- Impractical for humans to annotate long-context data
- Other benefits: scalable, renewable, mitigates contamination concerns
21.02.2025 16:28
👍 3
🔁 0
💬 1
📌 0
Presenting ✨ 𝐂𝐇𝐀𝐒𝐄: 𝐆𝐞𝐧𝐞𝐫𝐚𝐭𝐢𝐧𝐠 𝐜𝐡𝐚𝐥𝐥𝐞𝐧𝐠𝐢𝐧𝐠 𝐬𝐲𝐧𝐭𝐡𝐞𝐭𝐢𝐜 𝐝𝐚𝐭𝐚 𝐟𝐨𝐫 𝐞𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧 ✨
Work w/ fantastic advisors Dima Bahdanau and @sivareddyg.bsky.social
Thread 🧵:
21.02.2025 16:28
👍 17
🔁 8
💬 1
📌 1