Javier Rando's Avatar

Javier Rando

@javirandor.com

Red-Teaming LLMs / PhD student at ETH Zurich / Prev. research intern at Meta / People call me Javi / Vegan 🌱 Website: javirando.com

284
Followers
97
Following
45
Posts
25.11.2024
Joined
Posts Following

Latest posts by Javier Rando @javirandor.com

Thank you so much for the invite!

18.02.2025 22:05 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Preview
Adversarial ML Problems Are Getting Harder to Solve and to Evaluate In the past decade, considerable research effort has been devoted to securing machine learning (ML) models that operate in adversarial settings. Yet, progress has been slow even for simple "toy" probl...

We really hope this analysis can help the community better understand where we come from, where we stand, and what things may help us make meaningful progress in the future.

Co-authored with @jiezhang-ethz.bsky.social, Nicholas Carlini and @floriantramer.bsky.social

arxiv.org/abs/2502.02260

10.02.2025 16:24 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

We propose that adversarial ML research should clearly differentiate between two problems:

1️⃣ Real-world vulnerabilities. Attacks and defenses on ill-defined problems are valuable when harm is immediate.

2️⃣ Scientific understanding. We should study well-defined problems.

10.02.2025 16:24 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

We are aware that this is not a simple problem and some changes may actually have been for the better! For instance, we now study real-world challenges instead of academic β€œtoy” problems like β„“β‚š robustness. We tried to carefully discuss these alternative views in our work.

10.02.2025 16:24 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

We identify 3 core challenges that make adversarial ML for LLMs harder to define, harder to solve, and harder to evaluate. We then illustrate these with specific case studies: jailbreaks, un-finetunable models, poisoning, prompt injections, membership inference, and unlearning.

10.02.2025 16:24 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Perhaps most telling, unlike for image classifiers, manual attacks outperform automated methods at finding worst-case inputs for LLMs! This challenges our ability to automatically evaluate the worst-case robustness of protections and benchmark progress.

10.02.2025 16:24 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Now, the field has shifted to LLMs, where we consider subjective notions of safety, allow for unbounded threat models, and evaluate closed-source systems that constantly change. These changes are hindering our ability to produce meaningful scientific progress.

10.02.2025 16:24 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

Back in the 🐼 days, we dealt with well-defined tasks: misclassify an image by slightly perturbing pixels within an β„“β‚š-ball. Also, attack success and defense utility could be easily measured with classification accuracy. Simple objectives that we could rigorously benchmark.

10.02.2025 16:24 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

Adversarial ML research is evolving, but not necessarily for the better. In our new paper, we argue that LLMs have made problems harder to solve, and even tougher to evaluate. Here’s why another decade of work might still leave us without meaningful progress. πŸ‘‡

10.02.2025 16:24 πŸ‘ 2 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Cohere For AI - Javier Rando, AI Safety PhD Student at ETH ZΓΌrich Javier Rando, AI Safety PhD Student at ETH ZΓΌrich - Poisoned Training Data Can Compromise LLMs

Looking forward to this presentation. You can add it to your calendar here cohere.com/events/coher...

20.01.2025 15:39 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Preview
Persistent Pre-Training Poisoning of LLMs Large language models are pre-trained on uncurated text datasets consisting of trillions of tokens scraped from the Web. Prior work has shown that: (1) web-scraped pre-training datasets can be practic...

Recently, we have demonstrated that small amounts of poisoned data posted online could compromise large-scale pretraining with backdoors that persist even after alignment arxiv.org/abs/2410.13722

20.01.2025 15:39 πŸ‘ 0 πŸ” 1 πŸ’¬ 1 πŸ“Œ 1
Preview
Universal Jailbreak Backdoors from Poisoned Human Feedback Reinforcement Learning from Human Feedback (RLHF) is used to align large language models to produce helpful and harmless responses. Yet, prior work showed these models can be jailbroken by finding adv...

We poisoned RLHF to introduce backdoors in LLMs that allowed adversaries to elicit harmful generations easily arxiv.org/abs/2311.14455

20.01.2025 15:39 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Cohere For AI - Javier Rando, AI Safety PhD Student at ETH ZΓΌrich Javier Rando, AI Safety PhD Student at ETH ZΓΌrich - Poisoned Training Data Can Compromise LLMs

This Thursday, I will be presenting my work on poisoning RLHF and LLM pretraining @cohereforai.bsky.social

More info here cohere.com/events/coher...

20.01.2025 15:39 πŸ‘ 4 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

Recent LLM forecasters are getting better at predicting the future. But there's a challenge: How can we evaluate and compare AI forecasters without waiting years to see which predictions were right? (1/11)

11.01.2025 01:53 πŸ‘ 5 πŸ” 2 πŸ’¬ 1 πŸ“Œ 0

Tomorrow @jakublucki.bsky.social will be presenting the BEST TECHNICAL PAPER at the SoLaR workshop at NeurIPS. Come check our poster and his oral presentation!

14.12.2024 03:43 πŸ‘ 7 πŸ” 1 πŸ’¬ 0 πŸ“Œ 0

I am at NeurIPS πŸ‡¨πŸ‡¦, please reach out if you want to grab a coffee!

12.12.2024 22:36 πŸ‘ 4 πŸ” 2 πŸ’¬ 0 πŸ“Œ 0

I am in beautiful Vancouver for #NeurIPS2024 with those amazing folks!
Say hi if you want to chat about ML privacy and security
(or speciality β˜•)

10.12.2024 19:48 πŸ‘ 0 πŸ” 1 πŸ’¬ 0 πŸ“Œ 0
SPY Lab We are a research group at ETH ZΓΌrich studying how to build secure and private AI.

From left to right the amazing @nkristina.bsky.social @jiezhang-ethz.bsky.social @edebenedetti.bsky.social @javirandor.com @aemai.bsky.social and @dpaleka.bsky.social!

We work on AI Security/Safety/Privacy. Find out more about work in our lab website spylab.ai

10.12.2024 19:43 πŸ‘ 3 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Post image

SPY Lab is in Vancouver for NeurIPS! Come say hi if you see us around πŸ•΅οΈ

10.12.2024 19:43 πŸ‘ 10 πŸ” 2 πŸ’¬ 1 πŸ“Œ 1
LLMail Inject

Check out all the details in the offical website llmailinject.azurewebsites.net

09.12.2024 17:06 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

A new competition on LLM-agents prompt injection is out! Send malicious emails and get agents to perform unauthorised actions.

The competition is hosted at SaTML 2025 and has a pool of $10k in prizes! What are you waiting for?

09.12.2024 17:06 πŸ‘ 6 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Preview
An Adversarial Perspective on Machine Unlearning for AI Safety Large language models are finetuned to refuse questions about hazardous knowledge, but these protections can often be bypassed. Unlearning methods aim at completely removing hazardous capabilities fro...

2) An Adversarial Perspective on Machine Unlearning for AI Safety

πŸ† Best paper award
@solarneurips

πŸ“… Sat 14 Dec. Poster at 11am and Talk in the afternoon.
πŸ“ Room West Meeting 121,122

Paper: arxiv.org/abs/2409.18025

09.12.2024 17:02 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Preview
Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition Large language model systems face important security risks from maliciously crafted messages that aim to overwrite the system's original instructions or leak private data. To study this problem, we or...

1) Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition.

πŸ“… Fri 13 Dec 4:30 p.m. PST β€” 7:30 p.m. PST
πŸ“ Spotlight Poster #5203 (West Ballroom A-D)

arxiv.org/abs/2406.07954

09.12.2024 17:02 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

I will be at #NeurIPS2024 in Vancouver. I am excited to meet people working on AI Safety and Security. Drop a DM if you want to meet.

I will be presenting two (spotlight!) works. Come say hi to our posters.

09.12.2024 17:02 πŸ‘ 4 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
Post image

🚨Unlearned hazardous knowledge can be retrieved from LLMs 🚨

Our results show that current unlearning methods for AI safety only obfuscate dangerous knowledge, just like standard safety training.

Here's what we foundπŸ‘‡

06.12.2024 17:47 πŸ‘ 12 πŸ” 3 πŸ’¬ 1 πŸ“Œ 0
SPY Lab We are a research group at ETH ZΓΌrich studying how to build secure and private AI.

We are not OpenAI, but if you are looking for a PhD or PostDoc on AI Safety/Security/Privacy in Zurich, you should take a look at spylab.ai and come work with us and
@floriantramer.bsky.social

04.12.2024 13:51 πŸ‘ 3 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

Come do open AI with us in Zurich!
We're hiring PhD students, postdocs (and faculty!)

04.12.2024 13:49 πŸ‘ 11 πŸ” 3 πŸ’¬ 0 πŸ“Œ 1
Preview
AI Safety and Security Join the conversation

I am curating a list of researchers working on AI Safety and Security here go.bsky.app/BcjeVbN.

Reply to this post with your user or other people you think should be included!

04.12.2024 10:38 πŸ‘ 12 πŸ” 3 πŸ’¬ 3 πŸ“Œ 2

Zurich is a great place to live and do research. It became a slightly better one overnight! Excited to see OAI opening an office here with such a great starting team πŸŽ‰

04.12.2024 09:46 πŸ‘ 9 πŸ” 2 πŸ’¬ 1 πŸ“Œ 1

Great opportunity to do impactful work on AI alignment!

02.12.2024 16:07 πŸ‘ 4 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0