Thank you so much for the invite!
We really hope this analysis can help the community better understand where we come from, where we stand, and what things may help us make meaningful progress in the future.
Co-authored with @jiezhang-ethz.bsky.social, Nicholas Carlini and @floriantramer.bsky.social
arxiv.org/abs/2502.02260
We propose that adversarial ML research should clearly differentiate between two problems:
1️⃣ Real-world vulnerabilities. Attacks and defenses on ill-defined problems are valuable when the harm is immediate.
2️⃣ Scientific understanding. Making measurable progress here requires studying well-defined problems.
We are aware that this is not a simple problem and some changes may actually have been for the better! For instance, we now study real-world challenges instead of academic "toy" problems like ℓ∞ robustness. We tried to carefully discuss these alternative views in our work.
We identify 3 core challenges that make adversarial ML for LLMs harder to define, harder to solve, and harder to evaluate. We then illustrate these with specific case studies: jailbreaks, un-finetunable models, poisoning, prompt injections, membership inference, and unlearning.
Perhaps most tellingly, unlike for image classifiers, manual attacks outperform automated methods at finding worst-case inputs for LLMs! This challenges our ability to automatically evaluate the worst-case robustness of protections and to benchmark progress.
Now, the field has shifted to LLMs, where we consider subjective notions of safety, allow for unbounded threat models, and evaluate closed-source systems that constantly change. These changes are hindering our ability to produce meaningful scientific progress.
Back in the 🖼️ days, we dealt with well-defined tasks: misclassify an image by slightly perturbing pixels within an ℓ∞-ball. Also, attack success and defense utility could be easily measured with classification accuracy. Simple objectives that we could rigorously benchmark.
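As a rough illustration of that classic setup (not from the paper), here is a minimal PyTorch sketch using FGSM as one simple instance of an ℓ∞-bounded attack; `model`, `loader`, and `eps` are hypothetical placeholders. Each pixel is perturbed by at most eps, and "defense utility" is just classification accuracy on the attacked inputs.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=8 / 255):
    # One-step l_inf attack: move each pixel by at most eps in the
    # direction that increases the classification loss.
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    x_adv = x + eps * x.grad.sign()    # stays inside the l_inf ball of radius eps
    return x_adv.clamp(0, 1).detach()  # keep pixels in a valid image range

def robust_accuracy(model, loader, eps=8 / 255):
    # Evaluation is simply accuracy on adversarially perturbed inputs.
    correct = total = 0
    for x, y in loader:
        preds = model(fgsm_attack(model, x, y, eps)).argmax(dim=1)
        correct += (preds == y).sum().item()
        total += y.numel()
    return correct / total
```

The point of the sketch: both the attack constraint (the ℓ∞-ball) and the success metric (accuracy) are precisely defined, which is exactly what gets lost once we move to subjective safety notions for LLMs.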
Adversarial ML research is evolving, but not necessarily for the better. In our new paper, we argue that LLMs have made problems harder to solve, and even tougher to evaluate. Here's why another decade of work might still leave us without meaningful progress. 👇
Looking forward to this presentation. You can add it to your calendar here cohere.com/events/coher...
Recently, we have demonstrated that small amounts of poisoned data posted online could compromise large-scale pretraining with backdoors that persist even after alignment: arxiv.org/abs/2410.13722
We poisoned RLHF to introduce backdoors in LLMs that allowed adversaries to easily elicit harmful generations: arxiv.org/abs/2311.14455
This Thursday, I will be presenting my work on poisoning RLHF and LLM pretraining @cohereforai.bsky.social
More info here cohere.com/events/coher...
Recent LLM forecasters are getting better at predicting the future. But there's a challenge: How can we evaluate and compare AI forecasters without waiting years to see which predictions were right? (1/11)
Tomorrow @jakublucki.bsky.social will be presenting the BEST TECHNICAL PAPER at the SoLaR workshop at NeurIPS. Come check out our poster and his oral presentation!
I am at NeurIPS 🇨🇦, please reach out if you want to grab a coffee!
I am in beautiful Vancouver for #NeurIPS2024 with those amazing folks!
Say hi if you want to chat about ML privacy and security
(or speciality ☕)
From left to right the amazing @nkristina.bsky.social @jiezhang-ethz.bsky.social @edebenedetti.bsky.social @javirandor.com @aemai.bsky.social and @dpaleka.bsky.social!
We work on AI Security/Safety/Privacy. Find out more about our work on our lab website spylab.ai
SPY Lab is in Vancouver for NeurIPS! Come say hi if you see us around 🕵️
Check out all the details on the official website llmailinject.azurewebsites.net
A new competition on LLM-agent prompt injection is out! Send malicious emails and get agents to perform unauthorised actions.
The competition is hosted at SaTML 2025 and has a pool of $10k in prizes! What are you waiting for?
2) An Adversarial Perspective on Machine Unlearning for AI Safety
🏆 Best paper award
@solarneurips
📅 Sat 14 Dec. Poster at 11am and Talk in the afternoon.
📍 Room West Meeting 121,122
Paper: arxiv.org/abs/2409.18025
1) Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition.
📅 Fri 13 Dec, 4:30 p.m. PST – 7:30 p.m. PST
📍 Spotlight Poster #5203 (West Ballroom A-D)
arxiv.org/abs/2406.07954
I will be at #NeurIPS2024 in Vancouver. I am excited to meet people working on AI Safety and Security. Drop a DM if you want to meet.
I will be presenting two (spotlight!) works. Come say hi to our posters.
🚨 Unlearned hazardous knowledge can be retrieved from LLMs 🚨
Our results show that current unlearning methods for AI safety only obfuscate dangerous knowledge, just like standard safety training.
Here's what we found 👇
We are not OpenAI, but if you are looking for a PhD or PostDoc on AI Safety/Security/Privacy in Zurich, you should take a look at spylab.ai and come work with us and
@floriantramer.bsky.social
Come do open AI with us in Zurich!
We're hiring PhD students, postdocs (and faculty!)
I am curating a list of researchers working on AI Safety and Security here go.bsky.app/BcjeVbN.
Reply to this post with your username or other people you think should be included!
Zurich is a great place to live and do research. It became a slightly better one overnight! Excited to see OAI opening an office here with such a great starting team 🎉
Great opportunity to do impactful work on AI alignment!