This has been a massive community project, and we need you all to participate!
See more: evalevalai.com/projects/eve...
This 100%
This has to be rage bait. Did we not see the South Park episode where ChatGPT suggested a business idea to convert fries to salad? (And yes, I tried the prompt myself too)
Brisket for thanksgiving >>>> Turkey for thanksgiving
Who is winning the open AI race?
Our new study Economies of Open Intelligence maps 851k @hf.co models' downloads, 2020–2025.
1) Power rebalance: US tech ↓; China + community ↑
2) Model size ↓ & efficiency ↑ (MoE, quant, multimodal)
3) Intermediary layers ↑ (adapters/quantizers)
4) Transparency ↓
/🧵
I used to love the word "key" until AI models decided to love it, and now I cringe at "key takeaways" in text material :(
It's that time of the year again! I'll be at @neuripsconf.bsky.social this year too :) If you're interested in Responsible AI, AI Evals ( @eval-eval.bsky.social ), or AI4Science (Hugging Science), say hi!
🚨 AI keeps scaling, but social impact evaluations aren't, and the data proves it 🚨
Our new paper, "Who Evaluates AI's Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations," analyzes hundreds of evaluation reports and reveals major blind spots ‼️🧵 (1/7)
A (very incomplete) frontend of Eval Cards can be found here: evalcards.evalevalai.com, and we are now collecting eval datasets (to show in eval cards) on github: github.com/evaleval/eve...
If you want to help see eval cards come alive, get in touch!
Finally, what's next from here? Almost every developer we spoke to said that what we need is a standardized way of reporting, aggregating, and comparing all the evals done by both 1st and 3rd parties for a model. This is actually our next project: Eval Cards!
Incredible work done with literally the smartest and most passionate researchers I am lucky to work with. Paper co-led with @ankareuel.bsky.social and Jenny Chim, and other co-authors!
Read the detailed results here: arxiv.org/abs/2511.05613
We also release the code, and the full annotated dataset on Hugging Face (link in paper).
This only strengthens our position that good-quality, independent third-party evaluations are paramount for AI safety.
First-party reports are less transparent or of lower quality. In interviews with eval practitioners, we found that companies have laid off or reassigned teams dedicated to documentation & social impact evals, or told them to focus more on capability reporting.
This is true even at the provider level. We find, for example, that Google did far more reporting on its model evaluations in 2022 and 2023 but reduced it in the Gemini era; the same pattern holds for Meta over successive Llama versions.
We find that model developers have become less transparent about their eval results over time. For instance, environmental cost reporting in first-party reports (release docs, model cards, system cards) has declined drastically. Fewer than 15% mention labor or the environment!
We take a look at the entire eval landscape, specifically social impact evals across 7 dimensions: Bias & Harm, Sensitive Content, Performance Disparity, Env. Costs & Emissions, Privacy & Data, Financial Costs, and Moderation Labor. Who is reporting these evals?
Extremely thrilled to talk about our new paper: "Who Evaluates AI's Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations".
This is the first big project output from the
@eval-eval.bsky.social coalition! Thread below:
… this looks like the Nature font oh no
We have a call for posters out! Please submit your extended abstracts; it should be quick and easy. And just like last year, provocative work is especially encouraged, as it makes for such interesting conversation!
This. Copyright is a tool for protection, but it's not everything. In fact, there's research showing that it is possible to create competitive language models using public domain data only. The proliferation of copyright-respecting models would not solve the labor impact policy problem.
Going to San Diego for NeurIPS? We at @eval-eval.bsky.social, along with the UK AISI, are hosting a closed-door state-of-evals workshop at @ucsandiego.bsky.social on Dec 8th.
Request to join below! :)
evaleval.github.io/events/works...
The thing about non-survey papers is that they can still be problematic, fake science, etc., and arXiv needs a long-overdue, moderated comments section.
Datasets are the backbone of AI for Science, and we want to support scientific data natively on Hugging Face. The amazing @lhoestq.hf.co started a discussion on GH for this! Please engage (better still, submit a PR) so we can start supporting your 🫵 dataset:
github.com/huggingface/...
Yes! The Science/Tech/Cyber committee is doing really good work too. Well intentioned folks there trying to actually engage with researchers and industry folks. Love MA
Random off-the-cuff observation about American AI: LLM folks seem to be concentrated in SF, but AI4Science folks seem to be concentrated in Boston. Meaning, as the former gets oversaturated and the latter is only getting started, I expect Boston to be the next big AI epicenter! 💪
Weekly AI Evaluation Spotlight
🤔 Did you know malicious actors can exploit trust in AI leaderboards to promote poisoned models in the community?
This week's paper, "Exploiting Leaderboards for Large-Scale Distribution of Malicious Models" by @iamgroot42.bsky.social, explores this!
Oof
I have started requesting that panel moderators provide a disclaimer at panels I am on that my opinions are my own and not necessarily my employer's. HF ppl largely believe in democratization of AI and open source, but we actually have intense, healthy debates internally on edge topics! It's great :)
+1000. I miss life pre-AI hype, when the discourse around AI was more scientific and people attributed papers and opinions to scientists instead of to their companies. Not all orgs gate research papers through legal-team sanity checks, and HF, especially, is very distributed.