This has been a massive community project, and we need you all to participate!
See more: evalevalai.com/projects/eve...
This 100%
This has to be rage bait. Did we not see the South Park episode where ChatGPT suggested a business idea to convert fries to salad? (And yes, I tried the prompt myself too)
Brisket for thanksgiving >>>> Turkey for thanksgiving
Who is winning the open AI race?
Our new study Economies of Open Intelligence maps 851k @hf.co models' downloads, 2020–2025.
1) Power rebalance: US tech ↓; China + community ↑
2) Model size ↓ & efficiency ↑ (MoE, quant, multimodal)
3) Intermediary layers ↑ (adapters/quantizers)
4) Transparency ↓
/🧵
I used to love the word "key" until AI models decided to love it, and now I cringe at "key takeaways" in text material :(
It's that time of the year again! I'll be at @neuripsconf.bsky.social this year too :) If you're interested in Responsible AI, AI Evals ( @eval-eval.bsky.social ), or AI4Science (Hugging Science), say hi!
🚨 AI keeps scaling, but social impact evaluations aren't, and the data proves it 🚨
Our new paper, "Who Evaluates AI's Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations," analyzes hundreds of evaluation reports and reveals major blind spots ‼️🧵 (1/7)
A (very incomplete) frontend of Eval Cards can be found here: evalcards.evalevalai.com, and we are now collecting eval datasets (to show in eval cards) on github: github.com/evaleval/eve...
If you want to help see eval cards come alive, get in touch!
Finally, what's next from here? Almost every developer we spoke to said that what we need is a standardized way of reporting, aggregating, and comparing all the evals done by both 1st and 3rd parties for a model. This is actually our next project: Eval Cards!
Incredible work done with literally the smartest and most passionate researchers I am lucky to work with. Paper co-led with @ankareuel.bsky.social and Jenny Chim, and other co-authors!
Read the detailed results here: arxiv.org/abs/2511.05613
We also release the code, and the full annotated dataset on Hugging Face (link in paper).
This only strengthens our position that good-quality, independent third-party evaluations are paramount for AI safety.
First-party reports are less transparent or of lower quality. In interviews with eval practitioners, we found that companies have laid off or reassigned teams dedicated to documentation & social impact evals, or told them to focus more on capability reporting.
This is true even at the provider level. We find, for example, that Google did far more reporting on its model evaluations in 2022 and 2023 but reduced it in the Gemini era; the same pattern holds for Meta over successive Llama versions.
We find that model developers have become less transparent about their eval results over time. For instance, environmental cost reporting in first-party reports (release docs, model cards, system cards) has declined drastically. Fewer than 15% mention labor or the environment!
We take a look at the entire eval landscape, specifically social impact evals across 7 dimensions: Bias & Harm, Sensitive Content, Performance Disparity, Env. Costs & Emissions, Privacy & Data, Financial Costs, and Moderation Labor. Who is reporting these evals?
Extremely thrilled to talk about our new paper: "Who Evaluates AI's Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations".
This is the first big project output from the
@eval-eval.bsky.social coalition! Thread below:
… this looks like the Nature font oh no
We have a call for posters out! Please submit your extended abstracts; it should be quick and easy. And just like last year, provocative work is especially encouraged, as it makes for such interesting conversation!
This. Copyright is a tool for protection, but it's not everything. In fact, there's research showing that it is possible to create competitive language models using public domain data only. The proliferation of copyright-respecting models would not solve the labor impact policy problem.
Going to San Diego for NeurIPS? We at @eval-eval.bsky.social, along with the UK AISI, are hosting a closed-door state-of-evals workshop at @ucsandiego.bsky.social on Dec 8th.
Request to join below! :)
evaleval.github.io/events/works...
The thing about non-survey papers is that they can still be problematic, fake science, etc., and arXiv needs a long-overdue, moderated comments section.
Datasets are the backbone of AI for Science, and we want to support scientific data natively on Hugging Face. The amazing @lhoestq.hf.co started a discussion on GH for this! Please engage (better still, submit a PR) so we can start supporting your 🫵 dataset:
github.com/huggingface/...
Yes! The Science/Tech/Cyber committee is doing really good work too. Well intentioned folks there trying to actually engage with researchers and industry folks. Love MA
Random off-the-cuff observation about American AI: LLM folks seem to be concentrated in SF, but AI4Science folks seem to be concentrated in Boston. Meaning, as the former gets oversaturated and the latter is only getting started, I expect Boston to be the next big AI epicenter! 💪
Weekly AI Evaluation Spotlight
🤔 Did you know malicious actors can exploit trust in AI leaderboards to promote poisoned models in the community?
This week's paper, "Exploiting Leaderboards for Large-Scale Distribution of Malicious Models" by @iamgroot42.bsky.social, explores this!
Oof
I have started requesting that panel moderators provide a disclaimer at panels I am on that my opinions are my own and not necessarily my employer's. HF ppl largely believe in democratization of AI and open source, but we actually have intense, healthy debates internally on edge topics! It's great :)
+1000. I miss life pre-AI hype, when the discourse around AI was more scientific and people attributed papers and opinions to scientists instead of to their companies. Not all orgs gate research papers through legal-team sanity checks, and HF, especially, is very distributed.