Daniel Vila

@dvilasuero.hf.co

Everything datasets and human feedback for AI at Hugging Face. Prev: co-founder and CEO of Argilla (acquired by Hugging Face)

3,595 Followers · 573 Following · 56 Posts · Joined 31.10.2024

Latest posts by Daniel Vila @dvilasuero.hf.co

Post image

🚀 The open source community is unstoppable: 4M total downloads for DeepSeek models on @hf.co, with 3.2M coming from the +600 models created by the community. That's 30% more than yesterday!

28.01.2025 17:55 👍 8 🔁 1 💬 0 📌 0

💫 Generate RAG data with the Synthetic Data Generator to improve your RAG system!

1️⃣ Generate from your documents, dataset, or dataset description.
2️⃣ Configure it.
3️⃣ Generate the synthetic dataset.
4️⃣ Fine-tune the retrieval and reranking models.
5️⃣ Build a RAG pipeline.

20.01.2025 16:42 👍 12 🔁 3 💬 1 📌 0
Screenshot of the Introduction to Argilla in Chapter 10 of the Hugging Face NLP course

New chapter in the Hugging Face NLP course! 🤗 🚀

We've added a new chapter about the very basics of Argilla to the Hugging Face NLP course. Learn how to set up an Argilla instance, load & annotate datasets, and export them to the Hub.

Any feedback for improvements welcome!

17.01.2025 10:02 👍 14 🔁 1 💬 1 📌 0
Screenshot of this text:   Total annotations submitted: 50,035  Languages with annotations: 115  Total contributors: 419

🎉 50,000+ annotations reached! The FineWeb2-C community is helping build better language models one annotation at a time.

📊 Current stats:
- 115 languages represented
- 419 amazing contributors
- 24 languages with complete datasets

But we're not done yet! 🧵

16.01.2025 17:32 👍 18 🔁 6 💬 1 📌 0
Preview
Synthetic Data Generator - a Hugging Face Space by argilla Build datasets using natural language

You could try to generate one with this tool:

huggingface.co/spaces/argil...

07.01.2025 18:32 👍 3 🔁 0 💬 0 📌 0
Preview
Fine-tune a SmolLM on domain-specific synthetic data from a LLM A Blog post by David Berenstein on Hugging Face

High-quality data for fine-tuning language models for free and at the click of a button!

Prompt and wait for your dataset to push to Argilla or the Hub
Evaluate, review and fine-tune a model.

Blog:

07.01.2025 13:00 👍 10 🔁 2 💬 1 📌 0
Video thumbnail

Was 2024 the year of datasets? Is 2025 the year for community-built datasets?

It's exciting to see the progress of many languages in FineWeb-C:
- Total annotations submitted: 41,577
- Languages with annotations: 106
- Total contributors: 363

03.01.2025 12:00 👍 28 🔁 4 💬 0 📌 0
Progress bars showing remaining annotations needed for 15 languages in FineWeb-C dataset, ranging from 6 to 593 annotations needed

The finish line is near! We're building FineWeb-Edu for many languages and need your help 🤗

Many FineWeb-C languages are close to 1,000 annotations!

Assamese is 99.4% done, French needs 64 more annotations, Tamil: 216.

Please help us reach the goal: huggingface.co/spaces/data-...

06.01.2025 14:32 👍 20 🔁 5 💬 1 📌 1
Preview
Quickstart - Argilla Docs Get started with Argilla in less than 10 minutes

Get started:
docs.argilla.io/latest/getti...

20.12.2024 11:14 👍 3 🔁 0 💬 0 📌 0

Release notes:
github.com/argilla-io/a...

20.12.2024 11:14 👍 3 🔁 0 💬 1 📌 0
Video thumbnail

💥 Ending 2024: A full data annotation journey on the Hugging Face Hub, from raw data to training-ready datasets!

With Argilla 2.6.0, push your data to the Hub from the UI.

Let's make 2025 the year anyone can build more transparent and accountable AI, no coding or model skills needed.

20.12.2024 11:14 👍 20 🔁 3 💬 1 📌 0
Video thumbnail

🚀 Argilla v2.6.0 is here! 🎉

Let me show you how EASY it is to export your annotated datasets from Argilla to the Hugging Face Hub. 🤩

Take a look at this quick demo 👇

💁‍♂️ More info about the release at github.com/argilla-io/a...

#AI #MachineLearning #OpenSource #DataScience #HuggingFace #Argilla

19.12.2024 12:39 👍 11 🔁 5 💬 0 📌 1
Preview
Introducing the Synthetic Data Generator - Build Datasets with Natural Language We're on a journey to advance and democratize artificial intelligence through open source and open science.

🔥 We got great feedback on this: "Synthetic Data Generator"

A no-code tool to create datasets with LLMs, making it a breeze, allowing ANYONE to create datasets and models in minutes and without any code.

Blog: https://buff.ly/4gybyoT
GitHub: https://buff.ly/49IDSmd
Space: https://buff.ly/3Y1S99z

17.12.2024 07:18 👍 14 🔁 2 💬 1 📌 0
Preview
tam - தமிழ் - Tamil Join and contribute to the dataset tam - தமிழ் - Tamil

Well, around 10 percent of the initial goal is complete, and so far, it's been quite a one-man army effort. We're still in the hunt for more people to join and contribute to this open-source initiative.

@hf.co

data-is-better-together-fineweb-c.hf.space/share-your-p...

14.12.2024 07:33 👍 4 🔁 1 💬 1 📌 0
Preview
nds - Neddersass’sch - Low German Join and contribute to the dataset nds - Neddersass’sch - Low German

The sprint for crowdsourced annotations with Argilla is in full swing over at data-is-better-together-fineweb-c.hf.space

I've just contributed 100 examples to this dataset:
data-is-better-together-fineweb-c.hf.space/share-your-p...

Big thanks to @dvilasuero.hf.co, @nataliaelv.hf.co and team 🙌

13.12.2024 07:38 👍 12 🔁 1 💬 0 📌 0

I've been building a small library for working with prompt templates on the @huggingface.bsky.social Hub: `pip install prompt-templates`. Motivation:

The community currently shares prompt templates in a wide variety of formats: in datasets, in model cards, as strings in .py files, as .txt/... 🧵

12.12.2024 15:58 👍 16 🔁 4 💬 1 📌 0
Preview
sco - Scots - Scots Join and contribute to the dataset sco - Scots - Scots

Desperate to contribute to the development of Scots language AI. I've just contributed 16 examples to this dataset:

data-is-better-together-fineweb-c.hf.space/share-your-p...

12.12.2024 13:44 👍 10 🔁 3 💬 1 📌 0
Preview
spa - español - Spanish Join and contribute to the dataset spa - español - Spanish

I've just contributed 156 examples to the FineWeb 2 Spanish dataset:

data-is-better-together-fineweb-c.hf.space/share-your-p...

If you want to contribute, sign in with @hf.co and find your language

12.12.2024 13:23 👍 21 🔁 5 💬 0 📌 0
Preview
FineWeb-c - Annotation - a Hugging Face Space by data-is-better-together Discover amazing ML apps made by the community

Join this Space, search for your language, and start contributing:
huggingface.co/spaces/data-...

Don't know how to start, want to discuss? Join:
huggingface.co/spaces/Huggi...

10.12.2024 14:12 👍 1 🔁 0 💬 0 📌 0
Post image

Help shape the future of multilingual Open Source AI!

Join the FineWeb 2 Community Annotation Sprint to create an open training dataset with full transparency and human validation in many languages.

Review datasets in your language and help identify the best sources for training.

10.12.2024 14:12 👍 21 🔁 3 💬 1 📌 0
Post image

✨ Argilla 2.5.0 is live and it comes with webhook listener support to supercharge your workflows! 🚀

#AI #MachineLearning #Webhooks #TechUpdate

03.12.2024 10:45 👍 8 🔁 2 💬 1 📌 0
Preview
Open Preference Dataset for Text-to-Image Generation by the 🤗 Community We're on a journey to advance and democratize artificial intelligence through open source and open science.

Open Image Preferences is an Apache 2.0 licensed dataset for text-to-image generation by the @hf.co community. This dataset contains 10K text-to-image preference pairs across image generation categories, using different model families and prompt complexities.

Blog: huggingface.co/blog/image-p...

09.12.2024 15:30 👍 17 🔁 4 💬 1 📌 0
Post image

Open Image Preferences released! ๐Ÿš€

- Open-source dataset for text2image
- 10K samples manually evaluated by the HF community.
- Binarized format for SFT, DPO, or ORPO.

It comes with a nice blog post explaining the steps to pre-process and generate the data, along with the results.

09.12.2024 16:26 👍 5 🔁 1 💬 1 📌 0

Added!!

07.12.2024 09:17 👍 1 🔁 0 💬 1 📌 0

I'd love to, yes!!

06.12.2024 21:48 👍 1 🔁 0 💬 0 📌 0

thanks Pasquale, I remember you recommended the MMLU Redux paper when we started this project. I've been in charge of the human annotation / Argilla part and unfortunately didn't find the time to check this curation process

Could you share the pointer to the curated version to see what can be done?

06.12.2024 09:33 👍 1 🔁 0 💬 1 📌 0
Preview
CohereForAI/Global-MMLU · Datasets at Hugging Face We're on a journey to advance and democratize artificial intelligence through open source and open science.

Open dataset: huggingface.co/datasets/Coh...
Paper: arxiv.org/pdf/2412.03304

06.12.2024 08:59 👍 4 🔁 0 💬 0 📌 0
Post image

Announcing Global-MMLU - an improved MMLU Open dataset with evaluation coverage across 42 languages.

The result of months of work with the goal of advancing Multilingual LLM evaluation.

Built together with the community and amazing collaborators at Cohere4AI, MILA, MIT, and many more.

06.12.2024 08:59 👍 65 🔁 11 💬 4 📌 1
Preview
Language Lead sign-up At Hugging Face 🤗, we're launching a big community initiative to improve LLM training for many languages. We're looking for Language Leads to help us cultivate specific languages during this initiativ...

We're about to launch the biggest collaboration effort since the Open Assistant.

Let's get the highest-quality data for open foundation models with all the nuances & diversity of each language, all with data provenance and transparency.

Join us as language lead:
docs.google.com/forms/d/10XI...

03.12.2024 16:53 👍 7 🔁 3 💬 0 📌 0
Screenshot of a dashboard showing the number of languages with a lead and languages without a lead

Next week we're launching a collaborative annotation effort to build a big multilingual dataset, so you can have high-quality data in your language.

We are really close to getting leads for 100 languages! Can you help us cover the remaining 200?

03.12.2024 12:45 👍 15 🔁 4 💬 4 📌 0