Quentin Lhoest 🤗

@lhoestq.hf.co

Datasets @ Hugging Face | Open Source + HF Dataset Hub

543 Followers, 54 Following, 6 Posts
Joined 22.11.2024

Latest posts by Quentin Lhoest 🤗 @lhoestq.hf.co

cc @julien.ledem.net the blog post is quite cool imo :)

25.07.2025 16:07 👍 0 🔁 0 💬 0 📌 0
Preview: Parquet Content-Defined Chunking. We're on a journey to advance and democratize artificial intelligence through open source and open science.

New blog post 🚨 Every data engineer should read it

@kszucs.bsky.social (Apache Arrow PMC member) announces how to drastically speed up Parquet file uploads and downloads via deduplication.

Best part: the feature enabling this is open source!
huggingface.co/blog/parquet...

25.07.2025 16:06 👍 0 🔁 0 💬 1 📌 0

It also speeds up file downloads and uploads, since you now only need to move the data that differs :)

find more about Xet here: huggingface.co/blog/xet-on-...

16.05.2025 15:38 👍 0 🔁 0 💬 0 📌 0

This writer outputs Parquet files that are robust to insertions/deletions/edits

Which means versioned datasets cost only a fraction of their original storage! 🔥🤯

e.g. if you store with Xet, which deduplicates files by chunk

cc @julien.ledem.net FYI

16.05.2025 15:38 👍 0 🔁 0 💬 1 📌 0
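
Editor's sketch: to make the "robust to insertions" claim above concrete, here is a minimal, standard-library illustration of content-defined chunking. It is not the algorithm PyArrow or Xet uses; the gear table, the mask, and the names `cdc_split`, `GEAR` and `MASK` are assumptions made up for this example. Boundaries are placed wherever a rolling hash of the most recent bytes matches a bit pattern, so an insertion only disturbs the chunks around it.

```python
import random

# Illustrative constants, not taken from PyArrow or Xet.
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]  # one random 32-bit value per byte
MASK = 0xFC000000  # cut when the top 6 bits of the hash are zero (about 1 position in 64)


def cdc_split(data: bytes) -> list:
    """Split data into chunks whose boundaries depend only on nearby content."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Gear rolling hash: after 32 shifts a byte's contribution leaves the
        # 32-bit state, so h only reflects the most recent 32 bytes.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if h & MASK == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk without a closing boundary
    return chunks


data_rng = random.Random(1)
original = bytes(data_rng.getrandbits(8) for _ in range(20_000))
insertion = b"new rows!!"
edited = original[:5_000] + insertion + original[5_000:]

stored = set(cdc_split(original))  # chunks a deduplicating store already holds
edited_chunks = cdc_split(edited)
new_chunks = [c for c in edited_chunks if c not in stored]
print(f"{len(edited_chunks)} chunks after the edit, {len(new_chunks)} of them new")
```

Because each cut decision depends on at most the last 32 bytes, only chunks touching the inserted bytes or the 32 bytes after them can differ; everything else is already in the store. A fixed-size split, by contrast, would shift every boundary after the insertion point. Production chunkers typically also enforce minimum and maximum chunk sizes, which this sketch leaves out.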

you can define it this way:

>>> import pyarrow.parquet as pq
>>> writer = pq.ParquetWriter(
... out, schema,
... use_content_defined_chunking=True,
... )

16.05.2025 15:38 👍 0 🔁 0 💬 1 📌 0

CDC Parquet writer is out in PyArrow nightlies 🔥🔥

$ pip install \
-i pypi.anaconda.org/scientific-p... \
"pyarrow>=21.0.0.dev0"

it's changing the way I view data versioning 👇

16.05.2025 15:38 👍 1 🔁 0 💬 1 📌 0

🤖 We are thrilled to announce AgiBot World, the first large-scale robotic learning dataset designed to advance multi-purpose humanoid policies!

Github:
github.com/OpenDriveLab...

HuggingFace:
huggingface.co/agibot-world

30.12.2024 10:48 👍 7 🔁 3 💬 0 📌 3

SuperCharged Euclid is on 🤗 Hugging Face

Also, this is the best paper heading I've seen in quite some time. The 'en tête' looks fantastic.

(⚡Llama 3.3) Chat with the paper: huggingface.co/spaces/hugg...
🤗 Model: huggingface.co/euclid-mult...
🤗 Dataset: huggingface.co/datasets/eu...

13.12.2024 17:51 👍 9 🔁 2 💬 0 📌 0

We outperform Llama 70B with Llama 3B on hard math by scaling test-time compute 🔥

How? By combining step-wise reward models with tree search algorithms :)

We're open sourcing the full recipe and sharing a detailed blog post 👇

16.12.2024 17:08 👍 109 🔁 21 💬 4 📌 1
SQL Console on Hugging Face with the AI Query overlay

In-place Assistants > Chat windows!
Hugging Face's integration of an "AI Query" overlay in their SQL console exemplifies this. Users input natural language, and the AI suggests SQL queries, streamlining data exploration. Probably the best showcase of this pattern in a freely accessible product.

05.12.2024 11:20 👍 2 🔁 1 💬 1 📌 0
Preview: Dataset Spreadsheets - a Hugging Face Space by lhoestq. Discover amazing ML apps made by the community.

Spreadsheet folk are welcome on the @hf.co hub too!

@lhoestq.hf.co
https://buff.ly/3VAEYKW

14.12.2024 11:00 👍 2 🔁 1 💬 1 📌 0

🚀 Introducing INCLUDE 🌍: A multilingual LLM evaluation benchmark spanning 44 languages!

Contains *newly-collected* data, prioritizing *regional knowledge*.
Setting the stage for truly global AI evaluation.
Ready to see how your model measures up?
#AI #Multilingual #LLM #NLProc

02.12.2024 15:52 👍 38 🔁 6 💬 1 📌 5
Preview: bluesky-community (Bluesky Community). Tools for Bluesky 🦋

The AT Protocol unlocks exciting possibilities:
- Building custom feeds using ML
- Creating dashboards for data exploration
- Developing custom models for Bluesky
To gather @bsky.app resources on @huggingface.bsky.social, I've established a community org 🤗 huggingface.co/bluesky-comm...

25.11.2024 15:59 👍 159 🔁 33 💬 10 📌 4
Preview: bluesky-community/one-million-bluesky-posts · Datasets at Hugging Face. We're on a journey to advance and democratize artificial intelligence through open source and open science.

First dataset for the new @huggingface.bsky.social @bsky.app community organisation: one-million-bluesky-posts 🦋

📊 1M public posts from Bluesky's firehose API
🔍 Includes text, metadata, and language predictions
🔬 Perfect to experiment with using ML for Bluesky 🤗

huggingface.co/datasets/blu...

26.11.2024 13:50 👍 528 🔁 73 💬 701 📌 467

Open Source Post Training is going strong! In the last 2 weeks, we got data or recipes released for OpenCoder, SmolLM-2, Orca Agent Instruct, and Tülu 3. Read it, learn, and iterate:

23.11.2024 07:45 👍 34 🔁 5 💬 1 📌 1