cc @julien.ledem.net the blog post is quite cool imo :)
New blog post 🚨 Every data engineer should read it
@kszucs.bsky.social (Apache Arrow PMC member) shows how to drastically speed up Parquet file uploads and downloads via deduplication.
Best part: the feature enabling this is open source!
huggingface.co/blog/parquet...
It also speeds up file downloads and uploads, since you now only need to move the data that differs :)
find more about Xet here: huggingface.co/blog/xet-on-...
This writer outputs Parquet files that are robust to insertions/deletions/edits
Which means versioned datasets cost only a fraction of their original storage! 🔥🤯
e.g. if you store with Xet, which deduplicates files by chunk
cc @julien.ledem.net FYI
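To make the chunk-level deduplication idea above concrete, here is a minimal pure-Python sketch of content-defined chunking. It is not the algorithm used by PyArrow or Xet: the gear table, the `cdc_chunks` and `fixed_chunks` helpers, and all size parameters are assumptions chosen for illustration. It cuts chunks wherever a rolling hash of the most recent bytes matches a pattern, so an insertion only disturbs nearby chunks, while fixed-size chunking shifts every later boundary.

```python
import hashlib
import random

# Hypothetical gear table: one pseudo-random 32-bit value per byte value.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def cdc_chunks(data, mask_bits=8, min_size=64, max_size=1024):
    """Split data where a rolling hash of recent bytes hits a pattern."""
    mask = (1 << mask_bits) - 1
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Gear-style rolling hash: older bytes are shifted out over time,
        # so the low bits depend only on the last few bytes of content.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_chunks(data, size=256):
    """Baseline: cut every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def reused(old_chunks, new_chunks):
    """Count new chunks whose hash already exists in the old version."""
    known = {hashlib.sha256(c).digest() for c in old_chunks}
    return sum(hashlib.sha256(c).digest() in known for c in new_chunks)


data_rng = random.Random(42)
original = bytes(data_rng.getrandbits(8) for _ in range(20_000))
edited = b"ten bytes!" + original  # simulate an insertion at the start

cdc_reused = reused(cdc_chunks(original), cdc_chunks(edited))
fixed_reused = reused(fixed_chunks(original), fixed_chunks(edited))
print(f"content-defined: {cdc_reused} chunks reused")
print(f"fixed-size:      {fixed_reused} chunks reused")
```

A store that keys chunks by hash only has to upload the chunks that were not reused, which is where the storage and transfer savings come from.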
you can define it this way:
>>> import pyarrow.parquet as pq
>>> # `out` is the output path or file object, `schema` a pyarrow.Schema
>>> writer = pq.ParquetWriter(
...     out, schema,
...     use_content_defined_chunking=True,
... )
CDC Parquet writer is out in PyArrow nightlies 🔥🔥
$ pip install \
-i pypi.anaconda.org/scientific-p... \
"pyarrow>=21.0.0.dev0"
it's changing the way I view data versioning
We are thrilled to announce AgiBot World, the first large-scale robotic learning dataset designed to advance multi-purpose humanoid policies!
Github:
github.com/OpenDriveLab...
HuggingFace:
huggingface.co/agibot-world
SuperCharged Euclid is on 🤗 Hugging Face
Also, this is the best paper heading I've seen in quite some time. The 'en tête' looks fantastic.
(⚡Llama 3.3) Chat with the paper: huggingface.co/spaces/hugg...
🤗 Model: huggingface.co/euclid-mult...
🤗 Dataset: huggingface.co/datasets/eu...
We outperform Llama 70B with Llama 3B on hard math by scaling test-time compute 🔥
How? By combining step-wise reward models with tree search algorithms :)
We're open sourcing the full recipe and sharing a detailed blog post
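As a rough illustration of pairing a step-wise reward with tree search, here is a toy beam search. It is not the recipe from the post: `beam_search`, `expand`, `score_step`, and `is_final` are hypothetical names, and the "reward model" is a hand-written function on a toy task (pick steps from 1 to 3 whose sum reaches 7). The point is only the control flow: every partial solution is scored step by step, and only the best few are kept for further expansion.

```python
import heapq


def beam_search(expand, score_step, is_final, beam_width=2, max_depth=6):
    """Keep the `beam_width` best partial solutions at each depth."""
    beams = [((), 0.0)]  # (partial solution, cumulative step reward)
    for _ in range(max_depth):
        candidates = []
        for path, total in beams:
            if is_final(path):
                # Finished solutions stay in the pool unchanged.
                candidates.append((path, total))
                continue
            for step in expand(path):
                new_path = path + (step,)
                # The step-wise reward scores each intermediate state.
                candidates.append((new_path, total + score_step(new_path)))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[1])
        if all(is_final(p) for p, _ in beams):
            break
    return beams[0]


TARGET = 7  # toy task: reach a sum of exactly 7


def expand(path):
    return (1, 2, 3)  # stand-in for sampling next steps from a model


def score_step(path):
    return -abs(TARGET - sum(path))  # stand-in for a learned reward model


def is_final(path):
    return sum(path) >= TARGET


best_path, best_score = beam_search(expand, score_step, is_final)
print(best_path, best_score)
```

In a real setup, `expand` would sample candidate reasoning steps from the language model and `score_step` would call a trained reward model; spending more compute means a wider beam or more samples per step.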
SQL Console on Hugging Face with the AI Query overlay
In-place Assistants > Chat windows!
Hugging Face's integration of an "AI Query" overlay in their SQL console exemplifies this. Users type a request in natural language and the AI suggests a SQL query, which streamlines data exploration. Probably the best showcase of this pattern in a freely accessible product.
Spreadsheet folk are welcome on the @hf.co hub too!
@lhoestq.hf.co
https://buff.ly/3VAEYKW
Introducing INCLUDE: A multilingual LLM evaluation benchmark spanning 44 languages!
Contains *newly-collected* data, prioritizing *regional knowledge*.
Setting the stage for truly global AI evaluation.
Ready to see how your model measures up?
#AI #Multilingual #LLM #NLProc
The AT Protocol unlocks exciting possibilities:
- Building custom feeds using ML
- Creating dashboards for data exploration
- Developing custom models for Bluesky
To gather @bsky.app resources on @huggingface.bsky.social, I've established a community org 🤗 huggingface.co/bluesky-comm...
First dataset for the new @huggingface.bsky.social @bsky.app community organisation: one-million-bluesky-posts 🦋
- 1M public posts from Bluesky's firehose API
- Includes text, metadata, and language predictions
- Perfect to experiment with using ML for Bluesky
huggingface.co/datasets/blu...
Open Source Post Training is going strong! In the last 2 weeks, we got data or recipes released for OpenCoder, SmolLM-2, Orca Agent Instruct, and Tülu 3. Read it, learn, and iterate: