We built code datasets, English datasets, and now it’s time for math! 🚀
Check out Anton’s thread to learn how we curated the best public math pre-training dataset.
Yeah it was recorded, I will share it when it’s public
Sharing my slides on "Synthetic data and smol models in 2024" from yesterday's Latent Space event at NeurIPS: docs.google.com/presentation...
- Synthetic data is everywhere
- Model collapse, is the web polluted?
- 3B+ models running on your iPhone
- When and why use smol models?
Another great talk at @latentspacepod.bsky.social NeurIPS: @loubnabnl.hf.co on Synthetic Data & Smol Models
For anyone interested in fine-tuning or aligning LLMs, I’m running this free and open course called smol course. It’s not a big deal, it’s just smol.
🧵>>
The amazing, new Qwen2.5-Coder 32B model can now write SQL for any @hf.co dataset ✨
We hit 1K ⭐ on our SmolLM repo—thank you! 🎉 New updates:
• SmolLM2 nanotron checkpoints (with optimizer states) for easier continual pre-training
• Local inference demos (MLC, Transformers.js, MLX, llama.cpp)
• SmolVLM: Vision-language model built on SmolLM2
github.com/huggingface/...
In this demo, Andi used SmolLM2 to summarize a long email, asked it follow-up questions, and then used it to rewrite his reply as a formal email: x.com/andi_marafio...
📬 Summarize and rewrite your text/emails faster, and offline!
Check @andimara.bsky.social's Smol Tools for summarization and rewriting. It uses SmolLM2 to summarize text and make it more friendly or professional, all running locally thanks to llama.cpp github.com/huggingface/...
WOW! 🤯 Language models are becoming smaller and more capable than ever! Here's SmolLM2 running 100% locally in-browser w/ WebGPU on a 6-year-old GPU. Just look at that speed! ⚡️😍
Powered by 🤗 Transformers.js and ONNX Runtime Web!
How many tokens/second do you get? Let me know! 👇
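For anyone wanting to report their number, here is a minimal sketch of how one could time a local decode loop to get tokens/second. The `generate` callable is a hypothetical stand-in for one decode step of a real model (Transformers.js, llama.cpp, etc.), not any specific runtime's API:

```python
import time

def tokens_per_second(generate, n_tokens: int) -> float:
    """Time n_tokens calls to a per-token decode step and return throughput."""
    start = time.perf_counter()
    for _ in range(n_tokens):
        generate()  # stand-in for one decode step of a real model
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Dummy decode step (a real one would run the model forward once):
rate = tokens_per_second(lambda: sum(range(1000)), n_tokens=200)
print(f"{rate:.0f} tokens/sec")
```

The same pattern works for any backend: swap the lambda for whatever produces one token.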
This demo of structured data extraction running on an LLM that executes entirely in the browser (Chrome only for the moment since it uses WebGPU) is amazing
My notes here: simonwillison.net/2024/Nov/29/...
Fuck it! Structured Generation w/ SmolLM2 running in browser & WebGPU 🔥
Powered by MLC Web-LLM & XGrammar ⚡
Define a JSON schema, Input free text, get structured data right in your browser - profit!!
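The demo uses grammar-constrained decoding (MLC Web-LLM + XGrammar) in the browser, which guarantees valid output during generation. As a stdlib-only sketch of the same idea, here is the after-the-fact version: define a flat schema, parse the model's text as JSON, and check conformance. The schema fields and the sample output are made up for illustration:

```python
import json

# Hypothetical schema for extracting a contact from free text.
SCHEMA = {"name": str, "age": int, "email": str}

def parse_structured(model_output: str, schema: dict) -> dict:
    """Parse model output as JSON and check it matches a flat type schema.

    Grammar-constrained decoding (e.g. XGrammar) enforces validity while
    generating; this sketch only validates after the fact."""
    data = json.loads(model_output)
    for key, typ in schema.items():
        if not isinstance(data.get(key), typ):
            raise ValueError(f"field {key!r} missing or not {typ.__name__}")
    return data

# Simulated model output for a prompt like "Extract the contact details":
out = parse_structured('{"name": "Ada", "age": 36, "email": "ada@example.com"}', SCHEMA)
print(out["name"])
```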
We’re looking for an intern to join our SmolLM team! If you’re excited about training LLMs and building high-quality datasets, we’d love to hear from you. 🤗
US: apply.workable.com/huggingface/...
EMEA: apply.workable.com/huggingface/...
Let's go! We are releasing SmolVLM, a smol 2B VLM built for on-device inference that outperforms all models at similar GPU RAM usage and token throughput.
SmolVLM can be fine-tuned on a Google Colab and run on a laptop! Or process millions of documents with a consumer GPU!
We use open Llama models to generate our new datasets, and refer users to the original licenses of the existing datasets.
[Screenshot: LightEval benchmarking results in a terminal]
Check out how easy it is to do LLM evals with LightEval!
* any dataset on the 🤗 Hub can become an eval task in a few lines of code: customize the prompt, metrics, parsing, few-shots, everything!
* model- and data-parallel inference
* auto batching with the new vLLM backend
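As a conceptual sketch (hypothetical names, not LightEval's actual API), these are the ingredients a custom task bundles: a prompt template with few-shot formatting, answer parsing, and a metric:

```python
# Conceptual sketch of what an eval task bundles: prompt template,
# few-shot formatting, and a metric. Names are illustrative only.

def build_prompt(question: str, few_shots: list[tuple[str, str]]) -> str:
    """Format few-shot examples followed by the target question."""
    shots = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in few_shots)
    return f"{shots}Q: {question}\nA:"

def exact_match(prediction: str, gold: str) -> float:
    """Score 1.0 when the normalized prediction equals the gold answer."""
    return float(prediction.strip().lower() == gold.strip().lower())

few_shots = [("What is 2+2?", "4")]
prompt = build_prompt("What is 3+3?", few_shots)
score = exact_match(" 6 ", "6")
```

In LightEval itself, these pieces are configured per task on top of any 🤗 Hub dataset.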
It's Sunday morning, so taking a minute for a nerdy thread (on math, tokenizers, and LLMs) about the work of our intern Garreth
By adding a few lines of code to the base Llama 3 tokenizer, he got a free boost in arithmetic performance 😮
[thread]
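The thread itself has the details, but as a hedged illustration of the general technique: base tokenizers often split long numbers inconsistently, and one common fix is to pre-split numbers into fixed-size, right-aligned digit groups before tokenization. A pure-regex sketch (the actual tokenizer change may differ):

```python
import re

def split_digits(text: str, group: int = 3) -> str:
    """Insert spaces so every number is split into right-aligned groups
    of at most `group` digits, e.g. '1234567' -> '1 234 567'.

    Illustrative only: the real change patches the tokenizer itself."""
    def _split(m: re.Match) -> str:
        digits = m.group(0)
        head = len(digits) % group or group  # size of the leftmost group
        parts = [digits[:head]]
        parts += [digits[i:i + group] for i in range(head, len(digits), group)]
        return " ".join(parts)
    return re.sub(r"\d+", _split, text)

print(split_digits("12345 + 678 = 13023"))  # -> "12 345 + 678 = 13 023"
```

Right-aligned grouping matters because it keeps place value consistent: the last group is always the ones/tens/hundreds.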
Making SmolLM2 more reproducible: open-sourcing our training & evaluation toolkit 🛠️ github.com/huggingface/...
Pre-training & evaluation code, synthetic data generation pipelines, post-training scripts, on-device tools & demos
Apache 2.0. V2 data mix coming soon!
Which tools should we add next?
Excited to announce the SFT dataset used for @huggingface.bsky.social SmolLM2!
The dataset for SmolLM2 was created by combining multiple existing datasets with new synthetic datasets generated using distilabel, including MagPie Ultra v1.0.
Check out the dataset:
huggingface.co/datasets/Hug...
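For context on how MagPie-style generation works (a hedged sketch of the idea, not distilabel's actual API): an aligned chat model is prompted with only the template up to the start of the user turn, so its completion *is* a synthetic user instruction. The template below is a Llama-3-style example for illustration:

```python
# Sketch of the MagPie idea: truncate the chat template right after the
# user header, so the model's completion becomes a synthetic instruction.

def magpie_prefix(system_prompt: str) -> str:
    """Build a Llama-3-style chat prefix ending at the open user turn."""
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_prompt}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
    )

prefix = magpie_prefix("You are a helpful assistant.")
# model.generate(prefix) would now complete with a user-style query,
# which becomes the instruction half of a synthetic SFT pair.
```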
What's the secret sauce of SmolLM2 to beat LLM titans like Llama 3.2 and Qwen2.5?
Unsurprisingly: data, data, data!
The SmolTalk dataset is open and available here: huggingface.co/datasets/Hug...