We built code datasets, English datasets, and now it’s time for math! 🚀
Check out Anton’s thread to learn how we curated the best public math pre-training dataset.
Yeah it was recorded, I will share it when it’s public
Sharing my slides on "Synthetic data and smol models in 2024" from yesterday's Latent Space event at NeurIPS: docs.google.com/presentation...
- Synthetic data is everywhere
- Model collapse, is the web polluted?
- 3B+ models running on your iPhone
- When and why use smol models?
Another great talk at @latentspacepod.bsky.social NeurIPS: @loubnabnl.hf.co on Synthetic Data & Smol Models
For anyone interested in fine-tuning or aligning LLMs, I’m running this free and open course called smol course. It’s not a big deal, it’s just smol.
🧵>>
The amazing, new Qwen2.5-Coder 32B model can now write SQL for any @hf.co dataset ✨
We hit 1K ⭐ on our SmolLM repo—thank you! 🎉 New updates:
• SmolLM2 nanotron checkpoints (with optimizer states) for easier continual pre-training
• Local inference demos (MLC, Transformers.js, MLX, llama.cpp)
• SmolVLM: Vision-language model built on SmolLM2
github.com/huggingface/...
In this demo, Andi used SmolLM2 to summarize a long email, asked it follow-up questions, and then used it to rewrite his reply as a formal email: x.com/andi_marafio...
📬 Summarize and rewrite your text/emails faster, and offline!
Check @andimara.bsky.social's Smol Tools for summarization and rewriting. It uses SmolLM2 to summarize text and make it more friendly or professional, all running locally thanks to llama.cpp github.com/huggingface/...
WOW! 🤯 Language models are becoming smaller and more capable than ever! Here's SmolLM2 running 100% locally in-browser w/ WebGPU on a 6-year-old GPU. Just look at that speed! ⚡️😍
Powered by 🤗 Transformers.js and ONNX Runtime Web!
How many tokens/second do you get? Let me know! 👇
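For anyone wanting to report their number, here is a minimal sketch of how one could time a local decode loop to get tokens/second. The `generate` callable is a hypothetical stand-in for one decode step of a real model (Transformers.js, llama.cpp, etc.), not any specific runtime's API:

```python
import time

def tokens_per_second(generate, n_tokens: int) -> float:
    """Time n_tokens calls to a per-token decode step and return throughput."""
    start = time.perf_counter()
    for _ in range(n_tokens):
        generate()  # stand-in for one decode step of a real model
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Dummy decode step (a real one would run the model forward once):
rate = tokens_per_second(lambda: sum(range(1000)), n_tokens=200)
print(f"{rate:.0f} tokens/sec")
```

The same pattern works for any backend: swap the lambda for whatever produces one token.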
This demo of structured data extraction running on an LLM that executes entirely in the browser (Chrome only for the moment since it uses WebGPU) is amazing
My notes here: simonwillison.net/2024/Nov/29/...
Fuck it! Structured Generation w/ SmolLM2 running in browser & WebGPU 🔥
Powered by MLC Web-LLM & XGrammar ⚡
Define a JSON schema, Input free text, get structured data right in your browser - profit!!
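The demo uses grammar-constrained decoding (MLC Web-LLM + XGrammar) in the browser, which guarantees valid output during generation. As a stdlib-only sketch of the same idea, here is the after-the-fact version: define a flat schema, parse the model's text as JSON, and check conformance. The schema fields and the sample output are made up for illustration:

```python
import json

# Hypothetical schema for extracting a contact from free text.
SCHEMA = {"name": str, "age": int, "email": str}

def parse_structured(model_output: str, schema: dict) -> dict:
    """Parse model output as JSON and check it matches a flat type schema.

    Grammar-constrained decoding (e.g. XGrammar) enforces validity while
    generating; this sketch only validates after the fact."""
    data = json.loads(model_output)
    for key, typ in schema.items():
        if not isinstance(data.get(key), typ):
            raise ValueError(f"field {key!r} missing or not {typ.__name__}")
    return data

# Simulated model output for a prompt like "Extract the contact details":
out = parse_structured('{"name": "Ada", "age": 36, "email": "ada@example.com"}', SCHEMA)
print(out["name"])
```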
We’re looking for an intern to join our SmolLM team! If you’re excited about training LLMs and building high-quality datasets, we’d love to hear from you. 🤗
US: apply.workable.com/huggingface/...
EMEA: apply.workable.com/huggingface/...
Let's go! We are releasing SmolVLM, a smol 2B VLM built for on-device inference that outperforms all models at similar GPU RAM usage and token throughput.
SmolVLM can be fine-tuned on a Google Colab and run on a laptop! Or process millions of documents with a consumer GPU!
We use open Llama models to generate our new datasets, and refer users to the original licenses of the existing datasets.
[Screenshot: LightEval benchmarking results in a terminal]
Check out how easy it is to do LLM evals with LightEval!
* any dataset on the 🤗 Hub can become an eval task in a few lines of code: customize the prompt, metrics, parsing, few-shots, everything!
* model- and data-parallel inference
* auto batching with the new vLLM backend
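As a conceptual sketch (hypothetical names, not LightEval's actual API), these are the ingredients a custom task bundles: a prompt template with few-shot formatting, answer parsing, and a metric:

```python
# Conceptual sketch of what an eval task bundles: prompt template,
# few-shot formatting, and a metric. Names are illustrative only.

def build_prompt(question: str, few_shots: list[tuple[str, str]]) -> str:
    """Format few-shot examples followed by the target question."""
    shots = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in few_shots)
    return f"{shots}Q: {question}\nA:"

def exact_match(prediction: str, gold: str) -> float:
    """Score 1.0 when the normalized prediction equals the gold answer."""
    return float(prediction.strip().lower() == gold.strip().lower())

few_shots = [("What is 2+2?", "4")]
prompt = build_prompt("What is 3+3?", few_shots)
score = exact_match(" 6 ", "6")
```

In LightEval itself, these pieces are configured per task on top of any 🤗 Hub dataset.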
It's Sunday morning, so taking a minute for a nerdy thread (on math, tokenizers, and LLMs) about the work of our intern Garreth
By adding a few lines of code to the base Llama 3 tokenizer, he got a free boost in arithmetic performance 😮
[thread]
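The thread itself has the details, but as a hedged illustration of the general technique: base tokenizers often split long numbers inconsistently, and one common fix is to pre-split numbers into fixed-size, right-aligned digit groups before tokenization. A pure-regex sketch (the actual tokenizer change may differ):

```python
import re

def split_digits(text: str, group: int = 3) -> str:
    """Insert spaces so every number is split into right-aligned groups
    of at most `group` digits, e.g. '1234567' -> '1 234 567'.

    Illustrative only: the real change patches the tokenizer itself."""
    def _split(m: re.Match) -> str:
        digits = m.group(0)
        head = len(digits) % group or group  # size of the leftmost group
        parts = [digits[:head]]
        parts += [digits[i:i + group] for i in range(head, len(digits), group)]
        return " ".join(parts)
    return re.sub(r"\d+", _split, text)

print(split_digits("12345 + 678 = 13023"))  # -> "12 345 + 678 = 13 023"
```

Right-aligned grouping matters because it keeps place value consistent: the last group is always the ones/tens/hundreds.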
Making SmolLM2 more reproducible: open-sourcing our training & evaluation toolkit 🛠️ github.com/huggingface/...
Pre-training & evaluation code, synthetic data generation pipelines, post-training scripts, on-device tools & demos
Apache 2.0. V2 data mix coming soon!
Which tools should we add next?
Excited to announce the SFT dataset used for @huggingface.bsky.social SmolLM2!
The dataset for SmolLM2 was created by combining multiple existing datasets with new synthetic datasets generated using distilabel, including MagPie Ultra v1.0.
Check out the dataset:
huggingface.co/datasets/Hug...
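For context on how MagPie-style generation works (a hedged sketch of the idea, not distilabel's actual API): an aligned chat model is prompted with only the template up to the start of the user turn, so its completion *is* a synthetic user instruction. The template below is a Llama-3-style example for illustration:

```python
# Sketch of the MagPie idea: truncate the chat template right after the
# user header, so the model's completion becomes a synthetic instruction.

def magpie_prefix(system_prompt: str) -> str:
    """Build a Llama-3-style chat prefix ending at the open user turn."""
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_prompt}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
    )

prefix = magpie_prefix("You are a helpful assistant.")
# model.generate(prefix) would now complete with a user-style query,
# which becomes the instruction half of a synthetic SFT pair.
```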
What's the secret sauce of SmolLM2 to beat LLM titans like Llama 3.2 and Qwen2.5?
Unsurprisingly: data, data, data!
The SmolTalk dataset is open and available here: huggingface.co/datasets/Hug...