
Tom Aarsen

@tomaarsen.com

Sentence Transformers, SetFit & NLTK maintainer. Machine Learning Engineer at 🤗 Hugging Face.

2,582
Followers
202
Following
414
Posts
14.11.2024
Joined

Latest posts by Tom Aarsen @tomaarsen.com

Definitely worth checking out, the authors have done very well here! I'm especially interested in seeing more 'context' models, they're very novel.

27.02.2026 14:35 👍 0 🔁 0 💬 0 📌 0
Post image

The models are all MIT-licensed, i.e. commercially viable, and supported in Sentence Transformers, Text Embeddings Inference, Transformers.js, etc.

🧵

27.02.2026 14:35 👍 1 🔁 0 💬 1 📌 0
Post image

The models have been evaluated on various benchmarks like MMTEB, MTEB(Code), MIRACL, BERGEN, ToolRet and ConTEB (for the context model), where they perform very well for their sizes.

🧵

27.02.2026 14:35 👍 0 🔁 0 💬 1 📌 0
Post image

They then turned this strategy into 4 models:
- 2 sizes: 0.6B and 4B parameters
- 2 types:
pplx-embed-v1 for dense embeddings,
pplx-embed-context-v1 for contextual dense embeddings computed over entire documents at once: each chunk's embedding carries global document information!

🧵

27.02.2026 14:35 👍 1 🔁 0 💬 2 📌 0
Post image

They first performed diffusion-style pretraining on Qwen3 to turn it into a bidirectional model. This allows every token to attend to every other token, even 'future' tokens further in the same text. Causal models (like most decoders) can only look at previous tokens.
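As a toy illustration of the causal-vs-bidirectional distinction (my own sketch, not anything from the paper):

```python
import numpy as np

# Attention masks for a 4-token sequence. A causal decoder masks out
# 'future' positions, while a bidirectional encoder lets every token
# attend to every other token.
n = 4
causal_mask = np.tril(np.ones((n, n), dtype=bool))  # lower triangle only
bidirectional_mask = np.ones((n, n), dtype=bool)    # full attention

# In the causal model, token 0 sees only itself; in the bidirectional
# model it also sees tokens 1..3, so its representation can reflect
# the whole text.
print(causal_mask[0])         # first token's visibility, causal
print(bidirectional_mask[0])  # first token's visibility, bidirectional
```
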

🧵

27.02.2026 14:35 👍 0 🔁 0 💬 1 📌 0
Preview
pplx-embed - a perplexity-ai Collection Diffusion-Pretrained Dense and Contextual Embeddings

The models & paper: huggingface.co/collections/...

🧵

27.02.2026 14:35 👍 0 🔁 0 💬 1 📌 0

🤗 Perplexity has released 4 open-weight state-of-the-art multilingual embedding models designed for retrieval tasks!

pplx-embed-v1 and pplx-embed-context-v1

Specifically trained for int8 and binary embeddings, they'll be viable for massive search problems.
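To see why int8 and binary embeddings matter for massive search, here is a generic NumPy sketch of the standard quantization tricks (my own illustration, not Perplexity's code; all names are made up):

```python
import numpy as np

# A float32 embedding of 1024 dims takes 4096 bytes; binarized it
# takes 128 bytes, a 32x reduction in index size.
rng = np.random.default_rng(0)
emb = rng.standard_normal((2, 1024)).astype(np.float32)

# Binary quantization: keep only the sign of each dimension and pack
# 8 dimensions into each byte.
binary = np.packbits(emb > 0, axis=1)  # shape (2, 128), dtype uint8

# Retrieval over binary vectors uses Hamming distance, which is just
# XOR + popcount and extremely fast.
hamming = np.unpackbits(binary[0] ^ binary[1]).sum()

# int8 quantization: scale values into [-127, 127] (4x smaller than
# float32, with most of the accuracy retained).
scale = 127 / np.abs(emb).max()
int8 = np.clip(np.round(emb * scale), -127, 127).astype(np.int8)

print(emb.nbytes, binary.nbytes)  # 8192 vs 256 bytes
```
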

Details in 🧵

27.02.2026 14:35 👍 18 🔁 1 💬 1 📌 0

I've collaborated quite closely with the PyLate authors over the last months, as PyLate relies heavily on Sentence Transformers. This is very strong work, definitely worth checking out!

Kudos to @nohtow.bsky.social, Luca Arnaboldi, @amelietabatta.bsky.social and @krzakalaf.bsky.social.

23.02.2026 14:20 👍 0 🔁 0 💬 0 📌 0
Post image

All models, including intermediate checkpoints for every training phase and configuration, are released under Apache 2.0. The strongest model, lightonai/ColBERT-Zero, is now the strongest late-interaction model.

🧵

23.02.2026 14:20 👍 0 🔁 0 💬 1 📌 0
Post image

Luckily, skipping the expensive unsupervised phase and simply adding a supervised contrastive step before distillation reaches 55.12 nDCG@10, which is 99.4% of ColBERT-Zero's performance at roughly 10x lower compute cost (40 vs 408 GH200-hours).
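A quick sanity check of those numbers:

```python
# Figures from the post: the cheaper supervised-contrastive recipe
# keeps almost all of ColBERT-Zero's quality at a fraction of the compute.
cheap_ndcg, full_ndcg = 55.12, 55.43    # nDCG@10
cheap_hours, full_hours = 40, 408       # GH200-hours

performance_kept = 100 * cheap_ndcg / full_ndcg
compute_ratio = full_hours / cheap_hours

print(round(performance_kept, 1))  # 99.4 (% of full performance)
print(round(compute_ratio, 1))     # 10.2 (x less compute)
```
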

🧵

23.02.2026 14:20 👍 0 🔁 0 💬 1 📌 0

By running all contrastive pre-training phases directly in the multi-vector setting, via PyLate, LightOn could outperform the standard approach.
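For readers unfamiliar with the multi-vector setting, here is a minimal NumPy sketch of ColBERT-style MaxSim scoring (a generic illustration, not PyLate's or LightOn's code):

```python
import numpy as np

# In multi-vector ("late interaction") retrieval, each text is a matrix
# of per-token vectors instead of a single pooled vector. Relevance is
# the sum, over query tokens, of each token's best match among the
# document tokens (MaxSim).
rng = np.random.default_rng(0)
query_vecs = rng.standard_normal((5, 128))  # 5 query token vectors
doc_vecs = rng.standard_normal((40, 128))   # 40 document token vectors

sim = query_vecs @ doc_vecs.T               # (5, 40) token-token scores
maxsim_score = sim.max(axis=1).sum()        # best doc token per query token

# Contrastive pre-training directly on this score, rather than only
# distilling from a dense single-vector teacher, is the recipe change
# described in the thread.
```
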

🧵

23.02.2026 14:20 👍 0 🔁 0 💬 1 📌 0
Post image

The key insight behind ColBERT-Zero is that the standard recipe for training ColBERT models, taking a strong dense model and bolting on a small knowledge distillation step, leaves a lot of performance on the table.

🧵

23.02.2026 14:20 👍 0 🔁 0 💬 1 📌 0
Preview
ColBERT-Zero ๐Ÿถ - a lightonai Collection First large-scale fully pre-trained ColBERT model using only public data, outperforming GTE-ModernColBERT and GTE-ModernBERT

Check out the models and paper here: huggingface.co/collections/...

🧵

23.02.2026 14:20 👍 1 🔁 0 💬 1 📌 0
Preview
ColBERT-Zero: To Pre-train Or Not To Pre-train ColBERT models? A Blog post by LightOn AI on Hugging Face

Give the detailed blogpost a read: huggingface.co/blog/lighton...

🧵

23.02.2026 14:20 👍 2 🔁 0 💬 1 📌 0
Post image

🚀 LightOn is back with a SOTA late-interaction model for search: ColBERT-Zero!

By performing contrastive pre-training directly in the multi-vector setting, it outperforms GTE-ModernColBERT etc. on BEIR, using only public data and reaching 55.43 nDCG@10.

Details in 🧵

23.02.2026 14:20 👍 11 🔁 0 💬 1 📌 0
Post image

ggml / llama.cpp are joining @hf.co, ensuring it'll stay open, maintained, and up to date for a long, long time! 🚀

huggingface.co/blog/ggml-jo...

20.02.2026 14:55 👍 9 🔁 0 💬 1 📌 0
Preview
Paper page - jina-embeddings-v5-text: Task-Targeted Embedding Distillation Join the discussion on this paper page

Great work by the Jina team. The paper is also extremely interesting, using a lot of different losses and providing valuable ablations. If you're into training embedding models, definitely give it a read.

huggingface.co/papers/2602....

19.02.2026 14:54 👍 1 🔁 0 💬 0 📌 0

The only downside is that the models are licensed under cc-by-nc-4.0, so you'll have to contact Jina if you'd like to use them commercially.

🧵

19.02.2026 14:54 👍 0 🔁 0 💬 1 📌 0
Post image

The models all run with Sentence Transformers, Transformers, Jina's API, Text Embeddings Inference, vLLM, llama.cpp, and MLX. Super useful!

🧵

19.02.2026 14:54 👍 0 🔁 0 💬 1 📌 0
Post image

Beyond the two models with multiple task adapters, you can also directly load the model with one of the adapters applied, e.g. 'jinaai/jina-embeddings-v5-text-small-retrieval'.
This is especially nice if you want to avoid 'trust_remote_code'.

🧵

19.02.2026 14:54 👍 0 🔁 0 💬 1 📌 0
Post image

The models are also competitive on English-only benchmarks, performing very well for their sizes. You love to see it.

🧵

19.02.2026 14:54 👍 0 🔁 0 💬 1 📌 0
Post image

Multilingual Retrieval performance:
jina-v5-text-small outperforms Qwen3-Embedding-0.6B for effectively the same model size, and reaches much higher scores than any other model at <1B parameters.

jina-v5-text-nano also outperforms everything up to twice its parameter size.

🧵

19.02.2026 14:54 👍 0 🔁 0 💬 1 📌 0
Post image

Both models were trained and evaluated on numerous languages, and so they're strong new multilingual options.
They're also trained using a clever adapter-switching system. You can select either retrieval, text-matching, classification, or clustering, depending on your task.

🧵

19.02.2026 14:54 👍 1 🔁 0 💬 1 📌 0
Post image

jina-embeddings-v5-text-nano:

- 239M parameters, 8k sequence length, 768 dimensionality
- The embeddings can be truncated to 32, 64, 128, 256, 512, 768 via its Matryoshka support
- Base model is EuroBERT/EuroBERT-210m
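As a generic illustration of how Matryoshka truncation works (not Jina's internals; this assumes the usual truncate-then-renormalize recipe):

```python
import numpy as np

# A Matryoshka-trained model front-loads information into the earliest
# dimensions, so you can keep only the first k dims and re-normalize,
# and cosine similarity still behaves sensibly.
rng = np.random.default_rng(0)
emb = rng.standard_normal(768)  # stand-in for a full 768-dim embedding

for k in (32, 64, 128, 256, 512, 768):
    small = emb[:k]                         # truncate to k dimensions
    small = small / np.linalg.norm(small)   # re-normalize to unit length
    assert small.shape == (k,)
    assert abs(np.linalg.norm(small) - 1.0) < 1e-6
```
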

🧵

19.02.2026 14:54 👍 2 🔁 0 💬 1 📌 0
Post image

jina-embeddings-v5-text-small:

- 677M parameters, 32k sequence length, 1024 dimensionality
- The embeddings can be truncated to 32, 64, 128, 256, 512, 768, 1024 via its Matryoshka support
- Base model is Qwen/Qwen3-0.6B-Base

🧵

19.02.2026 14:54 👍 1 🔁 0 💬 1 📌 0
Preview
jina-embeddings-v5-text - a jinaai Collection Our 5th-gen embeddings: two lightweight multilingual models with SOTA performance in retrieval, matching, clustering, and classification.

Check out the models here: huggingface.co/collections/...

🧵

19.02.2026 14:54 👍 2 🔁 0 💬 1 📌 0
Post image

๐Ÿ‘ Jina AI is back with new state-of-the-art multilingual embedding models for retrieval & more:

jina-embedding-v5-text!

2 efficient sizes, 239M & 677M, they outperform Qwen3-embedding, EmbeddingGemma-300m, multilingual-e5-large, etc.

Details in 🧵

19.02.2026 14:54 👍 6 🔁 0 💬 1 📌 1
Post image

More embedding models and an even more reliable inference engine are what you get with @hf.co Text Embeddings Inference v1.9.0 💥

More in the thread 🧵

17.02.2026 16:05 👍 3 🔁 3 💬 1 📌 0
Preview
Release v5.2.3 - Compatibility with Transformers v5.2 training ยท huggingface/sentence-transformers This patch release introduces compatibility with Transformers v5.2. Install this version with # Training + Inference pip install sentence-transformers[train]==5.2.3 # Inference only, use one of: p...

More details in the release notes: github.com/huggingface/...

17.02.2026 14:13 👍 0 🔁 0 💬 0 📌 0

Transformers v5.2 updated some behind-the-scenes methods of its Trainer that Sentence Transformers relies on for logging metrics.

So, if you update to Transformers v5.2 with an older Sentence Transformers version, you'll encounter crashes when a metric is logged.

🧵

17.02.2026 14:13 👍 0 🔁 0 💬 1 📌 0