Definitely worth checking out, the authors have done very well here! I'm especially interested in seeing more 'context' models, they're very novel.
The models are all MIT Licensed, i.e. commercially viable, and supported with Sentence Transformers, Text Embedding Inference, Transformers.js, etc.
🧵
The models have been evaluated on various benchmarks like MMTEB, MTEB(Code), MIRACL, BERGEN, ToolRet and ConTEB (for the context model), where they perform very well for their sizes.
🧵
They then turned this strategy into 4 models:
- 2 sizes: 0.6B and 4B parameters
- 2 types:
pplx-embed-v1 for dense embeddings,
pplx-embed-context-v1 for contextual dense embeddings that are computed over entire documents at once: each chunk embedding contains global document information!
🧵
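The general idea behind contextual chunk embeddings can be sketched in a few lines. This is a hypothetical stand-in (random vectors instead of a real encoder, simple mean pooling), not Perplexity's actual implementation: the whole document is encoded in one pass, then each chunk embedding is pooled from its own token span, so every chunk vector is informed by full-document context.

```python
import numpy as np

rng = np.random.default_rng(0)

num_tokens, dim = 12, 8
# Stand-in for the encoder output over the *entire* document.
# With a bidirectional encoder, every token vector already "sees"
# the whole document before pooling.
token_vectors = rng.normal(size=(num_tokens, dim))

# Chunk boundaries as (start, end) token offsets.
chunks = [(0, 4), (4, 9), (9, 12)]

# Pool each chunk from its own token span.
chunk_embeddings = np.stack(
    [token_vectors[start:end].mean(axis=0) for start, end in chunks]
)
print(chunk_embeddings.shape)  # (3, 8)
```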
They first performed diffusion-style pretraining on Qwen3 to turn it into a bidirectional model. This allows every token to attend to every other token, even 'future' tokens further in the same text. Causal models (like most decoders) can only look at previous tokens.
🧵
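The difference between the two attention patterns is easy to picture as masks. A minimal illustration (conceptual only; turning a causal model bidirectional requires the retraining described above, not just swapping the mask):

```python
import numpy as np

seq_len = 4

# Causal mask: token i may only attend to tokens j <= i
# (lower-triangular), so 'future' tokens are hidden.
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Bidirectional mask: every token attends to every other token,
# including tokens later in the text.
bidirectional = np.ones((seq_len, seq_len), dtype=bool)

print(causal.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```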
The models & paper: huggingface.co/collections/...
🧵
🤗 Perplexity has released 4 open-weight state-of-the-art multilingual embedding models designed for retrieval tasks!
pplx-embed-v1 and pplx-embed-context-v1
Specifically trained for int8 and binary embeddings, they'll be viable for massive search problems.
Details in 🧵
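For intuition, here's a simplified numpy sketch of what int8 and binary quantization of embeddings look like. This is illustrative only: real int8 calibration typically uses per-dimension ranges learned from a calibration set (Sentence Transformers ships this as `quantize_embeddings`), and this naive min/max version is just the shape of the idea.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(2, 8)).astype(np.float32)

# Binary: keep only the sign of each dimension, packed 8 dims per byte.
# 8 float32 dims (32 bytes) become 1 byte: a 32x size reduction.
binary = np.packbits((emb > 0).astype(np.uint8), axis=1)

# int8: linearly rescale values into [-127, 127] (naive global min/max).
lo, hi = emb.min(), emb.max()
int8 = np.round((emb - lo) / (hi - lo) * 254 - 127).astype(np.int8)

print(binary.shape, int8.dtype)  # (2, 1) int8
```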
I've collaborated quite closely with the PyLate authors over the last months, as PyLate relies heavily on Sentence Transformers. This is very strong work, definitely worth checking out!
Kudos to @nohtow.bsky.social, Luca Arnaboldi, @amelietabatta.bsky.social and @krzakalaf.bsky.social.
All models, including intermediate checkpoints for every training phase and configuration, are released under Apache 2.0. The flagship, lightonai/ColBERT-Zero, is the new state-of-the-art late interaction model.
🧵
Luckily, skipping the expensive unsupervised phase and simply adding a supervised contrastive step before distillation reaches 55.12 nDCG@10, which is 99.4% of ColBERT-Zero's performance at roughly 10x lower compute cost (40 vs 408 GH200-hours).
🧵
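A supervised contrastive step of this kind is usually some variant of InfoNCE with in-batch negatives. A minimal numpy sketch of that standard objective (the general shape of such a step, not LightOn's exact loss or hyperparameters):

```python
import numpy as np

def info_nce(queries, docs, temperature=0.05):
    """InfoNCE with in-batch negatives: row i of `queries` is a
    positive pair with row i of `docs`; every other doc in the
    batch acts as a negative."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    logits = q @ d.T / temperature  # (batch, batch) cosine similarities
    # Log-softmax over docs; positives sit on the diagonal.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
q, d = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
loss = info_nce(q, d)
```

Identical query/doc pairs drive the loss toward zero, while random pairs leave it high, which is exactly the gradient signal pulling positives together and pushing in-batch negatives apart.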
By running all contrastive pre-training phases directly in the multi-vector setting, via PyLate, LightOn could outperform the standard approach.
🧵
The key insight behind ColBERT-Zero is that the standard recipe for training ColBERT models, taking a strong dense model and bolting on a small knowledge distillation step, leaves a lot of performance on the table.
🧵
Check out the models and paper here: huggingface.co/collections/...
🧵
Give the detailed blogpost a read: huggingface.co/blog/lighton...
🧵
LightOn is back with a SOTA late-interaction model for search: ColBERT-Zero!
By performing contrastive pre-training directly in the multi-vector setting, it outperforms GTE-ModernColBERT and other late-interaction models on BEIR, using only public data and reaching 55.43 nDCG@10.
Details in 🧵
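For anyone new to late interaction: ColBERT-style models keep one vector per token and score with MaxSim — each query token is matched against its best document token, and the per-token maxima are summed. A minimal numpy sketch of that scoring function:

```python
import numpy as np

def maxsim(query_vecs, doc_vecs):
    """MaxSim late-interaction score.
    query_vecs: (num_query_tokens, dim), doc_vecs: (num_doc_tokens, dim)."""
    sims = query_vecs @ doc_vecs.T  # all query-token x doc-token similarities
    # For each query token, take its best-matching doc token, then sum.
    return sims.max(axis=1).sum()

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8))   # 3 query token vectors
d = rng.normal(size=(10, 8))  # 10 document token vectors
score = maxsim(q, d)
```

This token-level matching is what the multi-vector contrastive pre-training optimizes directly, instead of first training a single-vector dense model.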
ggml / llama.cpp are joining @hf.co, ensuring they'll stay open, maintained, and up to date for a long, long time!
huggingface.co/blog/ggml-jo...
Great work by the Jina team. The paper is also extremely interesting, using a lot of different losses and providing valuable ablations. If you're into training embedding models, definitely give it a read.
huggingface.co/papers/2602....
The only downside is that the models are licensed under cc-by-nc-4.0, so you'll have to contact Jina if you'd like to use them commercially.
🧵
The models each run with Sentence Transformers, Transformers, Jina's API, Text Embedding Inference, vLLM, Llama.cpp, and MLX. Super useful!
🧵
Beyond the two models with multiple task adapters, you can also directly load the model with one of the adapters applied, e.g. 'jinaai/jina-embeddings-v5-text-small-retrieval'.
This is especially nice if you want to avoid 'trust_remote_code'.
🧵
The models are also competitive on English only, performing very well for their sizes. You love to see it.
🧵
Multilingual Retrieval performance:
jina-v5-text-small outperforms Qwen3-Embedding-0.6B for effectively the same model size, and reaches much higher scores than any other model at <1B parameters.
jina-v5-text-nano also outperforms everything up to twice its parameter size.
🧵
Both models were trained and evaluated on numerous languages, and so they're strong new multilingual options.
They're also trained using a clever adapter-switching system. You can select either retrieval, text-matching, classification, or clustering, depending on your task.
🧵
jina-embeddings-v5-text-nano:
- 239M parameters, 8k sequence length, 768 dimensionality
- The embeddings can be truncated to 32, 64, 128, 256, 512, 768 via its Matryoshka support
- Base model is EuroBERT/EuroBERT-210m
🧵
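Matryoshka truncation itself is simple: keep only the first k dimensions and re-normalize. A quick numpy sketch of the idea (in practice Sentence Transformers exposes this via the `truncate_dim` argument, so you wouldn't do it by hand):

```python
import numpy as np

def truncate(emb, k):
    """Matryoshka-style truncation: slice to the first k dims,
    then re-normalize so cosine similarity still behaves."""
    t = emb[:, :k]
    return t / np.linalg.norm(t, axis=1, keepdims=True)

rng = np.random.default_rng(0)
emb = rng.normal(size=(2, 768))   # full-size embeddings, e.g. the nano model
small = truncate(emb, 128)        # 6x smaller vectors
print(small.shape)  # (2, 128)
```

The Matryoshka training objective is what makes the leading dimensions carry most of the signal, so the truncated vectors stay useful for retrieval at a fraction of the storage cost.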
jina-embeddings-v5-text-small:
- 677M parameters, 32k sequence length, 1024 dimensionality
- The embeddings can be truncated to 32, 64, 128, 256, 512, 768, 1024 via its Matryoshka support
- Base model is Qwen/Qwen3-0.6B-Base
🧵
Check out the models here: huggingface.co/collections/...
🧵
Jina AI is back with new state-of-the-art multilingual embedding models for retrieval & more:
jina-embeddings-v5-text!
In 2 efficient sizes (239M & 677M), they outperform Qwen3-Embedding, EmbeddingGemma-300m, multilingual-e5-large, etc.
Details in 🧵
More embedding models and an even more reliable inference engine are what you get with @hf.co Text Embeddings Inference v1.9.0 🔥
More in the thread 🧵
More details in the release notes: github.com/huggingface/...
Transformers v5.2 updated some behind-the-scenes methods for its Trainer that Sentence Transformers relies on for logging metrics.
So, if you update to Transformers v5.2 with an older Sentence Transformers version, you'll encounter crashes when a metric is logged.
🧵