One useful lesson: Muon is a reasonable optimizer.
But obviously if you're hill climbing, you never get to the taller hill.
Spent 1h back and forth with ChatGPT trying to pinpoint a configuration issue. It guessed all manner of reasonable causes, none of which was the right one.
Spent 5 min doing a reverse image search of the error. Someone on the Web had the same issue. Instant fix.
Friendly laser fire
@funranium.bsky.social is this a 3 digit count or 4 digit count of swear words situation?
Which website do you use to generate this video?
@duckduckgo.com Is there a way to add 1password to your browser? I don't see extensions.
That strongly implies that Mistral's next step is TTS. In fact, other tokens corroborate it: while [AUDIO] likely indicates that speech tokens follow, [REF] might indicate a reference voice pattern to copy, and [OUTPUT_AUDIO] might start converting text to audio.
It also outputs a [word] token, which fits the [STREAMING_WORD] token found in Voxtral 2.
Why have that?
For text-to-speech: there, when the model knows it has finished outputting the audio for a word, it generates the [word] token, so that we can feed it the next word to say.
But there are a lot more tokens in there that are unexplained!
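To picture the speculated flow, here is a hypothetical token stream for a TTS turn. The token names come from the tokenizer, but the ordering and semantics are my guess, not documented behavior:

```python
# Hypothetical TTS token stream. [REF], [OUTPUT_AUDIO], and [word] are real
# tokenizer entries; how they are sequenced here is speculation.
tts_stream = [
    "[REF]",                       # reference voice pattern to copy (guess)
    "<reference speech tokens...>",
    "[OUTPUT_AUDIO]",              # start converting text to audio (guess)
    "Hello",                       # first word to say
    "<speech tokens for 'Hello'>",
    "[word]",                      # model: done speaking this word, feed the next
    "world",
    "<speech tokens for 'world'>",
    "[word]",
]
```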
To learn more, we can look at what inspired Mistral: the Kyutai Delayed Stream Modeling, arxiv.org/abs/2509.08753
It has the same delay design with the [pad] tokens.
Of course, the output does not contain exactly one word per text token, since the audio does not contain exactly one word per 80 ms chunk.
The trick? Look at those new tokens: when the model needs to wait before outputting a word, it outputs a [STREAMING_PAD] token.
4. The audio token embedding history + delay tokens go through a Transformer to output a speech token. This is why the delay is variable: it can be any multiple of 80ms.
5. The history of speech token embeddings goes through a Transformer to output a text token embedding → text token probs → text.
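The delayed-stream idea can be simulated as a toy loop: for each 80 ms chunk the model must emit exactly one text-side token, padding until the next word is ready. [STREAMING_PAD] is from the source; the `delay` mechanics below are my simplification, not Kyutai's actual code:

```python
# Toy delayed-stream decoder: one output token per 80 ms audio chunk.
PAD = "[STREAMING_PAD]"

def delayed_stream(word_end_chunks: list[tuple[str, int]], n_chunks: int,
                   delay: int = 2) -> list[str]:
    """word_end_chunks: (word, index of the chunk where its audio ends).
    Each word is emitted `delay` chunks after it ends; pads fill the rest."""
    ready = {end + delay: word for word, end in word_end_chunks}
    return [ready.get(t, PAD) for t in range(n_chunks)]

# "hello" ends at chunk 3, "world" at chunk 7; with a 2-chunk delay:
stream = delayed_stream([("hello", 3), ("world", 7)], n_chunks=10)
# -> pads everywhere except "hello" at index 5 and "world" at index 9
```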
Look at its architecture:
1. The audio is cut into 80 ms chunks, sampled at 16 kHz (16000 × 0.08 = 1280 floats each).
2. Each chunk is converted to a spectrogram,
3. A Whisper-style encoder (a convnet) converts it to an audio token embedding,
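The chunk arithmetic in step 1 can be sketched like this (a minimal illustration; the function name and the drop-the-tail choice are mine):

```python
import numpy as np

SAMPLE_RATE = 16_000    # 16 kHz
CHUNK_SECONDS = 0.080   # 80 ms per chunk
CHUNK_SAMPLES = int(SAMPLE_RATE * CHUNK_SECONDS)  # 1280 floats per chunk

def chunk_audio(audio: np.ndarray) -> np.ndarray:
    """Cut a mono waveform into consecutive 80 ms frames (tail is dropped)."""
    n_chunks = len(audio) // CHUNK_SAMPLES
    return audio[: n_chunks * CHUNK_SAMPLES].reshape(n_chunks, CHUNK_SAMPLES)

# One second of audio -> 12 full chunks of 1280 samples (640 samples dropped).
chunks = chunk_audio(np.zeros(SAMPLE_RATE, dtype=np.float32))
```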
Shoutout to Voxtral 2, which really feels unparalleled in quality.
The interesting bit is its ability to do realtime transcription.
How does it do that, with a variable delay?
Announcement: huggingface.co/zai-org/GLM-...
Find the graphs here: metabench.organisons.com
On Math, honestly, it is impressive how close it comes to GPT-OSS 20B and Gemini 3 Flash, even if it does not beat them.
All in all, one of the best local models out there. Architecturally, one of the most innovative.
Reasoning is another big purpose. Using MLA, it may be quite good at in-context reasoning on a large corpus, even locally.
But it won't be far above leading local models like Ministral 3. Meanwhile API models like Gemini 3 and DeepSeek will surpass it at the same price.
Where GLM-4.7 Flash shines is when you feed it enormous inputs.
That is typical of agentic coding tools. It’s on the Pareto frontier there.
Better than GPT-OSS 20B, cheaper and faster than Devstral Small 2.
What happened to Z.ai servers in December, for them to suddenly have a spiky boost in token throughput?!
One of my favorite findings: Positional embeddings are just training wheels. They help convergence but hurt long-context generalization.
We found that if you simply delete them after pretraining and recalibrate for <1% of the original budget, you unlock massive context windows. Smarter, not harder.
Phenomenal work.
I wonder about DroPE scaling laws: can it be executed after 4B pretraining tokens regardless of model size (and then the rest of pretraining does NoPE)? Or does it have to be done at the end of pretraining?
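For readers unfamiliar with the setup, here is a toy single-head attention that runs with or without RoPE; disabling the rotation is what "NoPE" means. This is an illustration of the concept, not the paper's DroPE implementation:

```python
import numpy as np

def attention(q, k, v, use_rope=True, base=10_000.0):
    """Single-head causal attention; use_rope=False is 'NoPE':
    positions are then only implicit via the causal mask."""
    T, d = q.shape
    if use_rope:
        # Rotate (x1, x2) dim pairs by position-dependent angles.
        half = d // 2
        pos = np.arange(T)[:, None]
        freqs = base ** (-np.arange(half) / half)
        ang = pos * freqs
        def rot(x):
            x1, x2 = x[:, :half], x[:, half:]
            return np.concatenate([x1 * np.cos(ang) - x2 * np.sin(ang),
                                   x1 * np.sin(ang) + x2 * np.cos(ang)], axis=1)
        q, k = rot(q), rot(k)
    scores = q @ k.T / np.sqrt(d)
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf  # causal mask
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ v
```

"Deleting" the positional embeddings after pretraining amounts to flipping `use_rope` to False and briefly recalibrating the weights.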
As always, find these comparisons at metabench.organisons.com
and the announcement at www.minimax.io/news/minimax...
Other metrics don't improve as much… but the M2 baseline was already quite good.
Keep in mind that this model is much faster than the others around it, clocking in at 100 tokens per second where similar models do 30 tokens/sec.
M2.1 from @MiniMax__AI has a welcome jump in agentic coding! It matches @Zai_org’s GLM-4.7 released yesterday, but at a lower cost.
As always, the full leaderboard is here: metabench.organisons.com
And the announcement: z.ai/blog/glm-4.7
Other metrics are good, but the improvement is more marginal, such as in raw agentic use (typical of customer service):
As often, code training improves math as well, where we see a very positive jump!
Impressive jump on agentic coding according to its benchmarks! Now on par with Claude Opus 4.1 (from 5 months ago!), K2 Thinking, and GPT-5.2 Codex, at a lower cost.
A bit overshadowed by DeepSeek, whose DSA mechanisms achieve great cost cuts.
Looking at raw data: OpenAI claims a score of 44% on Terminal-Bench 2.0 for GPT-5.2 Codex.
Mistral gives GPT-5.1 Codex, the predecessor, a score of 52.8%, and Tbench gives it 57.8%.
Google gives GPT-5.1 (non-Codex) 47.6%, and matches it in Gemini 3 Flash.
There are few benchmarks yet for @OpenAI’s fresh GPT 5.2 Codex model.
Initial benchmarks from the announcement imply a drop below Gemini 3 Flash in agentic coding. In fact, the performance seems close to DeepSeek V3.2 at a 50x price jump.
As usual, you can find the leaderboard here: metabench.organisons.com
and the model card: storage.googleapis.com/deepmind-med...
Raw agentic behaviour, typically used for customer support, is where it is the least competitive.
• On the high end, Claude Sonnet 4.5 edges it out.
• On the low end, Ministral 3 14B is cheaper for similar results.
Yet even there, it sits on the Pareto frontier.