Agentic tools like OpenClaw grow with every PR, but bigger codebases are harder for AI to understand and extend.
What if we kept the core tiny and let agents adapt themselves to user needs? No PR, just evolve.
Here is my take on the new DeepSeek-V3.2-Exp
erogol.substack.com/p/model-chec...
My post on MiMo-Audio
open.substack.com/pub/erogol/p...
🔥 Trained on 100M+ hours and shows emergent few-shot learning:
• Voice conversion
• Emotion transfer
• Speech translation
• Cross-modal reasoning
⚡ Key finding: speech follows the same scaling laws as text LLMs
Machine Learns #55 is out!
Full of new models… check it out
open.substack.com/pub/erogol/p...
Machine Learns #54 is out
open.substack.com/pub/erogol/p...
My breakdown of VibeVoice, the new open-weight TTS model from Microsoft.
open.substack.com/pub/erogol/p...
Microsoft released a TTS model… nice…
You can create long-form convos and podcasts with 4 distinct voices
huggingface.co/microsoft/Vi...
KyutaiTTS solved streaming text-to-speech with a state machine that generates audio word-by-word as text arrives.
220ms latency, 10-second voice cloning, 32 concurrent users on a single GPU.
No more waiting for complete sentences.
Full analysis: erogol.substack.com/p/model-chec...
This is such a great idea
claude is the best coding model
gemini causes frequent syntax errors
openai does not even understand the task at hand
lately spending some time with Diffusion LMs and working on a NanoGPT-style LLaDA model
so far I've not achieved results comparable to AR models, but it's a good start
github.com/erogol/BlaGP...
This work was done in collaboration with Jeff Clune's lab at UBC, and led by his PhD students Jenny Zhang and Shengran Hu, together with Cong Lu and Robert Lange.
Paper: arxiv.org/abs/2505.22954
Code: github.com/jennyzzt/dgm
⚡ Machine Learns issue 48 is out
🚀 dKV-Cache accelerates diffusion models by up to 10x
🔐 OpenAI's authentication play (think OAuth for AI)
🎯 PaTH Attention beats RoPE on long-context tasks
🤖 Humanoid robot fights became real
open.substack.com/pub/erogol/p...
Following the breadcrumbs, implemented PLE from Gemma3n.
It gave a significant performance boost and resulted in a new best model with almost no compute overhead.
github.com/erogol/BlaGPT
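For readers who haven't seen PLE: as I understand it, Per-Layer Embeddings give every transformer layer its own small embedding table, looked up by token id, projected up and added to that layer's hidden state. Here is a minimal sketch of that idea in PyTorch; the dims, module names, and the residual add are my assumptions, not code from BlaGPT or Gemma3n:

```python
import torch
import torch.nn as nn

class PerLayerEmbedding(nn.Module):
    """Sketch of Per-Layer Embeddings (PLE): each layer owns a small
    embedding table; its lookup is projected to model width and added
    to that layer's hidden state. Dims are illustrative."""

    def __init__(self, vocab_size: int, n_layers: int,
                 ple_dim: int = 64, model_dim: int = 512):
        super().__init__()
        self.tables = nn.ModuleList(
            nn.Embedding(vocab_size, ple_dim) for _ in range(n_layers))
        self.proj = nn.ModuleList(
            nn.Linear(ple_dim, model_dim, bias=False) for _ in range(n_layers))

    def forward(self, token_ids: torch.Tensor, layer_idx: int,
                h: torch.Tensor) -> torch.Tensor:
        # cheap per-layer signal: tiny lookup + projection, added residually
        return h + self.proj[layer_idx](self.tables[layer_idx](token_ids))
```

The cost is one small lookup and one thin projection per layer, which matches why the boost comes with almost no compute overhead.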
My paper notes on 2 new papers
- Model Merging in Pre-training of Large Language Models
- Do Not Let Low-Probability Tokens Over-Dominate in RL
open.substack.com/pub/erogol/p...
Muon really works. Got the best results in BlaGPT:
```
torchrun --standalone --nproc_per_node=8 train.py --run_name best_model --model_name best
```
github.com/erogol/BlaGPT
All code is available in BlaGPT if you want to check it out yourself!
github.com/erogol/BlaGPT
My results:
• Canon Layers definitely improved performance when placed before Attention/MLP blocks
• Softpick had worse validation loss but completely removed attention sinks
• Parallel blocks matched baseline performance but trained 15% faster
Parallel Transformer blocks run MLP and Attention in parallel instead of one after another.
So you get: z = x + MLP(x) + Attention(x)
PaLM models use this approach, which improves memory usage and speed without hurting performance.
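A minimal PyTorch sketch of a parallel block; the norm placement, head count, and MLP sizes are illustrative choices, not taken from PaLM or BlaGPT:

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Transformer block where Attention and MLP read the same input:
    z = x + Attention(norm(x)) + MLP(norm(x)), instead of chaining them."""

    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)  # one shared norm feeds both branches
        a, _ = self.attn(h, h, h, need_weights=False)
        return x + a + self.mlp(h)  # branches add up rather than chain
```

Because neither branch waits on the other, their matmuls can be fused or overlapped, which is where the speedup comes from.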
The Canon Layers paper shows they boost performance when added to transformer blocks.
They also help models without positional encoding work just as well as RoPE models.
Worth noting that RWKV used a similar idea years ago.
Canon Layers are basically causal 1D convolutions that mix the current hidden state with previous states (how many depends on the kernel size).
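A hedged sketch of such a layer in PyTorch; the kernel size and the residual form are my assumptions for illustration, not details from the paper:

```python
import torch
import torch.nn as nn

class CanonLayer(nn.Module):
    """Causal depthwise 1D conv: each token is mixed with the previous
    (kernel_size - 1) hidden states, never with future ones."""

    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        # pad by (kernel_size - 1) and trim the right side to stay causal
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              groups=dim, padding=kernel_size - 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        y = self.conv(x.transpose(1, 2))[..., : x.size(1)]
        return x + y.transpose(1, 2)  # residual mix of current + past states
```

The depthwise grouping keeps this cheap: each channel only convolves with its own past, so the layer adds a short local-history signal without a full mixing matrix.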
Softpick replaces regular softmax in attention blocks.
It allows zero values in the numerator and lets negative values contribute to the denominator.
This prevents attention sinks while keeping math properties similar to regular softmax.
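From that description, the rectified form can be sketched like this; this is my paraphrase of the Softpick formula softpick(x)_i = relu(exp(x_i) - 1) / (sum_j |exp(x_j) - 1| + eps), with max-subtraction added for numerical stability:

```python
import torch

def softpick(x: torch.Tensor, dim: int = -1, eps: float = 1e-6) -> torch.Tensor:
    """relu(exp(x) - 1) / (sum |exp(x) - 1| + eps).
    Entries with x <= 0 get exactly zero weight, but their magnitude
    still contributes to the denominator via the absolute value."""
    m = x.amax(dim=dim, keepdim=True)
    e = torch.exp(x - m)      # exp(x) scaled by exp(-m) for stability
    em = torch.exp(-m)        # the "-1" term under the same scaling
    num = torch.relu(e - em)  # exactly zero wherever x_i <= 0
    den = (e - em).abs().sum(dim=dim, keepdim=True)
    return num / (den + eps)
```

Unlike softmax, the outputs need not sum to 1, and a head that scores every key non-positively can output all zeros, which is what kills the attention sink.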
🧵 Here is a small thread with my notes about some of the recent Transformer papers.
- Softpick: an alternative to softmax in Attention
- Canon Layers: mixing states with conv1d
- Parallel Transformer blocks
Machine Learns #45, the no-fluff AI newsletter, is out!
I normally publish bi-weekly, but last week was full enough, so here we go
open.substack.com/pub/erogol/p...
Updated my LLM usage and cancelled my ChatGPT sub for now
Coding - Claude, Gemini 2.5
Reading papers - Claude
Research - Gemini 2.5
Daily - Gemini 2.5
Search - Gemini 2.5
Thanks :)
Machine Learns #44 is out!!
click for the no-fluff AI newsletter
erogol.substack.com/p/machine-le...
The next big thing is Brain-LLMs.
Imagine an LLM compressing all world knowledge, attached to your brain and ready to serve your thoughts and questions.
You'd also update it over the internet and pay for a subscription. I don't want to think about the ad business :)
"If these results generalize to real-world software tasks, extrapolation of this trend predicts that within 5 years, AI systems will be capable of automating many software tasks that currently take humans a month."
arxiv.org/abs/2503.14499
It's crazy that Gemma3 held up for only about three days