Recorded a podcast, think it's pretty good and comprehensive, hope you like it ;) youtu.be/EV7WhVT270Q?...
@sebastianraschka.com
ML/AI researcher & former stats professor turned LLM research engineer. Author of "Build a Large Language Model From Scratch" (https://amzn.to/4fqvn0D) & its reasoning follow-up (https://mng.bz/Nwr7). Also blogging about AI research at magazine.sebastianraschka.com.
Been a while since I did an LLM architecture post. Just stumbled upon the Arcee AI Trinity Large release and its technical report, published yesterday, and couldn't resist :)
Also added a new section to my LLM architecture comparison article with more details: magazine.sebastianraschka.com/i/168650848/20
Been pretty heads-down finishing Chapter 6 on implementing RLVR via GRPO. Just finished, and it might be my favorite chapter so far.
Code notebook: github.com/rasbt/reason...
(And it should be added to the early access soon.)
The next chapter adds stability and performance improvements to GRPO.
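For anyone curious what the core of GRPO boils down to, here is a minimal sketch of the group-relative advantage computation (my own illustrative code, not taken from the chapter; the function name and epsilon value are assumptions):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    # GRPO skips the learned critic of PPO: each sampled completion's reward
    # is normalized against the mean/std of its own group of samples.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# One prompt, four sampled completions with verifiable 0/1 rewards (RLVR-style)
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0]])
advantages = grpo_advantages(rewards)
```

Completions that beat their group's average get positive advantages, the rest negative, so no separate value network is needed.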
For the past month or so, I've been slowly working through this book by @sebastianraschka.com which theoretically and practically builds a GPT model from scratch. Highly recommended!
Ironically, I'm writing much more code by hand as a result
Ha, thanks for the kind compliment!
Ha, thanks! Happy new year to you as well!
Thanks! Is /r/machinelearning still weekend-only unless it's an arXiv article?
Uploaded my State of LLMs 2025 report for this year:
magazine.sebastianraschka.com/p/state-of-l...
I planned to just write a brief overview, but yeah, it was an eventful year so it was impossible to keep it below 7000 words :D.
This is an opinion. That's why I prefaced my post with "I think of it as this"
One of the underrated papers this year:
"Small Batch Size Training for Language Models:
When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful" (arxiv.org/abs/2507.07101)
(I can confirm this holds for RLVR, too! I have some experiments to share soon.)
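To make the paper's claim concrete, here's a toy PyTorch sketch of the two regimes it compares (purely illustrative; the model, data, and learning rate are made up and not from the paper):

```python
import copy
import torch

torch.manual_seed(0)
base = torch.nn.Linear(8, 1)
data = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(4)]
loss_fn = torch.nn.MSELoss()

# (a) Gradient accumulation: average grads over 4 micro-batches, then one
# large-batch optimizer step.
m_acc = copy.deepcopy(base)
opt = torch.optim.SGD(m_acc.parameters(), lr=0.01)
opt.zero_grad()
for x, y in data:
    (loss_fn(m_acc(x), y) / len(data)).backward()
opt.step()

# (b) Small-batch vanilla SGD (what the paper argues for): take an optimizer
# step after every micro-batch instead of accumulating.
m_sgd = copy.deepcopy(base)
opt = torch.optim.SGD(m_sgd.parameters(), lr=0.01)
for x, y in data:
    opt.zero_grad()
    loss_fn(m_sgd(x), y).backward()
    opt.step()
```

The paper's point is that (b) works well with plain SGD and avoids the extra bookkeeping of (a), which spends memory and compute holding accumulated gradients just to emulate a larger batch.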
I agree. I was thinking of "faster" because it frees time when letting it do boilerplate stuff. And I was thinking of "better" as in using it to find issues that were accidentally overlooked.
Yeah. My point was that LLMs are good amplifiers, but they are not the only tool one should use and learn from.
It's a cycle: Coding manually, reading resources written by experts, looking at high-quality projects built by experts, getting advice from experts, and repeat...
I think of it as this: LLMs lower the barrier of entry, and they make coders (beginners and experts) more productive.
It's still worth investing in becoming an expert, because then you will get even more out of LLMs and will be able to deliver even better results.
I discuss the more historical building blocks here if you are interested (going back to "Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Neural Networks" 1991 by Schmidhuber): magazine.sebastianraschka.com/p/understand...
Yes yes. This is not a complete history.
I assume you are specifically referring to the first line "202x…"? I merely wanted to say that the focus in the early 2020s was more on pre-training than anything else. (I think the term LLM wasn't coined until the 175B GPT-3 model came out.)
The LLM eras:
202x Pre-training (foundation)
2022 RLHF + PPO
2023 LoRA SFT
2024 Mid-Training
2025 RLVR + GRPO
2026 Inference-time scaling?
2027 Continual learning?
Actually, I didn't change any of the earlier sections but just appended the new sections to the article.
Re your LLM idea, I could see it as a benchmark for agentic LLMs though to see if they can get the correct architecture info from the code bases.
Just updated the Big LLM Architecture Comparison article...
...it grew quite a bit since the initial version in July 2025, more than doubled!
magazine.sebastianraschka.com/p/the-big-ll...
Based on the naming resemblance, if I had to guess, DeepSeekMoE was motivated by DeepSpeed-MoE (arxiv.org/abs/2201.05596, 14 Jan 2022).
Tbh if it took them a month to write and release the paper, the DeepSeekMoE team probably also had the model ready in December.
Or in other words, I don't think they trained the model in just a month with all the ablation studies in that paper.
They don't have a reasoning model yet, so it is a bit unfair to compare, but since you asked:
I think Google originally came up with MoE, and DeepSeek and Mixtral adopted it independently of each other.
E.g., looking at arXiv, the Mixtral report came out on 8 Jan 2024 (arxiv.org/abs/2401.04088), and DeepSeekMoE around the same time, on 11 Jan 2024 (arxiv.org/abs/2401.06066).
Good catch, yes that should have been 70% not 40%. Thanks!
Hold on a sec, Mistral 3 Large uses the DeepSeek V3 architecture, including MLA?
Just went through the config files; the only difference I could see is that Mistral 3 Large uses 2x fewer experts but makes each expert 2x larger.
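A quick back-of-the-envelope check of why that trade leaves the expert parameter count unchanged (the dimensions below are illustrative, loosely DeepSeek-V3-style numbers, not the actual config values of either model):

```python
def routed_expert_params(n_experts: int, d_model: int, d_ff: int) -> int:
    # A SwiGLU-style expert has three weight matrices: gate and up projections
    # (d_model x d_ff) plus a down projection (d_ff x d_model).
    return n_experts * 3 * d_model * d_ff

# Hypothetical configs: halve the expert count, double each expert's width.
deepseek_style = routed_expert_params(n_experts=256, d_model=7168, d_ff=2048)
mistral_style = routed_expert_params(n_experts=128, d_model=7168, d_ff=4096)
```

Since total expert parameters scale with n_experts * d_ff, halving one while doubling the other is a wash; what changes is routing granularity, not model size.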
Yes, good point. I must have accidentally moved the text boxes to the wrong position. Someone mentioned that on the forum last week, and it's fixed now (the next time the MEAP is updated, the figures will be automatically replaced). Thanks for mentioning it!
Sounds interesting, but as far as I know, it doesn't have GPU support (but maybe they added that and I missed it)
Excited for my first conference in Europe in April. I'll be talking about LLMs, Python, coding, and all the fun stuff, and I'm looking forward to meeting fellow AI builders there!
Yes, it's a somewhat scaled-down version of the H100 to make it export-compliant
I think you recently mentioned their alternative, more efficient GPUs. Actually, in their latest V3.2 technical report they mention H800s, so it looks like they are back to using NVIDIA GPUs.