
Sebastian Raschka (rasbt)

@sebastianraschka.com

ML/AI researcher & former stats professor turned LLM research engineer. Author of "Build a Large Language Model From Scratch" (https://amzn.to/4fqvn0D) & reasoning (https://mng.bz/Nwr7). Also blogging about AI research at magazine.sebastianraschka.com.

10,053
Followers
248
Following
312
Posts
24.04.2023
Joined

Latest posts by Sebastian Raschka (rasbt) @sebastianraschka.com

State of AI in 2026: LLMs, Coding, Scaling Laws, China, Agents, GPUs, AGI | Lex Fridman Podcast #490

Recorded a podcast, think it’s pretty good and comprehensive, hope you like it ;) youtu.be/EV7WhVT270Q?...

31.01.2026 23:06 πŸ‘ 39 πŸ” 4 πŸ’¬ 1 πŸ“Œ 1

Been a while since I did an LLM architecture post. Just stumbled upon the Arcee AI Trinity Large release and technical report from yesterday and couldn't resist :)

Also added a new section to my LLM architecture comparison article with more details: magazine.sebastianraschka.com/i/168650848/20

29.01.2026 16:36 πŸ‘ 42 πŸ” 4 πŸ’¬ 1 πŸ“Œ 1

Been pretty heads-down finishing Chapter 6 on implementing RLVR via GRPO. Just finished, and it might be my favorite chapter so far.

Code notebook: github.com/rasbt/reason...

(And it should be added to the early access soon.)

The next chapter adds stability and performance improvements to GRPO.
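
The core idea is easy to sketch: sample a group of completions per prompt, then normalize each completion's reward within its group to get the advantage (this is my own minimal illustration, not the book's code; names are made up):

```python
# Group-relative advantages, the heart of GRPO (illustrative sketch).
# For each prompt, the policy samples a group of completions; each
# completion's advantage is its reward normalized by the group's
# mean and standard deviation, so no learned value model is needed.
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize per-completion rewards within one sampled group."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 completions for one prompt with verifiable 0/1 rewards
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions end up with positive advantages and incorrect ones with negative advantages, and the advantages in each group sum to zero.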

18.01.2026 14:58 πŸ‘ 40 πŸ” 4 πŸ’¬ 2 πŸ“Œ 0

For the past month or so, I've been slowly working through this book by @sebastianraschka.com which theoretically and practically builds a GPT model from scratch. Highly recommended!

Ironically, I'm writing much more code by hand as a result

07.01.2026 16:55 πŸ‘ 16 πŸ” 2 πŸ’¬ 1 πŸ“Œ 0

Ha, thanks for the kind compliment!

15.01.2026 14:04 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

Ha, thanks! Happy new year to you as well!

31.12.2025 13:54 πŸ‘ 2 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

Thanks! Is /r/machinelearning still weekend-only unless it's an arXiv article?

30.12.2025 19:29 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
The State Of LLMs 2025: Progress, Progress, and Predictions A 2025 review of large language models, from DeepSeek R1 and RLVR to inference-time scaling, benchmarks, architectures, and predictions for 2026.

Uploaded my State of LLMs 2025 report for this year:
magazine.sebastianraschka.com/p/state-of-l...

I planned to just write a brief overview, but yeah, it was an eventful year, so it was impossible to keep it below 7,000 words :D

30.12.2025 16:22 πŸ‘ 88 πŸ” 23 πŸ’¬ 4 πŸ“Œ 3

This is an opinion. That's why I prefaced my post with "I think of it as this"

29.12.2025 15:53 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

One of the underrated papers this year:
"Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful" (arxiv.org/abs/2507.07101)

(I can confirm this holds for RLVR, too! I have some experiments to share soon.)
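
For intuition on the "wasteful" part, here's a tiny self-contained check (my own illustration, not from the paper): accumulating gradients over k micro-batches just reproduces the large-batch gradient, so you pay for k forward/backward passes but take only one optimizer step, whereas plain small-batch SGD would take k steps for the same compute.

```python
# Accumulated micro-batch gradients == one large-batch gradient,
# shown numerically for a linear model with an MSE loss.
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(8, 3)), rng.normal(size=8)
w = np.zeros(3)

def grad(Xb, yb, w):
    """Mean-squared-error gradient for a linear model."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y, w)  # one large-batch gradient (batch size 8)
acc = np.mean(        # averaged over 4 micro-batches of size 2
    [grad(X[i:i + 2], y[i:i + 2], w) for i in range(0, 8, 2)],
    axis=0,
)
print(np.allclose(full, acc))  # True
```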

29.12.2025 15:52 πŸ‘ 69 πŸ” 10 πŸ’¬ 0 πŸ“Œ 1

I agree. I was thinking of β€œfaster” because it frees time when letting it do boilerplate stuff. And I was thinking of β€œbetter” as in using it to find issues that were accidentally overlooked.

28.12.2025 21:18 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

Yeah. My point was that LLMs are good amplifiers, but they are not the only tool one should use and learn from.

28.12.2025 17:06 πŸ‘ 3 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

It's a cycle: Coding manually, reading resources written by experts, looking at high-quality projects built by experts, getting advice from experts, and repeat...

28.12.2025 16:17 πŸ‘ 2 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

I think of it as this: LLMs lower the barrier of entry, and they make coders (beginners and experts) more productive.
It's still worth investing in becoming an expert, because then you will get even more out of LLMs and will be able to deliver even better results.

28.12.2025 16:03 πŸ‘ 33 πŸ” 3 πŸ’¬ 4 πŸ“Œ 3
Understanding Large Language Models A Cross-Section of the Most Relevant Literature To Get Up to Speed

I discuss the more historical building blocks here if you are interested (going back to "Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Neural Networks" 1991 by Schmidhuber): magazine.sebastianraschka.com/p/understand...

23.12.2025 15:35 πŸ‘ 4 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

Yes yes. This is not a complete history.
I assume you are specifically referring to the first line β€œ202x…”? I merely wanted to say that the focus in the early 2020s was more on pre-training than anything else then. (I think the term LLM wasn’t coined until the 175B GPT-3 model came out).

23.12.2025 15:34 πŸ‘ 2 πŸ” 0 πŸ’¬ 2 πŸ“Œ 0

The LLM eras:

202x Pre-training (foundation)
2022 RLHF + PPO
2023 LoRA SFT
2024 Mid-Training
2025 RLVR + GRPO
2026 Inference-time scaling?
2027 Continual learning?

22.12.2025 15:40 πŸ‘ 36 πŸ” 3 πŸ’¬ 1 πŸ“Œ 0

Actually, I didn't change any of the earlier sections but just appended the new sections to the article.
Re your LLM idea, I could see it as a benchmark for agentic LLMs, though, to see if they can extract the correct architecture info from the codebases.

14.12.2025 15:30 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

Just updated the Big LLM Architecture Comparison article...
...it grew quite a bit since the initial version in July 2025, more than doubled!
magazine.sebastianraschka.com/p/the-big-ll...

13.12.2025 14:22 πŸ‘ 77 πŸ” 13 πŸ’¬ 1 πŸ“Œ 0

Based on the naming resemblance, if I had to guess, DeepSeekMoE was motivated by DeepSpeed-MoE (arxiv.org/abs/2201.05596, 14 Jan 2022)

12.12.2025 21:00 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

Tbh if it took them a month to write and release the paper, the DeepSeekMoE team probably also had the model ready in December.
Or in other words, I don't think they trained the model in just a month with all the ablation studies in that paper.

12.12.2025 20:58 πŸ‘ 0 πŸ” 0 πŸ’¬ 2 πŸ“Œ 0

They don't have a reasoning model yet, so it is a bit unfair to compare, but since you asked:

12.12.2025 20:42 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

I think Google originally came up with MoE, and DeepSeek and Mixtral adopted it independently of each other.

E.g., looking at arXiv, the Mixtral report came out on 8 Jan 2024 (arxiv.org/abs/2401.04088), and DeepSeekMoE around the same time on 11 Jan 2024 (arxiv.org/abs/2401.06066)

12.12.2025 20:34 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Good catch, yes that should have been 70% not 40%. Thanks!

12.12.2025 19:20 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

Hold on a sec, Mistral 3 Large uses the DeepSeek V3 architecture, including MLA?

Just went through the config files; the only difference I could see is that Mistral 3 Large uses 2x fewer experts but makes each expert 2x larger.

12.12.2025 19:14 πŸ‘ 33 πŸ” 0 πŸ’¬ 2 πŸ“Œ 0

Yes, good point. I must have accidentally moved the text boxes to the wrong position. Someone mentioned that on the forum last week, and it's fixed now (the next time the MEAP is updated, the figures will be automatically replaced). Thanks for mentioning it!

06.12.2025 01:11 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Sounds interesting, but as far as I know, it doesn't have GPU support (but maybe they added that and I missed it)

06.12.2025 01:10 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

Excited for my first conference in Europe in April. I’ll be talking about LLMs, Python, coding, and all the fun stuff, and I’m looking forward to meeting fellow AI builders there!

05.12.2025 04:21 πŸ‘ 26 πŸ” 2 πŸ’¬ 1 πŸ“Œ 0

Yes, it's a somewhat scaled-down version of the H100 to make it export-compliant

03.12.2025 15:59 πŸ‘ 2 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

I think you recently mentioned their alternative, more efficient GPUs. Actually, in their latest V3.2 technical report they mention H800s, so it looks like they are back to using NVIDIA GPUs.

03.12.2025 14:53 πŸ‘ 2 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0