@xowap.dev
CTO/co-founder of WITH, 20+ years of web development, mad genius behind baby-cto.com, pro shitposter. Python, Django, JS/TS, SvelteKit, DevOps, Linux, LLM/GenAI, 3D printing, MR/spatial computing and random takes on society.
There is what you would call a trust crisis
lol I wish _I_ could look into my thought chain, that'd already be progress
That's definitely my thinking. Lots of small agents talking to each other. Even coding should have 5~10 layers of agents, and probably a bunch of different agents for different tasks at each layer IMHO
Great now Gmail is telling me to make my sentences concise. They better use this for the output of Gemini's thinking that'd save us all some time
Amazon: do you want to receive this in 2 days instead of tomorrow so that we make less trips?
Me: sure, sounds responsible
[2 days later]
Amazon: *does 2 distinct deliveries*
---
ARE YOU KIDDING ME
Clockwise from top left: Jupiter, Saturn, Neptune, and Uranus. The worlds are not shown to scale, but all were imaged with JWST's Near-Infrared Camera (NIRCam).
The giant planets of the Solar System, by JWST.
Yeah, and that 9B is better than 35B. I guess with fewer parameters it gets less confused when remembering pre-canned benchmark answers. The whole thing makes no sense. And actually using the model feels like shit...
Gemini 3 Flash had a huge issue (IMHO) with follow-up questions but I see that 3.1 fixed that and I must say it's becoming my favorite model
"I don't know what is wrong with the benchmarks"
Obviously I know what is wrong, and that's the fact that models are fine-tuned on the benchmarks and they don't mean anything anymore
Honestly, I don't know what is wrong with the benchmarks but I'm calling bullshit on this whole game. Running Qwen 3.5 27B (supposedly the smartest of the family?...) costs 400 times more than Ministral 14B for essentially bad results? That's a complete fucking disconnect from reality 🥴
These numbers come from my current app, I've made a bunch of runs of the same stuff with different models to see what the fuck. You can imagine my surprise when I saw that Opus was almost as fast to complete as Gemini 3.1 Flash
Dear LLM Industry,
Please find herein my official letter of hate against reasoning models.
It's fucking ridiculous, despite its prohibitive price it's 15x (literally!) cheaper to run Opus 4.6 than Gemini 3.1 Pro...
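Back-of-the-envelope version of why this happens, with made-up prices and token counts (illustration only, not real pricing): a reasoning model can be cheaper per token and still cost more per run, because every thinking token is billed output.

```python
# Hypothetical per-million-token prices and token counts, for illustration only.
def run_cost(in_tokens, out_tokens, price_in, price_out):
    """Total cost in dollars for one completion."""
    return (in_tokens * price_in + out_tokens * price_out) / 1_000_000

# A pricey non-reasoning model: short, direct answer.
direct = run_cost(in_tokens=2_000, out_tokens=500, price_in=15.0, price_out=75.0)

# A cheaper-per-token reasoning model: the answer arrives wrapped in
# tens of thousands of billed thinking tokens.
reasoning = run_cost(in_tokens=2_000, out_tokens=40_000, price_in=2.0, price_out=12.0)

print(f"direct: ${direct:.4f}, reasoning: ${reasoning:.4f}")
```

With these (invented) numbers the "expensive" model is roughly 7x cheaper per run, which is the shape of the disconnect.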
IMHO Ministral 14B beats everyone to a pulp on all metrics
Currently benching a bunch of models together for a specific task; Qwen is just getting lost in its own thoughts, it's a disaster... For answers that aren't even that good 😅
Comparing Qwen 3.5 27B with GPT OSS 20B, I much prefer the latter (which is also MUCH faster on Ollama)
Nice!
Is anyone actually using Qwen? They score well in benchmarks, but when it's about doing something useful the outcome is never really satisfying (for me)?
That's way too much excitement over an n8n flow which suggests a domain name 🤣
Which also raises the question: if that is true, does that mean that you can reach an absolute truth on any topic, provided the right thought process?
Can you do the same to humans?
If somehow you can get LLMs to work in a way where you systematically eliminate doubt, it means that indeed you could get deterministic outputs despite LLMs being absolutely random at their core
Being sure when nothing is sure: that's what Information Theory manages to do. Transform a messy radio soup into a reliable 1 Gbps wifi link.
Now in terms of LLMs... Does that apply?
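A toy version of the information-theory point, in Python: a repetition code over a binary channel that flips 20% of bits, decoded by majority vote. Reliability manufactured out of an unreliable medium. All numbers are arbitrary.

```python
import random

def transmit(bit, flip_prob, n_copies, rng):
    """Send one bit as n_copies repetitions over a channel that flips
    each copy with probability flip_prob; decode by majority vote."""
    received = [bit ^ (rng.random() < flip_prob) for _ in range(n_copies)]
    return int(sum(received) > n_copies / 2)

rng = random.Random(42)
bits = [rng.randint(0, 1) for _ in range(10_000)]

# Raw channel: roughly 20% of bits arrive flipped.
raw_errors = sum(b != transmit(b, 0.2, 1, rng) for b in bits)
# Repetition code: 11 copies + majority vote squeezes errors way down.
coded_errors = sum(b != transmit(b, 0.2, 11, rng) for b in bits)
print(raw_errors, coded_errors)
```

The open question is whether sampling an LLM several times and "majority voting" its answers buys you the same kind of guarantee.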
Just made an n8n-based algorithm which, starting from an open question, converges run after run. That's super promising!
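I won't paste the actual n8n flow, but the convergence trick can be sketched like this: sample the model several times, keep the majority answer, repeat until it stops changing. `ask_model` here is a stand-in for whatever LLM node you wire in, not a real API.

```python
from collections import Counter
import random

def converge(ask_model, question, samples=5, max_rounds=10):
    """Repeatedly sample the model and majority-vote the answers until
    the winning answer is stable across two consecutive rounds."""
    previous = None
    for _ in range(max_rounds):
        votes = Counter(ask_model(question) for _ in range(samples))
        winner, _ = votes.most_common(1)[0]
        if winner == previous:
            return winner  # two rounds in a row agreed: call it converged
        previous = winner
    return previous  # best effort if it never stabilises

# Stand-in model: noisy but biased toward one answer.
rng = random.Random(0)
fake_model = lambda q: rng.choice(["42", "42", "42", "41", "43"])
print(converge(fake_model, "what is 6*7?"))
```

The randomness never goes away at the token level; the voting layer is what makes the end-to-end output (mostly) deterministic.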
Interesting that energy companies can't cut you off without months of procedures despite extremely high marginal costs while subscription services will block you the second your card gets declined for any reason despite near-zero marginal cost
can't you say that the model has a thinking mode, therefore they am?
This doesn't apply if said API was created after 2010
When a service has a XML API you don't know if you should boo them because who still uses XML or respect them because they had an API before everyone else had an API
3D printing times make no sense. You can print dozens of pieces in 15 minutes but then print one thing in 20h.
It's like the John Wick currency, but reversed
She does swear a lot more IRL, though 🤣
A good reminder that, on top of not letting your AI do whatever the fuck it wants, you should have safeguards against mistakes and recovery plans that you can't fuck up the same way as the rest
Terraform + Infracost + Gemini CLI = ❤️
As Yann LeCun said, if LLMs don't know the consequences of their actions they can't do anything, but in the case of cloud infrastructure there are tools to simulate exactly that. Days of work done in minutes, I'm happy.
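Roughly the loop I mean, as a shell sketch, not a drop-in script: the `infracost` and Gemini CLI flags here are from memory, so check them against your installed versions before relying on this.

```shell
#!/usr/bin/env sh
# Dry-run the change: nothing touches real infrastructure yet.
terraform plan -out=tfplan
terraform show -json tfplan > tfplan.json

# Price the plan before applying it.
infracost breakdown --path tfplan.json --format table > cost.txt

# Let the agent review plan + cost and flag anything scary,
# while a human keeps the only finger on `terraform apply`.
gemini -p "Review this Terraform plan cost estimate and list risky \
changes and surprises: $(cat cost.txt)"
```

The point being: `plan` + cost breakdown give the model a simulated view of the consequences, and `apply` stays out of its hands.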
Ministral 3 is really underrated. GPT-OSS too, though a bit bigger (but very, very fast)
There's generally not much to get out of models this size in a coding assistant, TBH
I'm saying this in the sense that WordPress was an amazing tool on its own when it was released and allowed countless people (including me) to create a website in a flexible way. But it has now been surpassed by other tools on every single metric (except usage, ofc).