fuckin cool
Really informative episode with SemiAnalysis' Dylan Patel: share.snipd.com/episode/add3...
Interesting video about building isochrone maps: youtu.be/rC2VQ-oyDG0?...
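For anyone curious what's under the hood of an isochrone map: it's basically a shortest-path search (e.g. Dijkstra) over a travel-time graph, keeping every node reachable within a time budget. A minimal sketch, using a made-up toy road network (real tools pull the graph from OpenStreetMap):

```python
import heapq

def isochrone(graph, origin, budget):
    """Return nodes reachable from `origin` within `budget` travel time.

    graph: {node: [(neighbor, travel_time), ...]} -- a toy adjacency
    list standing in for a real road network (hypothetical data).
    """
    best = {origin: 0}          # cheapest known arrival time per node
    heap = [(0, origin)]
    while heap:
        t, node = heapq.heappop(heap)
        if t > best.get(node, float("inf")):
            continue            # stale queue entry, already found a faster route
        for nbr, w in graph.get(node, []):
            nt = t + w
            if nt <= budget and nt < best.get(nbr, float("inf")):
                best[nbr] = nt
                heapq.heappush(heap, (nt, nbr))
    return best

roads = {
    "A": [("B", 5), ("C", 12)],
    "B": [("D", 4)],
    "C": [("D", 2)],
    "D": [],
}
print(isochrone(roads, "A", 10))  # {'A': 0, 'B': 5, 'D': 9}
```

The actual map is then drawn by taking the hull of the reachable nodes for each time band.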
All of Randall Munroe's books are GOAT for kids' non-fiction.
Great blog covering the progress this year.
"Asking o1 to complete proofs in creative ways is effectively asking it to be a research colleague. The model doesn't have to get proofs right to be useful, it just has to help us be better researchers."
Good example of utility that evals fail to capture.
Benchmarks are flawed, but one way to trace AI progress over the last year is GPQA Diamond. This is a Google-proof question set on which experts score 81% in their own fields, while highly skilled non-experts, given 30 minutes per question and full Google access, score just 22%.
GPT-4 got 37% at the start of 2024. o1 got 78%. o3 is at 87.7%.
Tools for your LLM in containers? Yes please! www.docker.com/blog/the-mod...
I wish people would post more links to interesting things
I feel like Twitter and LinkedIn and Instagram and TikTok have pushed a lot of people out of the habit of doing that, by penalizing shared links in the various "algorithms"
Bluesky doesn't have that misfeature, thankfully!
I love this idea, thanks for sharing! Btw, in case you revise these, I noticed a typo
Comparing NotebookLM audio overviews to @elevenlabsio.bsky.social's GenFM podcasts: I'm still blown away by the naturalness of NotebookLM's conversation, but prefer GenFM's level of detail, even though it's a more stilted conversation.
This shift from training to inference compute is good news for hyperscalers and Nvidia.
In the ARC AGI eval (linked article in the first post), the "high compute" mode results came from spending ~$350K in total on inference, giving the model more compute to search the solution tree.
These models excel at reasoning-heavy tasks like coding and summarisation, and can work through PhD-level problems given sufficient test-time compute. Unlike their predecessors (4o/3.5-sonnet), these reasoning models get "smarter" with more inference compute.
OpenAI released its 2nd gen reasoning model, o3 (yeah, even they admitted they suck at names).
The evals are perhaps the final nail in the coffin for the scaling wall hypothesis, showing that AI models arenβt hitting a plateau in capabilities.
arcprize.org/blog/oai-o3-...
Lots of apps have had text-to-speech for years, but ElevenLabs voices really stand out to me for naturalness of enunciation. I use it a lot for listening to articles.
elevenlabs.io/blog/introdu...