Oh, hello! May I meet you?
@heikohotz
AI Engineer @ Google • Educator • Traveller • Hobby photographer • Foodie • Film fan • Boardgamer • Londoner
Medium: https://heiko-hotz.medium.com/ Github: https://github.com/heiko-hotz LI: https://www.linkedin.com/in/heikohotz/
I really like tiny (you could even say "nano") bananas. They are so full of flavour!
Read more about it here: deepmind.google/discover/blo...
Official results are in - Gemini achieved gold-medal level in the International Mathematical Olympiad!
An advanced version was able to solve 5 out of 6 problems.
This achievement is a significant advance over last year's result.
At IMO 2024, AlphaGeometry and AlphaProof required experts to first translate problems from natural language into domain-specific languages, such as Lean, and vice-versa for the proofs.
This year, the advanced Gemini model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions, all within the 4.5-hour competition time limit.
Evaluating voice agents is WILD. Accents, noise, weird speech... how do you even test?! Manual prompt engineering for that? A total nightmare.
GOOD NEWS: I built an #AutomatedPromptEngineering (APE) pipeline specifically for voice AI! My new @towardsdatascience blog post dives deep.
What it does:
✅ Creates diverse audio tests
✅ Automates performance eval
✅ Lets an LLM optimize your prompts!
The result? It saves tons of time and money, and it builds reliable voice assistants that have undergone a rigorous evaluation process. No more guesswork!
Full details + code here: towardsdatascience.com/let-ai-tune-...
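The propose-evaluate-keep-the-best loop behind such a pipeline can be sketched in a few lines. Everything below is illustrative, not the actual pipeline from the post: `propose_variant` and `score` are hypothetical stand-ins for real LLM calls and audio-based evaluations.

```python
import random

# Hypothetical stand-ins for the two LLM roles in an APE pipeline: a real
# implementation would ask a model to rewrite the prompt, and would run the
# voice agent against synthetic audio test cases to score each candidate.
def propose_variant(prompt: str, rng: random.Random) -> str:
    """Pretend prompt-writer LLM: appends one instruction tweak."""
    tweaks = [
        " Answer in one short sentence.",
        " Ask a clarifying question if the request is ambiguous.",
        " Ignore filler words and background noise.",
    ]
    return prompt + rng.choice(tweaks)

def score(prompt: str, expected_behaviours: list[str]) -> float:
    """Pretend evaluator: fraction of desired behaviours the prompt covers."""
    return sum(b in prompt.lower() for b in expected_behaviours) / len(expected_behaviours)

def ape_loop(seed_prompt: str, expected_behaviours: list[str],
             rounds: int = 20, seed: int = 0) -> tuple[str, float]:
    """Propose a variant each round; keep it only if it scores better."""
    rng = random.Random(seed)
    best, best_score = seed_prompt, score(seed_prompt, expected_behaviours)
    for _ in range(rounds):
        candidate = propose_variant(best, rng)
        candidate_score = score(candidate, expected_behaviours)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score
```

By construction the loop never returns a prompt that scores worse than the seed, which is the property that replaces manual guesswork.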
Thanks for sharing, @towardsdatascience.com!
WWDC interviews with Apple executives just got a facelift - and it is refreshing!
For years, high-level Apple execs would come to John Gruber's (of Daring Fireball) Talk Show at WWDC. I often found these interviews less than insightful, and sometimes even annoying.
While it's not fair to characterise Gruber as an "Apple fanboy," I consistently found his questions too long-winded and too softball. By the end, it often felt (to me, at least) like just a few folk were a bit too cosy on stage.
This year, however, Craig Federighi and Greg Joswiak were interviewed by other outlets, including Tom's Guide, TechRadar, and The Wall Street Journal. I particularly liked Joanna Stern's interview and her style: direct, concise, and challenging.
To give an example, right out of the gate she asks, "Last year you announced a smarter AI-driven Siri. Where is she?"
From a developer's point of view, Apple's answers made a lot of sense: a 'V1' worked, but didn't meet their high quality/reliability standards when users went 'off the beaten path'.
Building a v1 GenAI app on an existing platform while overhauling the foundation for a better 'V2' is a common strategy. But explaining this to everyday consumers is challenging. These kinds of interviews really help communicate that effectively. What did you think?
I definitely hear you on that one! Out of curiosity: what are the benefits you are looking to gain from an agent framework (in general)?
Introducing Gemini-Powered Slide Creation by Voice!
In this quick demo, Iโve integrated a โSlide Creation Agentโ into my personal project, Project Pastra. Watch how it effortlessly generates slides based on voice instructions.
Not perfect by any means, but already much better than "traditional" voice assistants, and we are only at the beginning of this journey.
You can try it yourself with the Developer Guide for Gemini's Multimodal Live API:
github.com/heiko-hotz/g...
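The tool-calling pattern behind a slide-creation agent like the one in the demo can be sketched as follows. This is a hypothetical illustration, not code from Project Pastra: the tool name `create_slide` and the call format are made up for the example.

```python
# Hypothetical sketch of agent tool calling: the model turns a voice
# instruction into a structured function call, and the app executes it.
slides: list[dict] = []

def create_slide(title: str, bullets: list[str]) -> dict:
    """Tool implementation: append a slide and report its position."""
    slides.append({"title": title, "bullets": bullets})
    return {"status": "ok", "slide_number": len(slides)}

# Registry mapping tool names to implementations.
TOOLS = {"create_slide": create_slide}

def handle_tool_call(call: dict) -> dict:
    """Dispatch a model-issued call like
    {"name": "create_slide", "args": {"title": ..., "bullets": [...]}}."""
    return TOOLS[call["name"]](**call["args"])
```

The key design choice is that the model only ever emits structured calls; the app keeps full control over what actually gets executed.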
Multimodal AI models have the potential to finally deliver on the dream of language being the ultimate human-computer interface.
youtu.be/0OEDHAjY6LM
To me it was a magical moment when I got my first Amazon Echo in 2015 and could just shout words into the air and get a response.
But soon I realised that these voice assistants still require a rigid syntax: I would have to phrase commands in a very specific way for the voice assistant to understand what I meant.
I believe that multimodal AI models have the potential to change that. They allow me to speak much more freely about what I want them to do, and oftentimes they understand and execute in the way I expected them to.
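That "rigid syntax" can be illustrated with a toy intent matcher of the kind classic assistants rely on. The patterns below are entirely hypothetical, for illustration only: hit the exact template and the command works; rephrase it naturally and it fails.

```python
import re

# Toy template-based intent matcher (hypothetical): commands must match
# an exact phrase pattern, otherwise the assistant does not understand.
PATTERNS = {
    r"^turn (on|off) the (\w+) lights?$": "smart_home.lights",
    r"^set a timer for (\d+) minutes?$": "timer.set",
}

def match_intent(utterance: str):
    """Return (intent, slots) on a template hit, or None otherwise."""
    text = utterance.lower().strip()
    for pattern, intent in PATTERNS.items():
        match = re.match(pattern, text)
        if match:
            return intent, match.groups()
    return None  # anything off-template simply fails

# "turn on the kitchen lights"              -> recognised
# "could you switch the kitchen lights on?" -> not recognised
```

A multimodal model, by contrast, handles both phrasings because it interprets meaning rather than matching templates.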
Fade Out. Directed by Jason Zada. Created with Googleโs Veo 2.
youtu.be/9yQXkdA3u8k?...
But Gemini's Multimodal Live API actually lets you build a comparable experience today! I'm proud to share a step-by-step developer guide that will help you build Project Pastra.
It is a full-featured web application for real-time conversations with audio and video input, memory, and tool use! And it works great on mobile phones, too.
Check it out and start building your own voice assistant:
github.com/heiko-hotz/g...