Are you tracking inference cost per request — or just watching the monthly bill?
If you're building agent systems, here's the playbook:
↳ Single-responsibility agents with clean interfaces
↳ Design for model tiering from day one
↳ Track cost per request, not just monthly spend
That per-request number tells you whether your architecture survives at scale.
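What does "cost per request" actually mean in practice? A minimal sketch: sum every model call a single request triggers. The tier names and per-token prices below are illustrative placeholders, not real vendor rates.

```python
# Sketch: per-request cost tracking for a multi-agent pipeline.
# Prices are made-up placeholders, not real vendor rates.
PRICE_PER_1K_TOKENS = {"light": 0.0005, "heavy": 0.01}

def request_cost(calls):
    """calls: list of (model_tier, input_tokens, output_tokens)."""
    total = 0.0
    for tier, tokens_in, tokens_out in calls:
        total += (tokens_in + tokens_out) / 1000 * PRICE_PER_1K_TOKENS[tier]
    return total

# One request that touches three agents:
cost = request_cost([
    ("light", 400, 100),   # demand forecasting
    ("light", 300, 80),    # pricing
    ("heavy", 900, 200),   # assignment
])
```

Multiply that number by your daily request volume and you see immediately whether the architecture scales.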
Fit the model to the stakes of the decision.
Not every agent needs GPT-4.
► Costs:
3 agents = 3× the inference budget per request.
At Uber's volume, that compounds fast.
Their solution: model tiering.
→ Lighter models on demand forecasting (runs constantly, errors are recoverable)
→ Heavier models on assignment (wrong answer = bad UX immediately)
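A tiering policy can be as simple as a lookup table. This is a sketch, assuming two tiers and a task-to-tier mapping I made up; the point is that the routing decision is explicit and lives in one place.

```python
# Sketch: route each agent to a model tier based on the stakes
# of the decision. Tier names and the task->tier table are
# illustrative assumptions, not Uber's actual setup.
TIERING = {
    "demand_forecasting": "light",  # runs constantly, errors recoverable
    "pricing": "light",
    "assignment": "heavy",          # wrong answer = bad UX immediately
}

def pick_model(task: str) -> str:
    # Default to the light tier; escalation is an explicit decision.
    return TIERING.get(task, "light")
```

Designing this in from day one means swapping tiers later is a one-line config change, not a refactor.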
► Each agent is replaceable independently:
Agents 1 and 2 don't need to be right 100% of the time.
They just need to be right often enough to improve Agent 3's matching quality.
If upstream signals are low-confidence?
Agent 3 falls back to simpler heuristics.
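The fallback can be sketched in a few lines. The 0.7 threshold and the nearest-driver-style fallback ETA are illustrative assumptions:

```python
# Sketch: trust an upstream agent's signal only above a confidence
# threshold; otherwise fall back to a simple heuristic.
def eta_cost(signal, fallback_eta, threshold=0.7):
    """signal: (predicted_eta, confidence) from an upstream agent."""
    predicted_eta, confidence = signal
    if confidence >= threshold:
        return predicted_eta
    return fallback_eta  # e.g. a straight-line-distance estimate
```

This is what "replaceable independently" buys you: an upstream agent can degrade, or be swapped out entirely, without breaking assignment.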
🤖 Agent 3 → Assignment
Bipartite graph matching with ETA prediction.
Agents 1 and 2 feed directly into its cost function.
This is where the ride meets the driver.
Connected through clean function-calling interfaces.
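To make "bipartite graph matching" concrete, here's a toy version: pick the rider-driver pairing that minimizes total predicted ETA. Brute force over permutations for clarity only; at real scale you'd use a proper solver (Hungarian algorithm, min-cost flow), not this.

```python
from itertools import permutations

# Sketch: assignment as minimum-cost bipartite matching.
def assign(cost):
    """cost[r][d] = predicted ETA (minutes) if driver d takes rider r."""
    n = len(cost)
    best, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[r][perm[r]] for r in range(n))
        if total < best:
            best, best_perm = total, perm
    return best_perm, best

# 3 riders x 3 drivers; matching[r] is the driver assigned to rider r.
matching, total_eta = assign([
    [4, 9, 7],
    [8, 3, 6],
    [5, 8, 2],
])
```

The upstream agents matter because they shape those cost entries: forecasted demand and live pricing both move the predicted ETAs.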
🤖 Agent 2 → Pricing
Real-time market clearing.
Takes the demand signal from Agent 1.
Adjusts surge pricing before the imbalance hits.
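A stripped-down sketch of that idea: derive a surge multiplier from the forecast demand/supply ratio. The cap and the linear curve are made up for illustration; real market clearing is far more involved.

```python
# Sketch: surge multiplier from forecast demand vs. available supply.
def surge_multiplier(forecast_demand, available_drivers, cap=3.0):
    if available_drivers <= 0:
        return cap  # no supply: price at the cap
    ratio = forecast_demand / available_drivers
    return min(cap, max(1.0, ratio))  # clamp between 1x and the cap
```

The key detail is the input: it prices off the *forecast* from Agent 1, so the multiplier moves before the imbalance shows up in the data.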
🤖 Agent 1 → Demand Forecasting
Time-series ML predicting surge zones 15 minutes ahead.
Not reactive. Predictive.
It doesn't care where riders are. It cares where they're going.
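As a toy stand-in for that forecaster: exponential smoothing over recent per-zone request counts, projecting the next 15-minute window. Real systems use proper time-series models with destination signals; this only shows the shape of the interface.

```python
# Sketch: toy 15-minute-ahead demand forecast for one zone.
def forecast_next_window(counts, alpha=0.5):
    """counts: request counts per 15-min window, oldest first."""
    level = counts[0]
    for c in counts[1:]:
        level = alpha * c + (1 - alpha) * level  # exponential smoothing
    return level
```

Whatever the model, the contract downstream agents depend on is the same: one number per zone, 15 minutes ahead, with a confidence attached.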
Uber runs 20M+ trips daily.
An 8% ETA improvement isn't a rounding error.
It's higher driver earnings.
Fewer cancellations.
Happier riders.
Here's the system that produced it:
I studied how Uber matches 20M rides per day
(so you don't have to):
What are you reading this week?
Unlock all 10 links + my notes: newsletter.hungryminds.dev
You found this scrolling.
50k+ engineers receive it free every Monday.
9. Build Your AI Agent The Right Way (Most Teams Don't):
►🔒 In today's Hungry Minds issue
8. How Balyasny Built An AI Research Engine That Actually Works For Investing:
►🔒 In today's Hungry Minds issue
7. Apple's New Approach To Catching LLM Hallucinations At The Span Level:
►🔒 In today's Hungry Minds issue
6. How Databricks Uses LLMs To Detect PII At 92% Precision Across Every Log:
►🔒 In today's Hungry Minds issue
5. How A 12-Word GitHub Issue Title Owned 4,000 Developer Machines:
►🔒 In today's Hungry Minds issue
4. The System Design Interview Framework I Wish I Had Before I Failed:
► newsletter.systemdesign.one/p/how-to-pr...
3. Defeating The Deepfake: How Cloudflare Is Stopping Laptop Farms And Insider Threats:
► blog.cloudflare.com/deepfakes-i...
2. The Research-Plan-Implement Workflow That Stops Claude Code From Writing Bad Code:
► boristane.com/blog/how-i-...
1. Google Quantum-Proofs HTTPS By Squeezing 15kB Into 700 Bytes:
► arstechnica.com/security/20...
0. How Discord Added Distributed Tracing To Elixir Without Breaking Anything:
► discord.com/blog/tracin...
You are what you eat.
10 brain foods to grow as an engineer:
What do you think?
The goal isn't to be clever about your stack. The goal is to be predictable.
Latency-sensitive, memory-constrained, high-concurrency → Go or Rust.
Team velocity, broad tooling, API work → TypeScript.
Data at scale → Polars, DuckDB, or a proper inference runtime.
The pattern: match the runtime to the constraint.
🔥 Lightweight scripting and data transforms
Surprising pick: TypeScript (Bun/tsx), not Python or Rust.
TypeScript is fast to write, fast to run with Bun, and you get type safety on your data shapes. Rust is overkill. Python works, but you're already writing TS everywhere anyway.
🔥 CLI tools
Default: Python → 300ms+ startup on every invocation, ugly distribution story → Go.
Go produces a single static binary. No venv, no pip install, no "works on my machine." Ship it, run it. Rust if you're familiar with it.
🔥 ML inference serving
Default: plain Python → throughput bottleneck at scale → Python for orchestration, not the hot path.
The actual serving layer should run on an inference runtime. vLLM handles continuous batching and KV cache management that raw Python never will.
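The split looks like this in a sketch: Python does validation and routing, and generation is delegated to the serving runtime behind a client. The `InferenceClient` stub below stands in for a call to a real runtime such as vLLM; the stub, names, and the 4096 limit are illustrative assumptions.

```python
# Sketch: Python as orchestration only. The hot path (token
# generation, continuous batching, KV cache) lives in a serving
# runtime; Python validates, routes, and forwards.
class InferenceClient:
    def generate(self, prompt: str) -> str:
        # Stub standing in for a request to the serving runtime.
        return f"<completion for: {prompt}>"

def handle_request(client, prompt: str, max_len: int = 4096) -> str:
    if not prompt or len(prompt) > max_len:
        raise ValueError("invalid prompt")   # cheap checks stay in Python
    return client.generate(prompt)           # heavy lifting stays out
```

The design choice: Python never touches a token-generation loop, so its per-request overhead stays constant no matter how the serving layer scales.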