4/ GPT-5.4 Pro launched this week with impressive headline numbers: 89.3% on BrowseComp, 83.3% on ARC-AGI-2, and native computer use.
On tax? Same ceiling as standard. The gap between Pro and standard narrows to zero when both models are allowed to think deeply.
Updated rankings (strict):
GPT-5.4 Pro: 62.75% <-- new (tied #1)
GPT-5.4: 62.75%
Opus 4.6: 52.94%
Gemini 3.1 Pro: 49.02%
GPT-5 w/ Search: 41.67%
GPT-5.2 Pro: 41.18%
Sonnet 4.6: 37.25%
Gemini 3 Pro: 36.27%
Opus 4.5: 36.27%
3/ This is the inference scaling story in one chart.
You can either pay 12x more for a model that's better "out of the box", or give the cheaper model a bigger thinking budget and get the same result.
For tax computation, thinking time fully substitutes for model premium. The ceiling is the same.
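Rough math on that trade-off, as a sketch (the per-million prices are from this thread; the per-return token counts are illustrative assumptions, not measured values):

```python
# Back-of-envelope cost comparison: GPT-5.4 Pro vs. standard GPT-5.4.
# Prices per million tokens are from the thread; token counts per
# return are illustrative assumptions, not measured usage.

PRICES = {  # (input $/M, output $/M)
    "gpt-5.4-pro": (30.00, 180.00),
    "gpt-5.4":     (2.50, 15.00),
}

def cost_per_return(model, input_tokens, output_tokens):
    """Dollar cost of one tax return at the given token usage."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Assume a return takes ~8k input tokens, and that a big thinking budget
# roughly quadruples the output/thinking tokens vs. a default budget.
pro = cost_per_return("gpt-5.4-pro", 8_000, 20_000)  # default budget
std = cost_per_return("gpt-5.4", 8_000, 80_000)      # 4x thinking budget
print(f"Pro: ${pro:.2f}/return, standard w/ big budget: ${std:.2f}/return")
# Even carrying 4x the thinking tokens, the standard model is ~3x cheaper here.
```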
2/ At lower thinking budgets, Pro actually pulls ahead:
Medium thinking: Pro 56.86% vs Standard 49.02%
High thinking: Pro 58.82% vs Standard 56.86%
Ultrathink: Pro 62.75% vs Standard 62.75%
Pro is smarter per token of thought, but give the cheaper model enough thinking time and it catches up.
1/ OpenAI just launched GPT-5.4 Pro, their premium model at 12x the API cost of standard GPT-5.4.
$30/M input tokens, $180/M output vs. $2.50/$15.
I ran TaxCalcBench on Pro. The result:
exactly tied with standard GPT-5.4
12x the price, 0% improvement
But the full story is more nuanced:
This is actually genius:
a Chrome extension that lets you drag across your Google Calendar and then paste your free times as perfectly formatted text!
4/ Updated rankings (strict -- every line must be correct):
GPT-5.4: 56.86% <-- new
Opus 4.6: 52.94%
Gemini 3.1 Pro: 49.02%
GPT-5 w/ Search: 41.67%
GPT-5.2 Pro: 41.18%
Sonnet 4.6: 37.25%
Gemini 3 Pro: 36.27%
Opus 4.5: 36.27%
Eight months ago, 32% was SOTA. Now the top model is at nearly 57%.
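For anyone new to the benchmark, here's what "strict" means in practice, sketched in Python (the line ids and dollar values are made up for illustration, not TaxCalcBench's real schema):

```python
# Strict vs. per-line scoring, as described in the thread: a return only
# counts as correct if every single line matches the expected value.
# Data shapes here are illustrative, not TaxCalcBench's actual format.

def score(returns):
    """returns: list of dicts mapping line id -> (expected, actual)."""
    strict_hits = 0
    line_hits = line_total = 0
    for r in returns:
        matches = [expected == actual for expected, actual in r.values()]
        strict_hits += all(matches)   # whole return must be perfect
        line_hits += sum(matches)     # partial credit, line by line
        line_total += len(matches)
    return strict_hits / len(returns), line_hits / line_total

# One fully correct return and one with a single wrong line:
sample = [
    {"1040.line11": (50_000, 50_000), "1040.line24": (6_053, 6_053)},
    {"1040.line11": (72_000, 72_000), "1040.line24": (10_893, 10_884)},
]
strict, by_line = score(sample)
print(f"strict: {strict:.0%}, correct-by-line: {by_line:.0%}")
# -> strict: 50%, correct-by-line: 75% -- one wrong line sinks the return.
```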
3/ Some context on what just happened:
Medium thinking GPT-5.4 matches Gemini 3.1 Pro's best result (49.02%) exactly. But high thinking adds another 8 points on top.
This is the biggest single-model jump we've seen from a thinking level increase. Compute allocation matters a lot here.
2/ The thinking level gap is enormous:
High: 56.86%
Medium: 49.02%
Low: 31.37%
That's a 25-point spread between low and high. Low thinking GPT-5.4 would rank near the bottom of the leaderboard. High thinking puts it at #1.
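If you want to reproduce this kind of sweep, it's a one-parameter change per run. A minimal sketch assuming the OpenAI SDK's reasoning_effort knob carries over to GPT-5.4 (the model id and the exact level names here are assumptions):

```python
# Sweep reasoning effort for the same prompt -- a sketch assuming the
# OpenAI Python SDK's reasoning_effort parameter; the model id and the
# set of levels are assumptions, not confirmed for GPT-5.4.
from openai import OpenAI

client = OpenAI()

def run_return(tax_prompt: str, effort: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-5.4",          # hypothetical model id
        reasoning_effort=effort,  # e.g. "low" | "medium" | "high"
        messages=[{"role": "user", "content": tax_prompt}],
    )
    return resp.choices[0].message.content

for effort in ("low", "medium", "high"):
    answer = run_return("Compute Form 1040 line 24 for ...", effort)
    # score `answer` against the expected return here
```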
1/ The rivalry between OpenAI & Anthropic continues: GPT-5.4 is now the best model in the world at filing taxes (better than Opus 4.6)!
We just ran TaxCalcBench on GPT-5.4.
56.86% of tax returns computed perfectly.
That's #1 overall: the first model to break 55%, surpassing Claude Opus 4.6:
4/ Updated rankings (strict: every line must be correct):
Opus 4.6: 52.94%
Gemini 3.1 Pro: 49.02% <-- new
GPT-5 w/ Search: 41.67%
GPT-5.2 Pro: 41.18%
Sonnet 4.6: 37.25%
Gemini 3 Pro: 36.27%
Opus 4.5: 36.27%
The gap at the top is closing fast. 8 months ago, 32% was SOTA.
3/ An interesting wrinkle on thinking budgets:
Ultrathink: 49.02%
Medium: 49.02%
High: 47.06%
Lobotomized: 37.25%
Low: 35.29%
Medium thinking matches ultrathink exactly. More thinking doesn't always mean better, but some thinking is critical.
The jump from low to medium is +14 points.
2/ What makes this result wild: Gemini 3 Pro scored 36.27% just 4 months ago.
Gemini 3.1 Pro: 49.02%
vs.
Gemini 3 Pro: 36.27%
That's a 13-point jump in a single generation. Google just leapfrogged GPT-5.2 Pro in one move.
1/ We just ran TaxCalcBench on Gemini 3.1 Pro to test how it does filing taxes.
49.02% of tax returns computed perfectly.
That's #2 overall, only 4 points behind Opus 4.6. And it now holds the best "correct by line" score of any model ever tested (88.54%).
Updated leaderboard:
4/ Full updated rankings (using strict scoring where every line must be correct):
Opus 4.6: 52.94%
GPT-5 w/ Search: 41.67%
GPT-5.2 Pro: 41.18%
Sonnet 4.6: 37.25% <-- new
Gemini 3 Pro: 36.27%
Opus 4.5: 36.27%
GPT-5.2: 33.82%
Gemini 2.5 Pro: 32.35%
7 months ago, 32% was SOTA.
3/ Thinking budget matters enormously for tax.
Same model, same prompt, different thinking levels:
Sonnet 4.6 (ultrathink): 37.25%
Sonnet 4.6 (no thinking): 19.61%
Nearly 2x accuracy just from letting the model think longer.
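Here's the knob behind that comparison, sketched with the Anthropic SDK's extended-thinking parameter (the model id and budget values are my assumptions, not how TaxCalcBench actually invokes the model):

```python
# Same model, same prompt, different thinking budgets -- a sketch using
# the Anthropic SDK's extended-thinking parameter. Model id and budget
# values are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()

def run_return(tax_prompt: str, budget_tokens: int | None):
    kwargs = {}
    if budget_tokens:  # omit `thinking` entirely for the no-thinking run
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": budget_tokens}
    return client.messages.create(
        model="claude-sonnet-4-6",  # hypothetical model id
        max_tokens=40_000,          # must exceed the thinking budget
        messages=[{"role": "user", "content": tax_prompt}],
        **kwargs,
    )

no_thinking = run_return("Compute Form 1040 line 24 for ...", None)
ultrathink = run_return("Compute Form 1040 line 24 for ...", 32_000)
```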
2/ What makes this interesting isn't the raw number: it's what it means for the cost curve.
Sonnet 4.6 (cheaper, faster model): 37.25%
Opus 4.5 (previous generation's most expensive model): 36.27%
A model that costs a fraction of the price just beat last generation's flagship at tax filing.
1/ We just ran TaxCalcBench on Claude Sonnet 4.6.
37.25% of tax returns computed perfectly.
That's a "mid-tier" model outscoring every single flagship model from 6 months ago.
Updated leaderboard:
I started Column Tax in early 2021 and sold it to Aiwyn in late 2025.
Starting a startup was the hardest thing I've ever done.
But knowing certain things makes it easier.
So I wrote down everything I learned through experience that I wish I had known at the start:
3/ But 52.94% still isn't good enough:
The IRS doesn't grade on a curve. One wrong number = rejection, penalty, or audit
The benchmark for this task is 100%. That's what deterministic tax engines (like the one we built at Column Tax) achieve today (sketch below)
AI isn't there yet, but the trajectory is wild
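To make that concrete, here's a toy version of what a deterministic engine does, using the 2024 single-filer brackets (a sketch, not Column Tax's engine):

```python
# Why the benchmark for this task is 100%: tax computation is a pure,
# deterministic function of the inputs. A toy marginal-rate calculation
# over the 2024 single-filer brackets -- real engines encode thousands
# of such rules, but every one is exact, so they never drift.
from decimal import Decimal

BRACKETS = [  # (upper bound of bracket, marginal rate)
    (Decimal(11_600), Decimal("0.10")),
    (Decimal(47_150), Decimal("0.12")),
    (Decimal(100_525), Decimal("0.22")),
    (Decimal(191_950), Decimal("0.24")),
    (Decimal(243_725), Decimal("0.32")),
    (Decimal(609_350), Decimal("0.35")),
    (Decimal("Infinity"), Decimal("0.37")),
]

def tax_owed(taxable_income: Decimal) -> Decimal:
    """Exact marginal tax: same inputs always give the same output."""
    owed, lower = Decimal(0), Decimal(0)
    for upper, rate in BRACKETS:
        if taxable_income <= lower:
            break
        owed += (min(taxable_income, upper) - lower) * rate
        lower = upper
    return owed.quantize(Decimal("1"))  # whole dollars

print(tax_owed(Decimal(50_000)))  # always 6053, every single run
```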
2/ July 2025: We released TaxCalcBench. The best model (Gemini 2.5 Pro) scored 32.35%.
Sep 2025: GPT-5 launched with a TCB score of 38.24% (better!)
Jan 2026: GPT-5.2 Pro hit 41.18%.
Feb 2026: Claude Opus 4.6 jumped to 52.94%!
That's a ~20-point improvement in 7 months.
1/ One year ago, no AI model could calculate a single tax return correctly.
Today, Claude Opus 4.6 gets 52.94% right.
Here's the full timeline of how AI went from 0% to halfway to replacing TurboTax:
Thanks to coding agents, everyone (PMs, Designers, etc.) can be a vibe coder. But my biggest question is:
Who is going to be the vibe QA engineer?
• 1 week ago: Margen publishes a press release & blog post citing their TaxCalcBench scores
• 1 month ago: Prime Meridian announces their $3.5M seed round from General Catalyst with a TaxCalcBench perfect score
• 2 months ago: Filed measures their system on TaxCalcBench and publishes a blog post*
• 8 months ago: first commit to TaxCalcBench
You Can Just Do Things: 8 months ago I decided to create the first-ever AI eval for tax filing: TaxCalcBench. Today, it's the industry standard:
Just in the past few months, companies have started citing TaxCalcBench in their marketing, announcements, and blog posts.
I need proof before I make claims, so I can now confidently say: Claude Opus 4.6 is an incredible model. No model has been able to do this before:
Claude Opus 4.6 now computes 52.94% of the tax returns in the TaxCalcBench dataset exactly correctly, handily beating GPT-5 w/ Web Search.