4/ GPT-5.4 Pro launched this week with impressive headline numbers: 89.3% on BrowseComp, 83.3% on ARC-AGI-2, and native computer use.
On tax? Same ceiling as standard. The gap between Pro and standard narrows to zero when both models are allowed to think deeply.
Updated rankings (strict):
GPT-5.4 Pro: 62.75% <-- new (tied #1)
GPT-5.4: 62.75%
Opus 4.6: 52.94%
Gemini 3.1 Pro: 49.02%
GPT-5 w/ Search: 41.67%
GPT-5.2 Pro: 41.18%
Sonnet 4.6: 37.25%
Gemini 3 Pro: 36.27%
Opus 4.5: 36.27%
3/ This is the inference scaling story in one chart.
You can either pay 12x more for a model that's better "out of the box", or give the cheaper model a bigger thinking budget and get the same result.
For tax computation, thinking time fully substitutes for model premium. The ceiling is the same.
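Rough math on that trade-off, as a sketch (the per-million prices are from this thread; the per-return token counts are illustrative assumptions, not measured values):

```python
# Back-of-envelope cost comparison: GPT-5.4 Pro vs. standard GPT-5.4.
# Prices per million tokens are from the thread; token counts per
# return are illustrative assumptions, not measured usage.

PRICES = {  # (input $/M, output $/M)
    "gpt-5.4-pro": (30.00, 180.00),
    "gpt-5.4":     (2.50, 15.00),
}

def cost_per_return(model, input_tokens, output_tokens):
    """Dollar cost of one tax return at the given token usage."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Assume a return takes ~8k input tokens, and that a big thinking budget
# roughly quadruples the output/thinking tokens vs. a default budget.
pro = cost_per_return("gpt-5.4-pro", 8_000, 20_000)  # default budget
std = cost_per_return("gpt-5.4", 8_000, 80_000)      # 4x thinking budget
print(f"Pro: ${pro:.2f}/return, standard w/ big budget: ${std:.2f}/return")
# Even carrying 4x the thinking tokens, the standard model is ~3x cheaper here.
```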
2/ At lower thinking budgets, Pro actually pulls ahead:
Medium thinking: Pro 56.86% vs Standard 49.02%
High thinking: Pro 58.82% vs Standard 56.86%
Ultrathink: Pro 62.75% vs Standard 62.75%
Pro is smarter per token of thought, but give the cheaper model enough thinking time and it catches up.
1/ OpenAI just launched GPT-5.4 Pro, their premium model at 12x the API cost of standard GPT-5.4.
$30/M input tokens, $180/M output vs. $2.50/$15.
I ran TaxCalcBench on Pro. The result:
exactly tied with standard GPT-5.4
12x the price, 0% improvement
But the full story is more nuanced:
This is actually genius:
a Chrome extension that lets you drag across your Google Calendar and then paste your free times as perfectly formatted text!
4/ Updated rankings (strict -- every line must be correct):
GPT-5.4: 56.86% <-- new
Opus 4.6: 52.94%
Gemini 3.1 Pro: 49.02%
GPT-5 w/ Search: 41.67%
GPT-5.2 Pro: 41.18%
Sonnet 4.6: 37.25%
Gemini 3 Pro: 36.27%
Opus 4.5: 36.27%
Eight months ago, 32% was SOTA. Now the top model is at nearly 57%.
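For anyone new to the benchmark, here's what "strict" means in practice, sketched in Python (the line ids and dollar values are made up for illustration, not TaxCalcBench's real schema):

```python
# Strict vs. per-line scoring, as described in the thread: a return only
# counts as correct if every single line matches the expected value.
# Data shapes here are illustrative, not TaxCalcBench's actual format.

def score(returns):
    """returns: list of dicts mapping line id -> (expected, actual)."""
    strict_hits = 0
    line_hits = line_total = 0
    for r in returns:
        matches = [expected == actual for expected, actual in r.values()]
        strict_hits += all(matches)   # whole return must be perfect
        line_hits += sum(matches)     # partial credit, line by line
        line_total += len(matches)
    return strict_hits / len(returns), line_hits / line_total

# One fully correct return and one with a single wrong line:
sample = [
    {"1040.line11": (50_000, 50_000), "1040.line24": (6_053, 6_053)},
    {"1040.line11": (72_000, 72_000), "1040.line24": (10_893, 10_884)},
]
strict, by_line = score(sample)
print(f"strict: {strict:.0%}, correct-by-line: {by_line:.0%}")
# -> strict: 50%, correct-by-line: 75% -- one wrong line sinks the return.
```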
3/ Some context on what just happened:
Medium thinking GPT-5.4 matches Gemini 3.1 Pro's best result (49.02%) exactly. But high thinking adds another 8 points on top.
This is the biggest single-model jump we've seen from a thinking level increase. Compute allocation matters a lot here.
2/ The thinking level gap is enormous:
High: 56.86%
Medium: 49.02%
Low: 31.37%
That's a 25-point spread between low and high. Low thinking GPT-5.4 would rank near the bottom of the leaderboard. High thinking puts it at #1.
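If you want to reproduce this kind of sweep, it's a one-parameter change per run. A minimal sketch assuming the OpenAI SDK's reasoning_effort knob carries over to GPT-5.4 (the model id and the exact level names here are assumptions):

```python
# Sweep reasoning effort for the same prompt -- a sketch assuming the
# OpenAI Python SDK's reasoning_effort parameter; the model id and the
# set of levels are assumptions, not confirmed for GPT-5.4.
from openai import OpenAI

client = OpenAI()

def run_return(tax_prompt: str, effort: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-5.4",          # hypothetical model id
        reasoning_effort=effort,  # e.g. "low" | "medium" | "high"
        messages=[{"role": "user", "content": tax_prompt}],
    )
    return resp.choices[0].message.content

for effort in ("low", "medium", "high"):
    answer = run_return("Compute Form 1040 line 24 for ...", effort)
    # score `answer` against the expected return here
```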
1/ The rivalry between OpenAI & Anthropic continues: GPT-5.4 is now the best model in the world at filing taxes (better than Opus 4.6)!
We just ran TaxCalcBench on GPT-5.4.
56.86% of tax returns computed perfectly.
That's #1 overall: the first model to break 55%, surpassing Claude Opus 4.6:
4/ Updated rankings (strict: every line must be correct):
Opus 4.6: 52.94%
Gemini 3.1 Pro: 49.02% <-- new
GPT-5 w/ Search: 41.67%
GPT-5.2 Pro: 41.18%
Sonnet 4.6: 37.25%
Gemini 3 Pro: 36.27%
Opus 4.5: 36.27%
The gap at the top is closing fast. 8 months ago, 32% was SOTA.
3/ An interesting wrinkle on thinking budgets:
Ultrathink: 49.02%
Medium: 49.02%
High: 47.06%
Lobotomized: 37.25%
Low: 35.29%
Medium thinking matches ultrathink exactly. More thinking doesn't always mean better, but some thinking is critical.
The jump from low to medium is +14 points.
2/ What makes this result wild: Gemini 3 Pro scored 36.27% just 4 months ago.
Gemini 3.1 Pro: 49.02%
vs.
Gemini 3 Pro: 36.27%
That's a 13-point jump in a single generation. Google just leapfrogged GPT-5.2 Pro in one move.
1/ We just ran TaxCalcBench on Gemini 3.1 Pro to test how it does filing taxes.
49.02% of tax returns computed perfectly.
That's #2 overall, only 4 points behind Opus 4.6. And it now holds the best "correct by line" score of any model ever tested (88.54%).
Updated leaderboard:
4/ Full updated rankings (using strict scoring where every line must be correct):
Opus 4.6: 52.94%
GPT-5 w/ Search: 41.67%
GPT-5.2 Pro: 41.18%
Sonnet 4.6: 37.25% <-- new
Gemini 3 Pro: 36.27%
Opus 4.5: 36.27%
GPT-5.2: 33.82%
Gemini 2.5 Pro: 32.35%
7 months ago, 32% was SOTA.
3/ Thinking budget matters enormously for tax.
Same model, same prompt, different thinking levels:
Sonnet 4.6 (ultrathink): 37.25%
Sonnet 4.6 (no thinking): 19.61%
Nearly 2x accuracy just from letting the model think longer.
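Here's the knob behind that comparison, sketched with the Anthropic SDK's extended-thinking parameter (the model id and budget values are my assumptions, not how TaxCalcBench actually invokes the model):

```python
# Same model, same prompt, different thinking budgets -- a sketch using
# the Anthropic SDK's extended-thinking parameter. Model id and budget
# values are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()

def run_return(tax_prompt: str, budget_tokens: int | None):
    kwargs = {}
    if budget_tokens:  # omit `thinking` entirely for the no-thinking run
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": budget_tokens}
    return client.messages.create(
        model="claude-sonnet-4-6",  # hypothetical model id
        max_tokens=40_000,          # must exceed the thinking budget
        messages=[{"role": "user", "content": tax_prompt}],
        **kwargs,
    )

no_thinking = run_return("Compute Form 1040 line 24 for ...", None)
ultrathink = run_return("Compute Form 1040 line 24 for ...", 32_000)
```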
2/ What makes this interesting isn't the raw number: it's what it means for the cost curve.
Sonnet 4.6 (cheaper, faster model): 37.25%
Opus 4.5 (previous generation's most expensive model): 36.27%
A model that costs a fraction of the price just beat last generation's flagship at tax filing.
1/ We just ran TaxCalcBench on Claude Sonnet 4.6.
37.25% of tax returns computed perfectly.
That's a "mid-tier" model outscoring every single flagship model from 6 months ago.
Updated leaderboard:
I started Column Tax in early 2021 and sold it to Aiwyn in late 2025.
Starting a startup was the hardest thing I've ever done.
But knowing certain things makes it easier.
So I wrote down everything I learned through experience that I wish I had known at the start:
3/ But 52.94% still isn't good enough:
The IRS doesn't grade on a curve. One wrong number = rejection, penalty, or audit
The benchmark for this task is 100%. That's what deterministic tax engines (like the one we built at Column Tax) achieve today (sketch below)
AI isn't there yet, but the trajectory is wild
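To make that concrete, here's a toy version of what a deterministic engine does, using the 2024 single-filer brackets (a sketch, not Column Tax's engine):

```python
# Why the benchmark for this task is 100%: tax computation is a pure,
# deterministic function of the inputs. A toy marginal-rate calculation
# over the 2024 single-filer brackets -- real engines encode thousands
# of such rules, but every one is exact, so they never drift.
from decimal import Decimal

BRACKETS = [  # (upper bound of bracket, marginal rate)
    (Decimal(11_600), Decimal("0.10")),
    (Decimal(47_150), Decimal("0.12")),
    (Decimal(100_525), Decimal("0.22")),
    (Decimal(191_950), Decimal("0.24")),
    (Decimal(243_725), Decimal("0.32")),
    (Decimal(609_350), Decimal("0.35")),
    (Decimal("Infinity"), Decimal("0.37")),
]

def tax_owed(taxable_income: Decimal) -> Decimal:
    """Exact marginal tax: same inputs always give the same output."""
    owed, lower = Decimal(0), Decimal(0)
    for upper, rate in BRACKETS:
        if taxable_income <= lower:
            break
        owed += (min(taxable_income, upper) - lower) * rate
        lower = upper
    return owed.quantize(Decimal("1"))  # whole dollars

print(tax_owed(Decimal(50_000)))  # always 6053, every single run
```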
2/ July 2025: We released TaxCalcBench. The best model (Gemini 2.5 Pro) scored 32.35%.
Sep 2025: GPT-5 launched with a TCB score of 38.24% (better!)
Jan 2026: GPT-5.2 Pro hit 41.18%.
Feb 2026: Claude Opus 4.6 jumped to 52.94%!
That's a ~20-point improvement in 7 months.
1/ One year ago, no AI model could calculate a single tax return correctly.
Today, Claude Opus 4.6 gets 52.94% right.
Here's the full timeline of how AI went from 0% to halfway to replacing TurboTax:
Thanks to coding agents, everyone (PMs, Designers, etc.) can be a vibe coder. But my biggest question is:
Who is going to be the vibe QA engineer?
• 1 week ago: Margen publishes a press release & blog post citing their TaxCalcBench scores
• 1 month ago: Prime Meridian announces their $3.5M seed round from General Catalyst with a TaxCalcBench perfect score
• 2 months ago: Filed measures their system on TaxCalcBench and publishes a blog post*
• 8 months ago: first commit to TaxCalcBench
You Can Just Do Things: 8 months ago I decided to create the first-ever AI eval for tax filing: TaxCalcBench. Today, it's the industry standard:
Just in the past few months, companies have started citing TaxCalcBench in their marketing, announcements, and blog posts.
I need proof before I make claims, so I can now confidently say: Claude Opus 4.6 is an incredible model. No model has been able to do this before:
Claude Opus 4.6 now computes 52.94% of the tax returns in the TaxCalcBench dataset exactly correctly, handily beating GPT-5 w/ Web Search.