
LLMs Are Guessing at Math (And That’s a Problem)

Let’s be clear: the tech is so cool! I ask a question and it gives me an answer drawing on a wealth of knowledge from all over the web. It synthesizes information that would have taken forever to track down, helps me organize random streams of thought, and has helped me write code I never could have written without it. I started wondering how AI math would fare against my 30 years’ experience building calculation software.

I started using ChatGPT 3 when it launched, and of course I tried an AI math problem. Something benign like mortgages. 

In the first versions, ChatGPT simply calculated wrong. By ChatGPT 4, it did the math correctly but mangled the problem: I told it to “calculate a payment for a $500,000 home at 6% interest” and it treated $500k as a loan amount.

So I’d say, “Wait, that’s a home price, not a loan amount,” and ChatGPT would either claim it had assumed 0% down (what a deal!) or correct itself. Either way, it didn’t outline its assumptions, didn’t explain its math, and on the next problem it used rounded intermediate results, which caused drift.
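For contrast, here’s what deterministic code for that problem looks like. This is a generic Python sketch (not TrueMath’s engine), using the standard amortization formula and assuming a hypothetical 20% down payment to get from home price to loan amount — the exact assumption an LLM should state up front rather than silently guess:

```python
def monthly_payment(home_price, annual_rate, years, down_pct=0.20):
    """Standard amortized mortgage payment.

    down_pct is an explicit, stated assumption (hypothetical here) --
    the kind of thing an LLM silently guessed at. All intermediate
    values stay at full precision; only the final result is rounded
    for display, so there is no drift into the next calculation.
    """
    principal = home_price * (1 - down_pct)   # loan amount, not home price
    r = annual_rate / 12                      # monthly interest rate
    n = years * 12                            # number of payments
    return principal * r * (1 + r) ** n / ((1 + r) ** n - 1)

payment = monthly_payment(500_000, 0.06, 30)
print(f"${payment:,.2f}")  # $2,398.20 on the $400,000 loan
```

Run it a thousand times and you get $2,398.20 a thousand times.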

(Don’t believe me? Try it. Or even better: ask it if you should rely on it for math problems. It will say no. Hey, pretty smart!)

I’ve been building calculators for nearly 30 years. PowerOne has been downloaded over 30 million times. I’ve shipped calculation engines on Palm, Windows Mobile, BlackBerry, Windows, macOS, iOS, Android, and the web. And if my customers got these answers, they would have quit using PowerOne.

Here’s what most people don’t understand about LLMs and math: they’re not calculating, they’re predicting. Token by token, they’re guessing what the right-looking answer might be.

That’s fine for copywriting but disastrous for anything that needs to be verifiably correct.

Here’s another example: I ran the same compound interest calculation through GPT-4.5, Claude 3.5, and Gemini, five times each. The answers ranged from 17% to 50%. Same inputs. Same formula. Wildly different outputs.

| Item | TrueMath (YAML) | TrueMath (LLM) | ChatGPT 5.1 Thinking | Claude Opus 4.5 | Gemini Thinking with 3 Pro |
|---|---|---|---|---|---|
| Time to Complete (secs) | Ave: 0.272, Range: 0.23–0.32 | Ave: 2.88, Range: 2.6–3.3 | Ave: 163.8, Range: 102–231 | Ave: 52.8, Range: 15–72 | Ave: 94.2, Range: 85–108 |
| Answer | Ave: 16.384%, Range: 16.384%–16.384% | Ave: 16.384%, Range: 16.384%–16.384% | Ave: 27%, Range: 17%–35% | Ave: 42.07%, Range: 34.4%–49.86% | Ave: 34.38%, Range: 34.37%–34.4% |
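A zero-width range is exactly what deterministic arithmetic gives you. Here’s a minimal sketch of what “same inputs, same outputs” means in practice, using a hypothetical effective-annual-rate calculation (these inputs are illustrative, not the ones behind the comparison above):

```python
def effective_annual_rate(nominal_rate, periods_per_year):
    """Effective annual rate from a nominal rate compounded n times a year."""
    return (1 + nominal_rate / periods_per_year) ** periods_per_year - 1

# Run the identical calculation five times, as in the comparison above.
results = {effective_annual_rate(0.08, 12) for _ in range(5)}

# A deterministic engine collapses to a single answer; the range is zero.
assert len(results) == 1
print(f"{results.pop() * 100:.2f}%")  # 8.30%, every single run
```

No temperature, no sampling, no drift: the set of results always has exactly one member.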

If you’re building software where the math matters, this isn’t a minor inconvenience. It’s a dealbreaker in industries like fintech, property tech, construction, and medical tech. Basically anywhere compliance and trust matter.

Can you just wait it out and hope LLMs get better? Sure, you can, but hope isn’t a plan. They may get better, but the reality is that LLMs are probabilistic systems, not deterministic ones. The answers will always vary. That’s what makes LLMs so amazing! Hallucination is a feature!

But let’s say they do get better and can calculate accurately 95% of the time. In math, that’s wrong 1 out of every 20 times. I’d never trust Excel if it did that.
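A quick back-of-the-envelope on why 95% isn’t good enough: if each calculation independently has a 5% chance of being wrong, a spreadsheet with just 20 of them is more likely than not to contain at least one error.

```python
# Probability that at least one of 20 independent calculations is wrong,
# given a hypothetical 95% per-calculation accuracy rate.
p_correct = 0.95
n_calcs = 20
p_any_error = 1 - p_correct ** n_calcs
print(f"{p_any_error:.0%}")  # 64%
```

Roughly a 64% chance your workbook has a wrong number in it somewhere.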

That’s why Bill Kelly and I started building TrueMath. Not to compete with LLMs, but to give them a backbone they can rely on.

More on that next time.

Reach out: elia.freedman@truemath.ai
Learn more: truemath.ai
Sign up for early access: https://app.truemath.ai/signup 
