Why Your ChatGPT Finance Workflow Keeps Breaking
Jamie Saveall
Estimated reading time: 7 min
You paste the management accounts into ChatGPT. You ask for board commentary. It comes back in thirty seconds, well-written, in your tone of voice. You scan it. It looks right.
It probably isn't.
Independent benchmarks of large language models on financial reasoning tasks land somewhere in the 60–70% accuracy range, depending on the task. Roughly a third of the numbers a pure-LLM workflow produces are wrong, missing, or invented. Not subtly wrong. Often confidently, fluently, board-readably wrong.
Most finance people who've tried this discover the problem the awkward way. A figure cited in a meeting that doesn't tie back. A trend "explained" with reasoning that contradicts the data. A revenue split that adds up to 103%. The model writes with such fluency that the error is invisible until someone with the underlying spreadsheet open catches it.
This is not a prompt-engineering problem. You can't fix it with better instructions. The structure is the problem.
Why language models can't add
LLMs don't reason. They predict the next token in a sequence, where a token is a fragment of text. Sometimes a word, sometimes a few characters. The number 1,247,832 isn't stored as a number. It's stored as several tokens the model has seen co-occur with other token sequences in training.
When you ask ChatGPT to compute (€1,247,832 × 1.04) + €312,540, it isn't doing arithmetic. It's predicting what the answer probably looks like, based on patterns in its training data. For small numbers in common contexts, those predictions are close enough most of the time. For larger figures, unusual combinations, or anything requiring multi-step computation, the predictions drift.
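You can see both halves of this for yourself. Below is a quick sketch using OpenAI's tiktoken library (an assumption: that you have Python and `pip install tiktoken` to hand; any modern BPE tokeniser makes the same point). One print shows the fragments the model actually sees; the other shows the deterministic answer it can only approximate.

```python
# What the model "sees" vs. what the calculation actually is.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era tokeniser

# The figure is stored as several text fragments, not as one number.
print([enc.decode([t]) for t in enc.encode("1,247,832")])

# The deterministic answer the model is only pattern-matching towards:
print(1_247_832 * 1.04 + 312_540)  # ≈ 1,610,285.28
```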
This is also why ChatGPT can confidently produce a percentage variance that's mathematically impossible. The token sequence "the quarter saw a 142% gross margin" is perfectly fluent, and fluency is the only test the model applies: it has no internal check for whether 142% is a number a margin can take. It has no concept of margin. It has a concept of what the word looks like next to certain other words.
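For contrast, here's a minimal sketch of the guard a deterministic layer gets for free, the check a token predictor structurally can't run. The function name and bounds are illustrative, not from any particular product.

```python
# Deterministic code can encode what a margin actually is.
def gross_margin_pct(revenue: float, cogs: float) -> float:
    if revenue <= 0:
        raise ValueError("Revenue must be positive to compute a margin")
    margin = (revenue - cogs) / revenue * 100
    if margin > 100:  # impossible: COGS would have to be negative
        raise ValueError(f"Implausible gross margin: {margin:.0f}%")
    return margin
```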
You'll get away with it most of the time. Then one day you won't, in front of a board.
The newer models with built-in code interpreters paper over the worst of this. Ask GPT-4 or Claude to "use Python" and it will spin up a sandbox, run the calculation, and return a result that's actually computed. That helps with simple arithmetic. It doesn't help with the deeper problem. The model still has to decide which numbers to compute, which formulas to apply, and how to interpret the output. Each of those decisions is itself a token prediction. And it's still happy to invent a comparison year, miscount the rows in your input, or quietly drop a segment that didn't fit the pattern it was generating.
The pattern that actually works
The teams getting real value from AI in finance have stopped asking the model to do arithmetic. They've moved to what's becoming known as the final-mile LLM pattern.
The architecture is simple. Compute the numbers deterministically (in code, in a spreadsheet, in a database). Only at the very end, hand the computed numbers to the model and ask it to write the commentary, the executive summary, or the variance explanation.
The model never touches the maths. It only touches the words.
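In code, the whole pattern is a dozen lines. This is a minimal sketch, not any product's implementation; the figures and the prompt wording are illustrative.

```python
# The final-mile pattern: compute first, narrate last.
prior = {"revenue": 4_210_000, "gross_margin_pct": 38.2}
current = {"revenue": 4_700_000, "gross_margin_pct": 35.9}

# 1. Deterministic layer: every figure is computed in code.
figures = {
    "revenue_eur": current["revenue"],
    "revenue_growth_pct": round((current["revenue"] / prior["revenue"] - 1) * 100, 1),
    "gross_margin_pct": current["gross_margin_pct"],
    "margin_change_pts": round(current["gross_margin_pct"] - prior["gross_margin_pct"], 1),
}

# 2. Final mile: the model writes around the numbers, never under them.
prompt = (
    "Write two sentences of board commentary. Use ONLY these computed "
    f"figures, verbatim, and derive no new numbers: {figures}"
)
print(prompt)  # this goes to whichever LLM client you use
```

The accuracy ceiling is now the arithmetic in the middle, not the model's token predictions.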
This sounds obvious. It isn't what most ChatGPT-in-finance workflows do. Most workflows hand the model raw data, a CSV or a P&L or a trial balance, and trust it to compute and narrate in one shot. That's the workflow that runs in the 60–70% range.
The final-mile pattern moves the accuracy ceiling to whatever your computation layer is. If your formulas are right, the numbers in the output are right. The model writes the story around them, not under them. It's the same principle behind why an SME finance team doesn't need an ML team to deploy AI — the heavy lifting belongs in code, not in a prompt.
Four rules that hold up under audit
If you're going to put AI anywhere near financial reporting, the rules below are the difference between a tool that earns its keep and a liability you have to babysit.
1. The model never computes a number that goes into a deliverable. Every figure in a board pack, an investor update, or a budget variance commentary must come from a deterministic source. Your GL, your model, your validated spreadsheet. The LLM's role is to describe figures, not produce them. If you can't draw a clean line from any number in the output back to a cell, a SQL query, or a function in code, the number is suspect.
2. The model only sees structured, validated inputs. Don't paste raw exports. Don't dump a 40-tab workbook. Build a clean, computed set of figures (revenue by segment, GM by product line, the variances, the prior-period comparisons) and pass that across. The cleaner the input, the smaller the surface area for hallucination. Models confabulate when they're asked to make sense of mess. Don't give them mess.
3. Every output is reconciled to source. Whatever the model produces, you (or your tool) verify the figures it cites still match the source. If a number appears in the commentary that wasn't in the input, you've caught a hallucination before it hits the board. Reconciliation can be automated. It should be (there's a sketch of one after these rules). A human-in-the-loop check that depends on someone remembering to spot-check is a check that fails the first quarter someone's busy.
4. The model is contained to the narrative layer. Computation, validation, and reconciliation are done by code. The model writes the explanation. That separation is the entire game.
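Here's the reconciliation sketch promised in rule 3. It's a toy, and it assumes the figures dict from the earlier sketch; a production check would handle formatted variants and tolerances, but the principle is the same: the commentary is checked against the computation, never the other way round.

```python
# Rule 3 as code: any number the model cites that isn't a computed
# figure is a hallucination, caught before it hits the board.
import re

def unreconciled(commentary: str, figures: dict[str, float]) -> list[str]:
    allowed = {abs(float(v)) for v in figures.values()}
    cited = re.findall(r"\d[\d,]*(?:\.\d+)?", commentary)
    return [n for n in cited if float(n.replace(",", "")) not in allowed]

figures = {"revenue_eur": 4_700_000, "revenue_growth_pct": 11.6, "margin_change_pts": -2.3}
draft = "Revenue of 4,700,000 grew 11.6%, while gross margin fell 2.3 points."
assert unreconciled(draft, figures) == []

# Had the model drifted to 4,200,000, the check flags it:
assert unreconciled(draft.replace("4,700,000", "4,200,000"), figures) == ["4,200,000"]
```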
Anyone who tells you "but with a better prompt…" hasn't worked in finance. You can't prompt your way past tokenisation. You can constrain, instruct, structure, and guide the model all you want. It will still occasionally generate "the quarter saw revenue of €4.2M" when the input said €4.7M, because at the level of token probabilities, those two strings are close neighbours.
When ChatGPT is genuinely useful for finance
There is a real role for general-purpose LLMs in finance, and it isn't the one most people are using them for.
ChatGPT is excellent for drafting an explanation of a tricky technical concept (revenue recognition under a particular IFRS treatment, say), summarising a long document you already know is correct, rewriting your dense first-draft commentary into something a non-finance reader will understand, or brainstorming what questions a board might ask about a set of numbers you've already validated. It's also fine as a sounding board: "what's the right way to phrase a covenant breach disclosure", or "give me three openings for a memo on margin compression". It's a writing aid.
It's terrible at anything requiring arithmetic. It's terrible at looking at a column of numbers and telling you what's true. It's terrible at any task where you can't easily check the answer afterwards.
The shorthand: use it for language, not for numbers. The moment a number's accuracy depends on the model getting it right, you've moved back into that 60–70% danger zone.
This is also why "just paste your trial balance into ChatGPT" has become the most dangerous advice in the finance LinkedIn-sphere. It produces output that looks expert. It produces wrong answers with the same confidence as right ones. And it's one screenshot away from a credibility hit you don't get back.
What this means for the platforms you'll actually trust
The reason Stratavor is built the way it is, and why the engineering investment over the past eighteen months has been so heavy, comes down to this exact problem.
The Stratavor architecture computes everything deterministically first. KPIs, variances, peer benchmarks, statistical trend significance, ratio analysis. All calculated in a canonical engine, in code, against a validated data model. The figures are what they are. They're reproducible, auditable, and traceable to source.
Only after the computation is done does the language model see anything. It receives the computed figures, the trends, the benchmarks, and writes the commentary. It writes well, because that's what models are good at. It never gets to invent a number.
That's why an output from Stratavor reconciles. The figures in the narrative match the figures in the underlying tables, every time. Not because the prompt is clever. Because the model was never asked to do the maths.
The same architecture is what makes the Power BI integration work. The numbers in your dashboards stay the numbers. The commentary above them is generated, but the figures themselves are the ones you'd compute by hand if you had the time. No drift. No invented segments. No quiet hallucinations of a region you don't operate in.
Most of the AI-in-finance market is still arguing about prompts. The teams that have skipped to the actual answer are the ones building the deterministic spine first and treating the model as a writer, not a calculator.
If your current workflow involves pasting numbers into a chat window and trusting the output, you already know what's coming. It's a question of when, not if.
Better to fix the architecture before the board meeting where the variance was always going to be 142%.
FAQ
Why does ChatGPT get financial calculations wrong?
ChatGPT and other large language models don't perform arithmetic. They predict the next token (a fragment of text) based on patterns in their training data. The number 1,247,832 isn't stored as a number — it's stored as several tokens the model has seen alongside other token sequences. Independent benchmarks of LLMs on financial reasoning tasks land in the 60–70% accuracy range, which means roughly one in three figures in a pure-LLM workflow is wrong, missing, or invented. The errors are usually fluent and confidently presented, which is why they slip past a quick read.
Can better prompt engineering fix LLM accuracy on finance numbers?
No. The error mode is structural, not instructional. At the level of token probabilities, "€4.2M" and "€4.7M" are close neighbours, and you cannot constrain that away with prompt rules. Built-in code interpreters help with simple arithmetic, but the model still has to decide which numbers to compute and which formulas to apply, and each of those decisions is itself a token prediction. The fix is architectural: compute deterministically, then let the model narrate.
What is the final-mile LLM pattern in finance?
The final-mile LLM pattern means computing every number deterministically — in code, in a spreadsheet, or in a database — and only at the very end handing those validated figures to the language model to write commentary, executive summaries, or variance explanations. The model never touches the maths; it only writes the words. The accuracy ceiling becomes whatever your computation layer is, not the model's token-prediction capability.
When is ChatGPT actually safe to use for finance work?
ChatGPT is excellent for language tasks: drafting an explanation of a technical concept like an IFRS treatment, summarising a long document you've already validated, rewriting dense first-draft commentary for a non-finance reader, or brainstorming the questions a board might ask about numbers you've already validated. Use it for language, not for numbers. The moment a number's accuracy depends on the model getting it right, you're back in the 60–70% danger zone.
How do you reconcile AI-generated finance output to source?
Every figure in the output should be traceable to a cell in a spreadsheet, a row in a table, a SQL query, or a function in code. If a number appears in the AI-generated commentary that wasn't in the structured input, you've caught a hallucination. Reconciliation can and should be automated — a check that depends on someone remembering to spot-check is a check that fails the first quarter someone's busy.
What are the rules for using AI in board reporting?
Four rules. First, the model never computes a number that goes into a deliverable — every figure comes from a deterministic source. Second, the model only sees structured, validated inputs (not raw exports or 40-tab workbooks). Third, every output is reconciled to source so hallucinations are caught before they hit the board. Fourth, the model is contained to the narrative layer — computation, validation, and reconciliation are done by code, and the model only writes the explanation.
If your finance team is still pasting figures into a chat window and hoping for the best, book a 20-minute demo to see what a final-mile architecture produces against your own numbers. Or start a free trial and have a reconciled board pack inside your first close.