The Numeric Hallucination Problem
Language models are trained to predict plausible text, not to perform arithmetic. Ask an LLM to calculate 17% of 3,847 and it will give you a confident, plausible-sounding, and occasionally wrong answer. In a financial agent this is unacceptable.
The Constrained Tool Pattern
Our solution: the LLM never does arithmetic. It calls a calculate(expression) tool that evaluates the expression in Python and returns the result. The LLM is only responsible for identifying which numbers to use and what operation to perform.
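To make the pattern concrete, here's a minimal sketch of how the tool might be declared to the model, assuming an OpenAI-style function-calling API (the schema layout follows that convention and isn't specific to our stack):

# Hypothetical tool declaration, assuming OpenAI-style
# function calling; adapt to your framework's schema.
CALCULATE_TOOL = {
    "type": "function",
    "function": {
        "name": "calculate",
        "description": "Evaluate an arithmetic expression and return the numeric result.",
        "parameters": {
            "type": "object",
            "properties": {
                "expression": {
                    "type": "string",
                    "description": "Pure arithmetic, e.g. '0.17 * 3847'",
                }
            },
            "required": ["expression"],
        },
    },
}

The description nudges the model toward pure arithmetic; everything else is enforced server-side by the whitelist in the implementation below.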
import ast

def calculate(expression: str) -> float:
    """Evaluate a pure-arithmetic expression; reject anything else."""
    tree = ast.parse(expression, mode="eval")
    # Whitelist arithmetic node types only. ast.Constant covers numeric
    # literals (ast.Num is deprecated); USub/UAdd permit signed numbers.
    allowed = (ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant,
               ast.Add, ast.Sub, ast.Mult, ast.Div, ast.Pow, ast.Mod,
               ast.USub, ast.UAdd)
    for node in ast.walk(tree):
        if not isinstance(node, allowed):
            raise ValueError(f"Unsafe expression: {expression}")
        # Constants must be numbers -- no strings, bytes, etc.
        if isinstance(node, ast.Constant) and not isinstance(node.value, (int, float)):
            raise ValueError(f"Unsafe expression: {expression}")
    return float(eval(compile(tree, "<calculate>", "eval")))
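A quick sanity check, using the 17%-of-3,847 example from above (output comments are ours):

# The happy path: the model extracts the numbers, the tool does the math.
print(calculate("0.17 * 3847"))   # 653.99 (subject to float rounding)

# Anything beyond arithmetic is rejected before evaluation.
try:
    calculate("__import__('os').system('rm -rf /')")
except ValueError as e:
    print(e)   # Unsafe expression: __import__('os').system('rm -rf /')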
Why Not Code Interpreter?
Code interpreter works but is slow (2–4 second cold start) and expensive. Our calculate() tool responds in under 5 ms. For an agent processing 200 financial queries per hour, this matters.
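If you want to reproduce the latency comparison on your own hardware, a rough micro-benchmark is a few lines (timings vary by machine; this sketch is ours, not part of the production agent):

import timeit

# Average wall time per call over 1,000 evaluations.
per_call = timeit.timeit(lambda: calculate("0.17 * 3847"), number=1000) / 1000
print(f"{per_call * 1000:.3f} ms per call")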
Results
After adding constrained tools, our financial agent's numeric accuracy went from 91% to 99.7% on our benchmark set. The 0.3% failures are all edge cases involving number formatting (lakhs vs millions), not arithmetic errors.