The Numeric Hallucination Problem
Language models are trained to predict plausible text, not to perform arithmetic. Ask an LLM to calculate 17% of 3,847 and it will give you a confident, plausible-sounding, and occasionally wrong answer. In a financial agent this is unacceptable.
The Constrained Tool Pattern
Our solution: the LLM never does arithmetic. It calls a calculate(expression) tool that evaluates the expression in Python and returns the result. The LLM is only responsible for identifying which numbers to use and what operation to perform.
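To make the pattern concrete, here's a minimal sketch of how the tool might be declared to the model, assuming an OpenAI-style function-calling API (the schema layout follows that convention and isn't specific to our stack):

# Hypothetical tool declaration, assuming OpenAI-style
# function calling; adapt to your framework's schema.
CALCULATE_TOOL = {
    "type": "function",
    "function": {
        "name": "calculate",
        "description": "Evaluate an arithmetic expression and return the numeric result.",
        "parameters": {
            "type": "object",
            "properties": {
                "expression": {
                    "type": "string",
                    "description": "Pure arithmetic, e.g. '0.17 * 3847'",
                }
            },
            "required": ["expression"],
        },
    },
}

The description nudges the model toward pure arithmetic; everything else is enforced server-side by the whitelist in the implementation below.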
import ast

def calculate(expression: str) -> float:
    """Evaluate a pure-arithmetic expression; reject anything else."""
    tree = ast.parse(expression, mode="eval")
    # Whitelist arithmetic node types only. ast.Constant covers numeric
    # literals (ast.Num is deprecated); USub/UAdd permit signed numbers.
    allowed = (ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant,
               ast.Add, ast.Sub, ast.Mult, ast.Div, ast.Pow, ast.Mod,
               ast.USub, ast.UAdd)
    for node in ast.walk(tree):
        if not isinstance(node, allowed):
            raise ValueError(f"Unsafe expression: {expression}")
        # Constants must be numbers -- no strings, bytes, etc.
        if isinstance(node, ast.Constant) and not isinstance(node.value, (int, float)):
            raise ValueError(f"Unsafe expression: {expression}")
    return float(eval(compile(tree, "<calculate>", "eval")))
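A quick sanity check, using the 17%-of-3,847 example from above (output comments are ours):

# The happy path: the model extracts the numbers, the tool does the math.
print(calculate("0.17 * 3847"))   # 653.99 (subject to float rounding)

# Anything beyond arithmetic is rejected before evaluation.
try:
    calculate("__import__('os').system('rm -rf /')")
except ValueError as e:
    print(e)   # Unsafe expression: __import__('os').system('rm -rf /')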
Why Not Code Interpreter?
Code interpreter works but is slow (2–4 second cold start) and expensive. Our calculate() tool responds in under 5 ms. For an agent processing 200 financial queries per hour, this matters.
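If you want to reproduce the latency comparison on your own hardware, a rough micro-benchmark is a few lines (timings vary by machine; this sketch is ours, not part of the production agent):

import timeit

# Average wall time per call over 1,000 evaluations.
per_call = timeit.timeit(lambda: calculate("0.17 * 3847"), number=1000) / 1000
print(f"{per_call * 1000:.3f} ms per call")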
Results
After adding constrained tools, our financial agent's numeric accuracy went from 91% to 99.7% on our benchmark set. The 0.3% failures are all edge cases involving number formatting (lakhs vs millions), not arithmetic errors.