For things like this (extracting precise computed data from unstructured blobs) I find that it's often more effective to ask your AI tool to provide a program (I usually ask for an HTML page with a JS form, or a bookmarklet) that can do the actual math.
Otherwise you're just as likely to get hallucinated answers based on the AI model's existing biases and training (if it's an American model, it might start telling you the sentenced convicts are young, male, and non-white without even looking at the data on the page).
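As a rough sketch of what "ask for a program that does the math" can look like, here it is in Python rather than the HTML/JS form mentioned above. The `records` list and the `summarize` helper are hypothetical placeholders for whatever the tool extracts; the values are made up for illustration and are not the data from the page.

```python
from statistics import median

# Hypothetical records for illustration only; NOT the data from the page.
records = [
    {"age": 19, "sex": "M", "prison_months": 6},
    {"age": 27, "sex": "M", "prison_months": 12},
    {"age": 64, "sex": "F", "prison_months": 18},
]

def summarize(rows):
    """Compute the summary statistics deterministically, no model guessing involved."""
    ages = [r["age"] for r in rows]
    male = sum(1 for r in rows if r["sex"] == "M")
    return {
        "median_age": median(ages),
        "youngest": min(ages),
        "oldest": max(ages),
        "male_pct": round(100 * male / len(rows)),
        "median_prison_months": median(r["prison_months"] for r in rows),
    }

summary = summarize(records)
print(summary)
```

The model only has to get the extraction right; the arithmetic is done by code you can read and rerun.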
I have noticed AI getting much better at doing math correctly than it used to be. Not perfect, but nearly so, and a far cry from needing a generated program for most simple calculations. (My experience is largely with Claude.)
I've heard that there's been experimentation with giving thinking models access to tools inside the "thinking" part, so that e.g. calculations could use a Python interpreter, which would give the illusion that the model did the math correctly.
Not sure if it's just OpenAI or if Anthropic has tried this too.
They have that, but I have been reading that new models are already so good at math (solving complex math problems) that I am guessing it's generally not needed?
There is a large conceptual gap, though, between "solving a complex math problem", which is navigating through logic and reasoning, and "correctly predicting the next token in the multiplication of two or more large numbers".
E.g., if we've already worked out the premises that A is larger than B, and 2C is smaller than B, then you can easily compute the next token in the sentence "Therefore C is..."
versus computing 123,287,211 times 971,222, where computing the first token is non-trivial, and computing what comes after "11973" in the result is even less obvious. (It would be tremendously easier if you were predicting the result backwards, starting with the last digit.)
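To make the digit-order point concrete, here is a small Python check (a sketch, with variable names of my own choosing). The trailing digits of a product depend only on the trailing digits of the factors, which is why predicting from the right is easier, while the leading digits depend on every digit plus all the carries.

```python
a = 123_287_211
b = 971_222

# Python integers are arbitrary precision, so this product is exact.
product = a * b
print(product)  # 119739251641842

# The last k digits of the product depend only on the last k digits of each factor.
k = 3
low_digits = ((a % 10**k) * (b % 10**k)) % 10**k
print(low_digits)  # 842, from 211 * 222 = 46842

# The leading digits have no such shortcut: they need the full computation.
print(str(product)[:5])  # 11973
```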
There is some evidence that models actually "plan ahead" somewhat (something like guessing more than one token at a time, e.g. when writing a line in a poem, the model has an "idea" of what the ending word will be), but there are limits to the reliability of that compared with using a calculator tool.
Median age: 27
Male: 98%
Female: 2%
Youngest: 19
Oldest: 64
Median prison time: 12 months
Median probation time: 24 months
All of the above was inferred by Copilot. Don't know why I need to know the above, though. :-)