Source codes and datasets for How well do Large Language Models perform in Arithmetic tasks?
Full evaluation of all size models.
MATH 401 = 1 Euler Equation + 16 group * 25 problems
- Euler Equation.
- Add & Subtract of two integers within 10.
- Add & Subtract of two integers within 100.
- Add & Subtract of two integers within 1,000.
- Add & Subtract of two integers within 1,000,000,000,000.
- Add & Subtract of two integers within -10~10.
- Add & Subtract of two decimal numbers within -100~100.
- Multiply two integers within 100.
- Multiply two decimal numbers within 10.
- Multiply two integers within 100,000.
- Division of two integers within 100.
- Exponentiation of with integer base within 10 and integer exponent within 2~4.
- Exponentiation of with a decimal number within 10 as the base and a decimal number within 2~4 as the exponent.
- Add, Subtract & Multiply with one integer within 10 and a common irrational number (i.e.
$e$ or$\pi$ ). - Long arithmetic expressions with brackets, involved integers are all within 100 and operators contain add, subtract, multiply, and division.
- Trigonometry functions including
$\sin$ ,$\cos$ , and$\tan$ . Inputs can be in the format of degrees and radians ($\pi$ can also appear in the inputs). - Logarithm of integers within 1000 of different bases:
$2,e,10$ .
If the difference between the decoded number and the target number is less than
We denote decoded number is
If LLM does not decode any number, we consider
If decoded content does not contain any numbers, we consider it a failure. We calculate the non-number ratio based on it.
@misc{yuan2023large,
title={How well do Large Language Models perform in Arithmetic tasks?},
author={Zheng Yuan and Hongyi Yuan and Chuanqi Tan and Wei Wang and Songfang Huang},
year={2023},
eprint={2304.02015},
archivePrefix={arXiv},
primaryClass={cs.CL}
}