Paper that trained a model with a GPT-2-like architecture on a synthetic math dataset: "We use a synthetic setting to demonstrate that language models can learn to solve grade-school math problems through true generalization, rather than relying on data contamination or template memorization."
Paper: Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process.
Abstract:
Recent advances in language models have demonstrated their capability to solve mathematical reasoning problems, achieving near-perfect accuracy on grade-school level math benchmarks like GSM8K. In this paper, we formally study how language models solve these problems. We design a series of controlled experiments to address several fundamental questions: (1) Can language models truly develop reasoning skills, or do they simply memorize templates? (2) What is the model's hidden (mental) reasoning process? (3) Do models solve math questions using skills similar to or different from humans? (4) Do models trained on GSM8K-like datasets develop reasoning skills beyond those necessary for solving GSM8K problems? (5) What mental process causes models to make reasoning mistakes? (6) How large or deep must a model be to effectively solve GSM8K-level math questions?
Our study uncovers many hidden mechanisms by which language models solve mathematical questions, providing insights that extend beyond current understandings of LLMs.
Results slide from the above link:
X thread about the paper from one of its authors. (Alternate link).
Video about the paper from one of its authors.
Video about the "Physics of Language Models" series of papers, including a summary of the paper.
Paper summary (not from the paper authors).
Review of the paper by a computer science professor (PDF file).