Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters

Code, model input/output and cached evaluation results for our ACL-23 paper "Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters" by Boshi Wang, Sewon Min, Xiang Deng, Jiaming Shen, You Wu, Luke Zettlemoyer and Huan Sun.

Overview

While Chain-of-Thought (CoT) prompting can improve reasoning in large language models (LLMs), little is understood about what makes it effective. We perform a series of ablation studies on two representative benchmarks where CoT brings large improvements, revealing the impact of different aspects of CoT demonstrations. We find that

  • CoT reasoning is possible even with invalid demonstrations: prompting with invalid reasoning steps can achieve over 80-90% of the performance obtained using CoT under various metrics, while still generating coherent lines of reasoning during inference.
  • Other aspects of the rationales, such as being relevant to the query and correctly ordering the reasoning steps, are much more important for effective CoT reasoning.

Overall, these findings open up new questions regarding LLMs' capability to learn to reason in context, and invite reflection on how few-shot reasoning is benchmarked.
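
For concreteness, the sketch below illustrates the structure being ablated: few-shot CoT demonstrations are (question, rationale, answer) triples prepended to the test question, and the invalid-reasoning ablation keeps the questions and final answers but replaces the rationales with logically invalid steps. The names and examples here are hypothetical; the actual prompts used in the paper are in prompts_*/.

# Illustrative sketch only; see prompts_*/ for the demonstrations actually used.
def build_cot_prompt(demonstrations, test_question):
    # Concatenate few-shot (question, rationale, answer) demonstrations, then the test question.
    parts = [f"Q: {q}\nA: {rationale} The answer is {answer}."
             for q, rationale, answer in demonstrations]
    parts.append(f"Q: {test_question}\nA:")
    return "\n\n".join(parts)

# Hypothetical example of a valid demonstration vs. its invalid-reasoning counterpart:
valid_demo = ("There are 3 cars and 2 more arrive. How many cars are there?",
              "There are 3 cars originally. 2 more cars arrive. 3 + 2 = 5.", "5")
invalid_demo = ("There are 3 cars and 2 more arrive. How many cars are there?",
                "There are 2 cars originally. 5 more cars leave. 2 + 2 = 5.", "5")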

Citation

If you find our code or paper useful, please cite the paper:

@inproceedings{wang2023towards,
  title={Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters},
  author={Wang, Boshi and Min, Sewon and Deng, Xiang and Shen, Jiaming and Wu, You and Zettlemoyer, Luke and Sun, Huan},
  booktitle={The 61st Annual Meeting of the Association for Computational Linguistics},
  year={2023}
}

Repo Tour

.
├── grade-school-math/                       # GSM8K dataset, from https://github.com/openai/grade-school-math
├── indices_800.json                         # Indices for the 800 GSM8K test examples used for evaluation 
├── Bamboogle Prerelease - Sheet1.csv        # Bamboogle dataset, from https://github.com/ofirpress/self-ask
├── Bamboogle Prerelease - Sheet1_inter.csv  # Annotated intermediate bridging entities for Bamboogle
├── utils.py                                 # Helper functions
├── prompts_*/                               # Full prompts for all settings in our experiments
├── main_*.py                                # Scripts for getting model predictions via OpenAI API
├── eval_*.ipynb                             # Evaluation scripts, including cached evaluation results
└── result_*/                                # Cached model prediction results 
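
As a hedged example of how these files fit together, assuming indices_800.json is a JSON list of integer indices into the GSM8K test split (check main_gsm8k.py and utils.py for the exact loading code, and note the test.jsonl path is assumed from the upstream grade-school-math layout):

import json

# Load the 800 test indices and the GSM8K test split shipped with the repo.
with open("indices_800.json") as f:
    test_indices = json.load(f)
with open("grade-school-math/grade_school_math/data/test.jsonl") as f:
    gsm8k_test = [json.loads(line) for line in f]

# Select the evaluation subset used in the paper.
subset = [gsm8k_test[i] for i in test_indices]
print(len(subset), subset[0]["question"][:80])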

Usage

First, put your OpenAI API key in a file named api_key.txt.
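
The scripts presumably read the key along these lines (a sketch; check main_*.py for the exact behavior):

import openai

# Read the API key from api_key.txt and register it with the (legacy) openai client.
with open("api_key.txt") as f:
    openai.api_key = f.read().strip()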

Run LLM generation

Details can be found in the parameter descriptions in main_*.py. For example, to run the invalid reasoning setting on GSM8K and Bamboogle:

python main_gsm8k.py --prompt_dir prompts_arithmetic/invalid_reasoning.txt --eng text-davinci-002 --num_test 800 --seed 1357 --temp 0.0 --test_ind indices_800.json
python main_bamboogle.py --prompt_dir prompts_bamboogle/invalid_reasoning.txt --eng text-davinci-002 --num_test -1 --seed 1357 --temp 0.0
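
Under the hood, each script builds the prompt from the chosen prompt file and queries the OpenAI completion API with greedy decoding. A rough, hypothetical sketch (not the repo's exact code):

import openai

with open("prompts_arithmetic/invalid_reasoning.txt") as f:
    prompt_prefix = f.read()

question = "There are 3 cars and 2 more arrive. How many cars are there?"
response = openai.Completion.create(
    engine="text-davinci-002",     # matches --eng above
    prompt=prompt_prefix + "\nQ: " + question + "\nA:",
    max_tokens=256,
    temperature=0.0,               # matches --temp 0.0 (greedy decoding)
    stop=["\nQ:"],                 # hypothetical stop sequence; see main_gsm8k.py
)
print(response["choices"][0]["text"])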

Evaluation

eval_*.ipynb contains the scripts and cached evaluation results.
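
For GSM8K-style outputs, evaluation typically extracts the final number after "The answer is" and compares it to the gold answer. A hedged sketch of that step (the notebooks contain the authors' actual evaluation code and cached results):

import re

def extract_answer(completion):
    # Take the number following "The answer is", stripping commas and a trailing period.
    match = re.search(r"[Tt]he answer is\s*\$?(-?[\d,\.]+)", completion)
    if match is None:
        return None
    return match.group(1).replace(",", "").rstrip(".")

def accuracy(predictions, gold_answers):
    correct = sum(extract_answer(p) == g for p, g in zip(predictions, gold_answers))
    return correct / len(gold_answers)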
