Code implementation and data for the paper "Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning" by Akshara Prabhakar, Thomas L. Griffiths, and R. Thomas McCoy.
We construct a dataset of seven-letter words divided into 5 probability bins (bin1 to bin5), each containing around 150 words. The first 100 words in each bin are used to evaluate GPT-4; the remaining words are used to evaluate the logistic regression model fitted on those first 100. The binning is based on the log probability that GPT-2 assigns to each word.
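For illustration, the per-word GPT-2 log probability used for binning could be computed along the lines of the sketch below. This is a minimal sketch assuming the Hugging Face transformers library; `word_log_prob` is our name, not a function from this repository:

```python
# Rough sketch (not the repository's actual preprocessing code) of
# scoring a word's log probability under GPT-2.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def word_log_prob(word: str) -> float:
    """Total log probability GPT-2 assigns to the word's tokens."""
    # Prepend BOS so even the word's first token is conditioned on context.
    ids = tokenizer.encode(tokenizer.bos_token + word, return_tensors="pt")
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    # Pick out the log probability of each actual next token.
    scores = log_probs[torch.arange(ids.size(1) - 1), ids[0, 1:]]
    return scores.sum().item()
```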
The seven-letter word dataset is in the `seven_letter_words/` directory:
- `bin1_prob.txt`
- `bin2_prob.txt`
- `bin3_prob.txt`
- `bin4_prob.txt`
- `bin5_prob.txt`
Using the seven-letter word dataset, we prepare stimuli: shift-cipher-encoded versions of the words from the 5 probability bins across 25 shift levels (1 to 25).
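As a reminder, a shift cipher with shift level k replaces each letter with the letter k positions later in the alphabet, wrapping around from z to a. A minimal sketch of the encoding (the repository's actual implementation lives in stimulus_generator.py and may differ in detail):

```python
# Minimal shift-cipher sketch; the repository's actual encoding is in
# stimulus_generator.py and may differ in detail.
def shift_encode(word: str, shift_level: int) -> str:
    """Shift each lowercase letter forward by shift_level, wrapping z -> a."""
    return "".join(
        chr((ord(c) - ord("a") + shift_level) % 26 + ord("a"))
        for c in word.lower()
    )

print(shift_encode("example", 1))  # -> fybnqmf
```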
The stimuli are prepared for the different types of prompts we use: `standard`, `text_cot`, `math_cot`, and `number_cot`.
The stimuli can be created by running:

```
python stimulus_generator.py --prompt_type <prompt_type>
```

where `<prompt_type>` is one of the four prompt types above, e.g. `text_cot`.
Evaluation scripts for each model:
- GPT-4: `run_openai.py`
- Llama 3.1: `run_llama.py`
- Claude 3: `run_claude.py`
Set the appropriate OpenAI, Together, and Anthropic API keys in your environment before running evaluations, for example:
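(The variable names below are each SDK's default; the exact names the scripts read are an assumption, so check the scripts if a key is not picked up.)

```
export OPENAI_API_KEY="sk-..."
export TOGETHER_API_KEY="..."
export ANTHROPIC_API_KEY="..."
```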
For example, to run experiments on GPT-4 with Text-CoT at shift_level=1 across all 5 bins, run:

```
python run_openai.py --tasks text_cot1 --conditions bin1,bin2,bin3,bin4,bin5 --max_tokens 200 --prompt_type text_cot
```
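Under the hood, run_openai.py presumably issues chat-completion requests along these lines; the model name, prompt wording, and message structure below are illustrative assumptions, not the script's actual code:

```python
# Sketch of the kind of API call run_openai.py might make; the prompt
# wording and parameters here are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    max_tokens=200,
    messages=[{
        "role": "user",
        "content": "Decode this shift cipher (shift level 1), "
                   "reasoning step by step: fybnqmf",
    }],
)
print(response.choices[0].message.content)
```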
To evaluate the generations, run:

```
python eval.py --prompt_type text_cot --create_stats_table
```

Run this after evaluating GPT-4 across all shift levels and bins. It generates the evaluation statistics for text_cot across all shift levels, as well as the `{prompt_type}_train_table.tsv` file, the training statistics table used to fit the logistic regression.
The logistic regression is implemented in R in `regression.ipynb`. The predictions on the test set are saved in `regression/text_cot_test_results.tsv`.
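For readers who prefer Python to R, an equivalent fit might look like the sketch below; the column names (`correct`, `log_prob`, `shift_level`) are hypothetical, so consult the generated train table for the actual schema:

```python
# Sketch of an equivalent logistic regression fit in Python (the
# repository's actual fit is in R, in regression.ipynb). Column names
# are hypothetical; check {prompt_type}_train_table.tsv for the schema.
import pandas as pd
import statsmodels.formula.api as smf

train = pd.read_csv("text_cot_train_table.tsv", sep="\t")

# Predict whether GPT-4 decoded a word correctly from the word's
# GPT-2 log probability and the cipher's shift level.
fit = smf.logit("correct ~ log_prob + shift_level", data=train).fit()
print(fit.summary())
```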
All model generations and outputs are stored in the `logs/` directory.
If you find this repository helpful, feel free to cite our publication:
```bibtex
@misc{prabhakar2024decipheringfactorsinfluencingefficacy,
      title={Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning},
      author={Akshara Prabhakar and Thomas L. Griffiths and R. Thomas McCoy},
      year={2024},
      eprint={2407.01687},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.01687},
}
```