This repository contains the evaluation code for the paper "A Careful Examination of Large Language Model Performance on Grade School Arithmetic" (Zhang et al., 2024).
GSM1k is a novel dataset designed to measure potential overfitting of language models on the established GSM8k benchmark. Our work provides insights into the true mathematical reasoning capabilities of current state-of-the-art language models. This code is based off of a fork of EleutherAI's lm-evaluation-harness and is released under the MIT license.
At present, we release only 50 examples from GSM1k to mitigate concerns about data contamination. When three open-source models of different lineages reach 95%+ accuracy on GSM1k, we commit to releasing the full set. The code is provided as-is, and no further development is intended beyond the full release of GSM1k once the paper's conditions are met.
conda create --name gsm1k python=3.10
conda activate gsm1k
pip3 install -e .
pip3 install -r requirements.txt
pip3 install transformers # you will likely want the latest version of transformers
pip3 install flash-attn==2.5.8 --no-build-isolation
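As an optional sanity check after installation, you can confirm that the key dependencies import cleanly and that the GSM tasks are visible to the harness. This is a minimal sketch; the second command assumes your lm-evaluation-harness version supports listing tasks via --tasks list, and the exact task names printed may differ by version.

python3 -c "import torch, transformers, flash_attn; print(transformers.__version__)"
lm_eval --tasks list | grep -i gsm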
Example: evaluating Meta-Llama-3-8B-Instruct on GSM8k and GSM1k
lm_eval --model vllm --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct,tensor_parallel_size=8,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=1,trust_remote_code=True --batch_size auto --tasks gsm8k --seed 0 --gen_kwargs temperature=0.0 --output_path Meta-Llama-3-8B-Instruct_gsm8k_temperature=0.0_results.json --log_samples
lm_eval --model vllm --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct,tensor_parallel_size=8,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=1,trust_remote_code=True --batch_size auto --tasks gsm1k --seed 0 --gen_kwargs temperature=0.0 --output_path Meta-Llama-3-8B-Instruct_gsm1k_temperature=0.0_results.json --log_samples
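Before launching a full multi-GPU run, it can be useful to do a quick end-to-end check on a handful of examples with the single-process HuggingFace backend. This is a sketch, not the configuration used in the paper; the hf backend and the --limit flag are standard lm-evaluation-harness options, but exact flag behavior may differ slightly across harness versions.

lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct,dtype=auto,trust_remote_code=True --batch_size auto --tasks gsm8k --seed 0 --gen_kwargs temperature=0.0 --limit 10 --output_path Meta-Llama-3-8B-Instruct_gsm8k_limit10_results.json --log_samples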
Note: due to API changes (the latest model API versions changing over time), nondeterminism of API results, and the slightly different versions of the Hugging Face transformers library required to run each model, the precise results may not be exactly reproducible.
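When comparing runs despite this noise, it may help to pull out just the aggregate metrics from each output JSON. A minimal sketch assuming jq is installed; the metric names inside .results (e.g. exact_match,strict-match) depend on the harness version and task config, and depending on your version --log_samples may cause results to be written into a directory rather than the exact file named above, so adjust the path accordingly.

jq '.results' Meta-Llama-3-8B-Instruct_gsm8k_temperature=0.0_results.json
jq '.results' Meta-Llama-3-8B-Instruct_gsm1k_temperature=0.0_results.json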