Made in Kazakhstan - Қазақстанда жасалған
This framework provides a standardized approach to evaluating Large Language Models (LLMs) using established benchmarks. It implements a systematic process for testing model performance across various cognitive and technical tasks.
The benchmarking process follows these key steps, illustrated by the sketch after the list:
- Input Data: Collection of benchmark-specific datasets
- Prompt Generation: Creation of tailored task-specific prompts
- Model Execution: Processing prompts through the LLM
- Evaluation: Comparison with ground truth using appropriate metrics
- Result Aggregation: Computing and storing performance metrics
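The minimal sketch below illustrates this flow in simplified Python; the function names and signatures are hypothetical and do not correspond to the repository's actual modules.

```python
# Hypothetical sketch of the benchmarking flow; names do not mirror the repository's modules.
from typing import Callable, Iterable

def run_benchmark(
    dataset: Iterable[dict],                  # Input Data: benchmark-specific examples
    build_prompt: Callable[[dict], str],      # Prompt Generation: task-specific template
    generate: Callable[[str], str],           # Model Execution: LLM inference call
    score: Callable[[str, dict], float],      # Evaluation: compare output to ground truth
) -> dict:
    scores = []
    for example in dataset:
        prompt = build_prompt(example)
        output = generate(prompt)
        scores.append(score(output, example))
    # Result Aggregation: compute and store the aggregate metric
    return {"metric": sum(scores) / max(len(scores), 1), "n_examples": len(scores)}
```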
Before running the framework, verify the following prerequisites:

- Check if Docker is installed with `docker --version`. If Docker is not installed, refer to the official Docker installation guide.
- Check if Docker Compose is installed with `docker-compose --version`. If Docker Compose is not installed, refer to the official Docker Compose installation guide.
- Check if CUDA and GPUs are available with `nvidia-smi`. If CUDA is not configured or GPUs are not detected, refer to the CUDA Toolkit Installation Guide.
- Check if NVIDIA Docker is installed with `nvidia-docker --version`. If NVIDIA Docker is not installed, run `make install_nvidia_docker` in the project root directory (where the `Makefile` is located). For further information, refer to the NVIDIA Docker installation guide.
- Configure GPU access: in the `docker-compose.yaml` file, set the `NVIDIA_VISIBLE_DEVICES` environment variable to specify the GPUs you want to use.
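Once the container is running, a short sanity check can confirm which GPUs it actually sees. This assumes PyTorch is available in the image, which this README does not state explicitly:

```python
# Sanity check inside the container; assumes PyTorch is installed in the image.
import os
import torch

print("NVIDIA_VISIBLE_DEVICES =", os.environ.get("NVIDIA_VISIBLE_DEVICES", "<not set>"))
print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
```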
Edit the `conf/parameters_benchmark.yaml` file to set your desired configurations for benchmarking.
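The snippet below simply loads and prints the current configuration before a run; it makes no assumption about the specific keys, which are defined by the repository:

```python
# Prints the current benchmark configuration; key names are defined by the repository.
import yaml

with open("conf/parameters_benchmark.yaml") as f:
    config = yaml.safe_load(f)

for key, value in (config or {}).items():
    print(f"{key}: {value}")
```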
To download the required datasets for benchmarking, run:
`make run_via_compose DIR=src/utils/download_dataset.py`
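For reference, these benchmark datasets are publicly hosted on the Hugging Face Hub; the sketch below shows how two of them could be fetched directly, although `src/utils/download_dataset.py` may use different sources or caching logic:

```python
# Sketch: fetching two of the benchmark datasets from the Hugging Face Hub.
# The repository's src/utils/download_dataset.py may use different sources or caching.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all")    # multiple-choice questions across 57 domains
gsm8k = load_dataset("gsm8k", "main")      # grade-school math word problems
print(mmlu["test"][0])
print(gsm8k["test"][0])
```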
If Docker images need to be built, run:
`make build_via_compose`
To start the benchmarking process, run:
`make run_via_compose DIR=src/main.py`
The framework evaluates models on the following benchmarks:

MMLU
- Description: Tests knowledge across 57 domains, including STEM, humanities, and social sciences
- Input: Multiple-choice questions (A, B, C, D)
- Output: Single letter selection
- Metric: Accuracy
- Shot Setting: Zero-shot
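A zero-shot prompt for this kind of item could be assembled as in the hypothetical sketch below; the framework's actual template is not reproduced here:

```python
# Hypothetical zero-shot multiple-choice prompt; the framework's actual template may differ.
def build_multiple_choice_prompt(question: str, choices: list[str]) -> str:
    header = "Answer the following multiple-choice question with a single letter (A, B, C, or D)."
    options = [f"{letter}. {choice}" for letter, choice in zip("ABCD", choices)]
    return "\n".join([header, "", question, *options, "Answer:"])

print(build_multiple_choice_prompt(
    "Which planet is known as the Red Planet?",
    ["Venus", "Mars", "Jupiter", "Saturn"],
))
```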

ARC
- Description: Evaluates logical reasoning and domain knowledge
- Input: Question with four options (A, B, C, D)
- Output: Single letter selection
- Metric: Accuracy
- Shot Setting: Zero-shot

HellaSwag
- Description: Tests sentence completion plausibility
- Input: Context with four possible endings
- Output: Number selection (1-4)
- Metric: Accuracy
- Shot Setting: Zero-shot

Winogrande
- Description: Assesses commonsense reasoning through sentence completion
- Input: Sentence with blank and two options
- Output: Number selection (1 or 2)
- Metric: Accuracy
- Shot Setting: Zero-shot

GSM8K
- Description: Evaluates multi-step mathematical problem-solving
- Input: Math problem with three solved examples
- Output: Numerical answer
- Metric: Numerical accuracy
- Shot Setting: Three-shot chain-of-thought
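Scoring typically means extracting the final number from the model's chain-of-thought and comparing it with the reference answer; the sketch below uses a simple regular expression and may not match the repository's exact parsing rules:

```python
# Sketch of final-answer extraction for math word problems; the repository's parsing may differ.
import re

def extract_final_number(generation: str) -> str | None:
    """Return the last number mentioned in the model's reasoning, if any."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", generation.replace(",", ""))
    return numbers[-1] if numbers else None

def is_numerically_correct(generation: str, reference: str) -> bool:
    predicted = extract_final_number(generation)
    return predicted is not None and float(predicted) == float(reference)

print(is_numerically_correct("She earns 8 * 3 = 24 dollars, so the answer is 24.", "24"))  # True
```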

DROP
- Description: Tests reading comprehension and numerical reasoning
- Input: Passage and question
- Output: Text or numerical answer
- Metric: Exact match accuracy
- Shot Setting: Zero-shot
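Exact match is usually computed after light normalization of both strings (lowercasing, removing punctuation and articles, collapsing whitespace); the sketch below follows that common convention, which may differ in detail from the framework's implementation:

```python
# Common SQuAD/DROP-style answer normalization; the framework's exact rules may differ.
import re
import string

def normalize_answer(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)   # drop English articles
    return " ".join(text.split())                 # collapse whitespace

def exact_match(prediction: str, reference: str) -> bool:
    return normalize_answer(prediction) == normalize_answer(reference)

print(exact_match("The Eiffel Tower", "eiffel tower"))  # True
```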

HumanEval
- Description: Assesses Python code generation capabilities
- Input: Function definition prompt
- Output: Complete Python function
- Metric: Pass@1
- Shot Setting: Zero-shot
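Pass@1 is determined by executing the generated function against the benchmark's unit tests; the sketch below shows the idea only, since real harnesses run candidates in an isolated, time-limited subprocess rather than calling exec directly:

```python
# Sketch: checking one generated completion against HumanEval-style unit tests.
# Real harnesses sandbox this in a separate process with timeouts; never exec untrusted code directly.
def passes_unit_tests(candidate_code: str, test_code: str, entry_point: str) -> bool:
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)             # define the generated function
        exec(test_code, namespace)                  # define the benchmark's check(candidate)
        namespace["check"](namespace[entry_point])  # raises AssertionError on any failed test
        return True
    except Exception:
        return False
```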
The framework employs two primary shot settings:
- Zero-Shot: Used for most benchmarks
  - No examples provided
  - Clear task description and instructions only
- Three-Shot Chain-of-Thought: Used for GSM8K
  - Includes three worked examples
  - Guides step-by-step problem solving
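A few-shot chain-of-thought prompt is typically the worked examples concatenated ahead of the new problem; the sketch below is a generic illustration rather than the framework's exact format:

```python
# Generic few-shot prompt assembly; the framework's exact formatting may differ.
def build_few_shot_prompt(worked_examples: list[tuple[str, str]], question: str) -> str:
    parts = [
        f"Question: {demo_q}\nAnswer: {demo_a}"
        for demo_q, demo_a in worked_examples       # e.g. three solved GSM8K problems
    ]
    parts.append(f"Question: {question}\nAnswer:")  # the new problem, left for the model
    return "\n\n".join(parts)
```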
The framework uses various metrics depending on the benchmark:
- Accuracy: Used for MMLU, ARC, HellaSwag, and Winogrande
- Exact Match: Used for DROP (with normalization for formatting)
- Numerical Accuracy: Used for GSM8K
- Pass@1: Used for HumanEval
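With one completion generated per problem, pass@1 is simply the fraction of problems whose completion passes all tests; the general unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021) is shown below for reference:

```python
# Unbiased pass@k estimator (Chen et al., 2021): n samples per task, c of which passed.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=1, c=1, k=1))   # 1.0: the single completion passed
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```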