
LLM Benchmarking Framework


Made in Kazakhstan - Қазақстанда жасалған

Overview

This framework provides a standardized approach to evaluating Large Language Models (LLMs) using established benchmarks. It implements a systematic process for testing model performance across various cognitive and technical tasks.

General Approach

The benchmarking process follows these key steps (a minimal code sketch of the loop appears after the list):

  1. Input Data: Collection of benchmark-specific datasets
  2. Prompt Generation: Creation of task-specific prompts tailored to each benchmark
  3. Model Execution: Processing prompts through the LLM
  4. Evaluation: Comparison with ground truth using appropriate metrics
  5. Result Aggregation: Computing and storing performance metrics
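
The sketch below spells out this loop end to end. It is purely illustrative and is not the framework's actual code; the dataset, prompt format, and query_model stand-in are hypothetical.

    # Minimal, self-contained sketch of the five-step benchmarking loop.
    # The dataset, prompt format, and model call are stand-ins for illustration only.

    def query_model(prompt: str) -> str:
        # Stand-in for a real LLM call.
        return "B"

    def run_benchmark(dataset: list[dict]) -> float:
        correct = 0
        for sample in dataset:                                  # 1. Input Data
            prompt = (                                          # 2. Prompt Generation
                f"{sample['question']}\n"
                + "\n".join(sample["choices"])
                + "\nAnswer with a single letter (A, B, C, D)."
            )
            prediction = query_model(prompt).strip()            # 3. Model Execution
            correct += int(prediction == sample["answer"])      # 4. Evaluation
        return correct / len(dataset)                           # 5. Result Aggregation

    dataset = [{"question": "2 + 2 = ?",
                "choices": ["A. 3", "B. 4", "C. 5", "D. 6"],
                "answer": "B"}]
    print(run_benchmark(dataset))  # 1.0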

Usage

Prerequisites

  1. Check if Docker is installed:

    docker --version

    If Docker is not installed, refer to the official Docker installation guide.

  2. Check if Docker Compose is installed:

    docker-compose --version

    If Docker Compose is not installed, refer to the official Docker Compose installation guide.

  3. Check if CUDA and GPUs are available:

    nvidia-smi

    If CUDA is not configured or GPUs are not detected, refer to the CUDA Toolkit Installation Guide.

  4. Check if NVIDIA Docker is installed:

    nvidia-docker --version

    If NVIDIA Docker is not installed, run the following command in the project root directory (where the Makefile is located):

    make install_nvidia_docker

    For further information, refer to the NVIDIA Docker installation guide.

  5. Configure GPU access: In the docker-compose.yaml file, set the NVIDIA_VISIBLE_DEVICES environment variable to specify the GPUs you want to use.
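
For example, limiting the containers to GPUs 0 and 1 might look like the snippet below. The service name is illustrative; adapt it to the service actually defined in docker-compose.yaml.

    services:
      benchmark:                          # illustrative service name
        environment:
          - NVIDIA_VISIBLE_DEVICES=0,1    # expose only GPUs 0 and 1 (use "all" for every GPU)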


Setting Configurations

Edit the conf/parameters_benchmark.yaml file to set the desired configuration for benchmarking.
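
The key names below are purely hypothetical and only illustrate the kind of settings such a file holds; consult conf/parameters_benchmark.yaml itself for the real options.

    # Hypothetical illustration only -- the actual keys are defined in
    # conf/parameters_benchmark.yaml.
    model_name_or_path: /models/my-llm
    benchmarks:
      - mmlu
      - gsm8k
    batch_size: 8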


Download Datasets

To download the required datasets for benchmarking, run:

make run_via_compose DIR=src/utils/download_dataset.py

Build Docker Images (if required)

If Docker images need to be built, run:

make build_via_compose

Run Benchmark

To start the benchmarking process, run:

make run_via_compose DIR=src/main.py

Benchmarks

The framework evaluates models on the following benchmarks:

  • MMLU

    • Description: Tests knowledge across 57 domains including STEM, humanities, and social sciences
    • Input: Multiple-choice questions (A, B, C, D)
    • Output: Single letter selection
    • Metric: Accuracy
    • Shot Setting: Zero-shot
  • ARC

    • Description: Evaluates logical reasoning and domain knowledge
    • Input: Question with four options (A, B, C, D)
    • Output: Single letter selection
    • Metric: Accuracy
    • Shot Setting: Zero-shot
  • HellaSwag

    • Description: Tests sentence completion plausibility
    • Input: Context with four possible endings
    • Output: Number selection (1-4)
    • Metric: Accuracy
    • Shot Setting: Zero-shot
  • Winogrande

    • Description: Assesses commonsense reasoning through sentence completion
    • Input: Sentence with a blank and two options
    • Output: Number selection (1 or 2)
    • Metric: Accuracy
    • Shot Setting: Zero-shot
  • GSM8K

    • Description: Evaluates multi-step mathematical problem-solving
    • Input: Math problem with three solved examples
    • Output: Numerical answer
    • Metric: Numerical accuracy
    • Shot Setting: Three-shot chain-of-thought
  • DROP

    • Description: Tests reading comprehension and numerical reasoning
    • Input: Passage and question
    • Output: Text or numerical answer
    • Metric: Exact match accuracy
    • Shot Setting: Zero-shot
  • HumanEval

    • Description: Assesses Python code generation capabilities
    • Input: Function definition prompt
    • Output: Complete Python function
    • Metric: Pass@1 (see the estimator sketch after this list)
    • Shot Setting: Zero-shot
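
For reference, Pass@k is commonly estimated from n generated samples per problem, of which c pass the unit tests, as 1 - C(n-c, k) / C(n, k); with a single sample per problem (n = k = 1) it reduces to the plain success rate. A minimal sketch of this standard estimator follows; it is not necessarily the framework's exact implementation.

    import math

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator: n samples generated per problem, c of them correct."""
        if n - c < k:
            return 1.0
        return 1.0 - math.comb(n - c, k) / math.comb(n, k)

    # With one sample per problem, pass@1 is simply the fraction of problems solved.
    print(pass_at_k(1, 1, 1))  # 1.0
    print(pass_at_k(1, 0, 1))  # 0.0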

Shot Settings

The framework employs two primary shot settings:

  • Zero-Shot: Used for most benchmarks

    • No examples provided
    • Clear task description and instructions only
  • Three-Shot Chain-of-Thought: Used for GSM8K (see the prompt-assembly sketch after this list)

    • Includes three worked examples
    • Guides step-by-step problem solving
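
As an illustration of the three-shot setting, a prompt can be assembled by prepending worked examples to the target question. The examples below are placeholders, not the exemplars used by the framework.

    # Illustrative three-shot prompt assembly; the worked examples are placeholders.
    EXAMPLES = [
        ("Tom has 3 apples and buys 2 more. How many apples does he have?",
         "Tom starts with 3 apples and buys 2 more, so 3 + 2 = 5. The answer is 5."),
        ("A book costs 8 dollars. How much do 4 books cost?",
         "Each book costs 8 dollars, so 4 books cost 4 * 8 = 32. The answer is 32."),
        ("Sara reads 12 pages a day. How many pages does she read in 3 days?",
         "She reads 12 pages per day for 3 days, so 12 * 3 = 36. The answer is 36."),
    ]

    def build_three_shot_prompt(question: str) -> str:
        shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in EXAMPLES)
        return f"{shots}\n\nQuestion: {question}\nAnswer:"

    print(build_three_shot_prompt("A farmer has 15 cows and sells 6. How many are left?"))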

Evaluation Metrics

The framework uses various metrics depending on the benchmark:

  • Accuracy: Used for:

    • MMLU
    • ARC
    • HellaSwag
    • Winogrande
  • Exact Match: Used for:

    • DROP (answer formatting is normalized before comparison; see the sketch after this list)
  • Numerical Accuracy: Used for:

    • GSM8K
  • Pass@1: Used for:

    • HumanEval
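
Because DROP answers can differ in surface form (articles, punctuation, spacing), exact match is computed on normalized strings. The sketch below shows one common normalization scheme; the framework's exact rules may differ.

    import re
    import string

    def normalize(text: str) -> str:
        """Lowercase, strip punctuation and articles, collapse whitespace."""
        text = text.lower()
        text = "".join(ch for ch in text if ch not in string.punctuation)
        text = re.sub(r"\b(a|an|the)\b", " ", text)
        return " ".join(text.split())

    def exact_match(prediction: str, truth: str) -> bool:
        return normalize(prediction) == normalize(truth)

    print(exact_match("The answer is 42.", "answer is 42"))  # True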
