Leaderboard #3

Merged: 11 commits, Jun 24, 2024
1 change: 1 addition & 0 deletions .github/workflows/main.yaml
@@ -13,6 +13,7 @@ jobs:
cache: "pip"
- name: "installation"
run: |
pip install --upgrade pip
pip install -r requirements.txt -r requirements.dev.txt
- name: "black"
run: black . --check --diff --color
92 changes: 80 additions & 12 deletions README.md
@@ -1,11 +1,25 @@
# Parser for tasks from "Sdamgia"
# GOAT

This parser is using scrapy lib.
This project consists of three subprojects:
- a parser for tasks from the Russian USE;
- a script for validating HF models on the GOAT dataset;
- a web app with a leaderboard of models validated on the GOAT dataset.


## Installation
First, install the necessary libraries by running the following command:

```bash
pip install -r requirements.txt
```

## Parser
This parser was used to gather tasks for the GOAT dataset. It is built on the Scrapy library.

Currently, the program parses tests for the Unified State Exam (EGE) and the Basic State Exam (OGE)
from the [sdamgia](https://sdamgia.ru/?redir=1) website.

## Structure
### Structure

The program takes the exam subject, exam type, test ID, and the desired output file
name as command-line arguments. The parsing result is stored in a JSONL file.
@@ -15,16 +29,22 @@
Additionally, in the *goat* folder, there is a script called **dataset_demonstration.py**.
After you run it (instructions on how to run it are provided below), it will display one task of each type
from the parsed test in the console.
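
Conceptually, the script might look something like the sketch below (the `-f` flag matches the usage shown later; the `task_type` field name is an assumption about the JSONL schema, not the project's actual code):

```python
import argparse
import json


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("-f", "--file", required=True, help="parser output .jsonl file")
    args = parser.parse_args()

    shown_types = set()
    with open(args.file, encoding="utf-8") as f:
        for line in f:
            task = json.loads(line)
            # "task_type" is an assumed field name for the task category.
            task_type = task.get("task_type")
            if task_type not in shown_types:
                shown_types.add(task_type)
                print(task)


if __name__ == "__main__":
    main()
```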

## Usage

First, you need to install the necessary libraries. To do this, run the following command from the root folder:
### Usage
First, install the parser dependencies by running the following command:

`pip install -r requirements.txt`
```bash
pip install -e ".[parser]"
```

To run the parser, navigate to the goat directory
and run the following command in the console:
To run the parser, run the following command from the goat/parser directory:

`scrapy crawl sdamgia -a subject='your exam subject' -a exam_type='your exam type' -a test_id='your test id' -O <output file>`
```bash
scrapy crawl sdamgia \
-a subject='your exam subject' \
-a exam_type='your exam type' \
-a test_id='your test id' \
-O <output file>
```

*your exam subject* indicates which subject the exam is in. Currently acceptable subject values are 'soc' and 'lit'.

@@ -34,8 +54,56 @@

*output file* is the name of the file that the parser will create or overwrite with the parsing output, for example ege_data.jsonl.
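
For orientation, here is a rough sketch of how a Scrapy spider receives these `-a` arguments (the spider name `sdamgia` matches the crawl command above; the URL layout and parsing logic are assumptions, not the project's actual code):

```python
import scrapy


class SdamgiaSpider(scrapy.Spider):
    name = "sdamgia"  # the name used in `scrapy crawl sdamgia`

    def __init__(self, subject=None, exam_type=None, test_id=None, **kwargs):
        # Arguments passed with `-a key=value` arrive as constructor keyword arguments.
        super().__init__(**kwargs)
        # Hypothetical URL layout; the real spider may build its URLs differently.
        self.start_urls = [f"https://{subject}-{exam_type}.sdamgia.ru/test?id={test_id}"]

    def parse(self, response):
        # Each yielded item becomes one line of the -O output file.
        for problem in response.css("div.problem_container"):
            yield {"task": problem.css("::text").getall()}
```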

To run the dataset_demonstration.py script, execute the following command in the root directory:
To run the dataset_demonstration.py script, execute the following command from the root directory:

`python .\goat\dataset_demonstration.py -f <parser output file name>`
`python goat/parser/dataset_demonstration.py -f <parser output file name>`

where *parser output file name* is the name of the JSONL file that the parser has generated.

## Leaderboard frontend

### Structure
The leaderboard follows a structure similar to the one used by the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
It is a Gradio web app hosted in a Hugging Face Space. Database connection settings are stored in environment variables.
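
For illustration, a minimal sketch of such environment-based configuration (the variable names `DB_HOST`, `DB_PORT`, `DB_NAME`, `DB_USER`, and `DB_PASSWORD` are assumptions; the names actually expected by `DatabaseHelper` may differ):

```python
import os

import psycopg2


def connect_from_env():
    # Hypothetical variable names; the real DatabaseHelper may expect different ones.
    return psycopg2.connect(
        host=os.environ["DB_HOST"],
        port=int(os.environ.get("DB_PORT", 5432)),
        dbname=os.environ["DB_NAME"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
    )
```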

### Usage
First, install the frontend dependencies by running the following command:

```bash
pip install -e ".[frontend]"
```

In this app you can submit a model validation request to the backend database; once validation
completes, the results for your model will appear in the leaderboard after reloading the app.
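
The submission path itself is not part of this diff (it lives in `goat/utils/database_helper.py`), but conceptually `add_eval_request` boils down to an insert along these lines (a hedged sketch assuming psycopg2; the real names and signature may differ):

```python
def add_eval_request(conn, model_name: str, precision: str) -> str:
    # Inserting a row into eval_requests fires the trigger from bd_init_script.sql,
    # which notifies the backend via pg_notify on the "id" channel.
    with conn.cursor() as cur:
        cur.execute(
            'INSERT INTO eval_requests (model_name, "precision") VALUES (%s, %s)',
            (model_name, precision),
        )
    conn.commit()
    return f"Validation request for {model_name} submitted."
```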

To run the leaderboard web app, execute this command from the root directory
(assuming you have set all environment variables needed for the database connection):

`python -m goat.frontend.app`

## Leaderboard backend

### Structure
After receiving a new validation request, the leaderboard backend validates the requested
model on the GOAT dataset using the modified
[LM Evaluation Harness benchmark](https://github.com/deepvk/lm-evaluation-harness/tree/goat) and [FastChat LLM-as-judge benchmark](https://github.com/deepvk/FastChat/tree/goat/fastchat/llm_judge) from the deepvk repositories.
After validation finishes, it adds the resulting scores to the leaderboard.

### Usage
First, install the backend dependencies by running the following commands:

```bash
pip install -e ".[backend]"
pip install -U wheel
pip install flash-attn==2.5.8
```

To run the leaderboard backend, execute this command from the root directory
(assuming you have set all environment variables needed for the database connection):

`python -m goat.backend.eval`

Once running, the script listens for new validation requests in the database.
When a request arrives, it validates the requested model on the GOAT dataset.
When validation completes, it adds the results to the leaderboard table in the database.
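
The listening mechanism presumably builds on the `pg_notify` trigger defined in `goat/database/bd_init_script.sql`. Below is a minimal sketch of such a loop with psycopg2; it is an assumption about how `DatabaseHelper.listen_to_new_requests` might work, not its actual implementation:

```python
import select

import psycopg2.extensions


def listen_for_requests(conn, handle_request) -> None:
    # The trigger in bd_init_script.sql publishes new eval_requests ids on the "id" channel.
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    with conn.cursor() as cur:
        cur.execute("LISTEN id;")
    while True:
        # Wait until the connection has something to read, then drain the notifications.
        if select.select([conn], [], [], 60) == ([], [], []):
            continue  # timed out, keep waiting
        conn.poll()
        while conn.notifies:
            note = conn.notifies.pop(0)
            with conn.cursor() as cur:
                cur.execute(
                    'SELECT model_name, "precision" FROM eval_requests WHERE id = %s',
                    (int(note.payload),),
                )
                row = cur.fetchone()
            if row is not None:
                handle_request(*row)  # e.g. eval(model_name, precision)
```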
Empty file added goat/backend/__init__.py
Empty file.
77 changes: 77 additions & 0 deletions goat/backend/add_results.py
@@ -0,0 +1,77 @@
import json

from datasets import get_dataset_config_names, load_dataset

from goat.utils.database_helper import DatabaseHelper, EvalResult


def get_datasets_len(tasks: list[str]) -> dict[str, int]:
    # Per-task sizes plus aggregate sizes for each task family.
    datasets_len = {"single_choice": 0, "multiple_choice": 0, "word_gen": 0}

for task in tasks:
dataset = load_dataset("deepvk/goat", task, split="test")
datasets_len[task] = len(dataset)
if "single_choice" in task:
datasets_len["single_choice"] += len(dataset)
elif "multiple_choice" in task:
datasets_len["multiple_choice"] += len(dataset)
elif "word_gen" in task:
datasets_len["word_gen"] += len(dataset)
return datasets_len


def get_metrics_values(
tasks: list[str], evaluation: dict[str, dict], datasets_len: dict[str, int]
) -> tuple[float, float, float]:
metrics = [
"multi_choice_em_unordered,get-answer",
"word_in_set,none",
"acc,none",
]

single_choice_score = 0.0
multiple_choice_score = 0.0
word_gen_score = 0.0

for task in tasks:
for metric in metrics:
if metric in evaluation[task].keys():
if "single_choice" in task:
single_choice_score += datasets_len[task] * evaluation[task][metric]
elif "multiple_choice" in task:
multiple_choice_score += datasets_len[task] * evaluation[task][metric]
elif "word_gen" in task:
word_gen_score += datasets_len[task] * evaluation[task][metric]
break

single_choice_score /= datasets_len["single_choice"]
multiple_choice_score /= datasets_len["multiple_choice"]
word_gen_score /= datasets_len["word_gen"]

return single_choice_score, multiple_choice_score, word_gen_score


def add_results(input_path: str) -> None:
with open(input_path, "r") as j:
contents = json.loads(j.read())
evaluation = contents["results"]
tasks = get_dataset_config_names("deepvk/goat")

datasets_len = get_datasets_len(tasks)
single_choice_score, multiple_choice_score, word_gen_score = get_metrics_values(tasks, evaluation, datasets_len)

model_name = contents["config"]["model"]

eval_result = EvalResult(
model=model_name,
single_choice=single_choice_score,
multiple_choice=multiple_choice_score,
word_gen=word_gen_score,
)

db = DatabaseHelper()
db.add_eval_result(eval_result)
db.end_connection()
826 changes: 826 additions & 0 deletions goat/backend/data/question.jsonl

Large diffs are not rendered by default.

49 changes: 49 additions & 0 deletions goat/backend/eval.py
@@ -0,0 +1,49 @@
import json
import os
from pathlib import Path

from fastchat.llm_judge.gen_model_answer import run_eval
from fastchat.utils import str_to_torch_dtype
from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM

from goat.backend.add_results import add_results
from goat.utils.database_helper import DatabaseHelper


def eval(model_name: str, precision: str) -> None:
    # Score the model on the GOAT task suite with the LM Evaluation Harness.
    lm = HFLM(pretrained=model_name, dtype=precision)
taskname = "goat"
results = evaluator.simple_evaluate(model=lm, tasks=[taskname])

model_id = model_name.replace("/", "__")
Path(f"goat/backend/results/{model_id}").mkdir(exist_ok=True, parents=True)
lm_eval_output_file = f"goat/backend/results/{model_id}/{model_id + '_lm_eval'}.json"
with open(lm_eval_output_file, "w", encoding="utf-8") as f:
json.dump(results, f, ensure_ascii=False)

fastchat_filename = os.path.join(f"goat/backend/results/{model_id}", model_id + "_fastchat.jsonl")
question_file = "goat/backend/data/question.jsonl"

    # Generate model answers for the FastChat LLM-as-judge benchmark.
    run_eval(
model_path=model_name,
model_id=model_id,
answer_file=fastchat_filename,
question_file=question_file,
question_begin=None,
question_end=None,
max_new_token=1024,
num_choices=1,
num_gpus_per_model=1,
num_gpus_total=1,
max_gpu_memory=None,
dtype=str_to_torch_dtype(precision),
revision="main",
)

    # Aggregate the lm-eval scores and write them to the leaderboard table.
    add_results(input_path=lm_eval_output_file)


if __name__ == "__main__":
    # Block and run `eval` for every new validation request that appears in the database.
    db_helper = DatabaseHelper()
    db_helper.listen_to_new_requests(eval)
38 changes: 38 additions & 0 deletions goat/database/bd_init_script.sql
@@ -0,0 +1,38 @@
create table if not exists public.leaderboard
(
model varchar not null
primary key,
single_choice double precision,
multiple_choice double precision,
word_gen double precision
);

alter table public.leaderboard
owner to habrpguser;

create table if not exists public.eval_requests
(
id serial
constraint eval_requests_pk
primary key,
model_name varchar not null,
precision varchar not null
);

alter table public.eval_requests
owner to habrpguser;

create or replace function notify_id_trigger()
returns trigger as $$
begin
perform pg_notify('id'::text, NEW."id"::text);
return new;
end;
$$ language plpgsql;

create trigger trigger1
after insert or update on public."eval_requests"
for each row execute procedure notify_id_trigger();
Empty file added goat/frontend/__init__.py
Empty file.
52 changes: 52 additions & 0 deletions goat/frontend/app.py
@@ -0,0 +1,52 @@
import gradio as gr

from goat.frontend.precision import Precision

from ..utils.database_helper import DatabaseHelper

TITLE = """<h1 style="text-align:left;float:left; id="space-title">🤗 GOAT Leaderboard</h1>"""
INTRODUCTION_TEXT = """
Generalized Occupational Aptitude Test (GOAT) Leaderboard. For project details, refer to the <a href="https://github.com/deepvk/goat" target="_blank" style="text-decoration: underline">GOAT GitHub repository</a>.
"""


db_helper = DatabaseHelper()

leaderboard_df = db_helper.get_leaderboard_df()


demo = gr.Blocks(css="src_display.css")
with demo:
gr.HTML(TITLE)
gr.Markdown(INTRODUCTION_TEXT, elem_classes="markdown-text")

with gr.Tabs(elem_classes="tab-buttons") as tabs:
with gr.TabItem("🏅 LLM Benchmark", elem_id="llm-benchmark-tab-table", id=1):
leaderboard_table = gr.components.Dataframe(
value=leaderboard_df,
headers=["Model", "GOAT"],
interactive=False,
)
with gr.TabItem("🚀 Submit ", elem_id="llm-benchmark-tab-table", id=5):
with gr.Row():
gr.Markdown("# ✉️✨ Submit your model here!", elem_classes="markdown-text")

with gr.Row():
with gr.Column():
model_name = gr.Textbox(label="Model name on HF")
model_precision = gr.Dropdown(
choices=[i.value for i in Precision if i != Precision.Unknown],
label="Precision",
multiselect=False,
value="float16",
interactive=True,
)
submit_button = gr.Button("Submit Eval")
submission_result = gr.Markdown()
submit_button.click(
db_helper.add_eval_request,
[model_name, model_precision],
submission_result,
)

demo.queue(default_concurrency_limit=40).launch()
18 changes: 18 additions & 0 deletions goat/frontend/precision.py
@@ -0,0 +1,18 @@
from enum import Enum


class Precision(Enum):
float16 = "float16"
bfloat16 = "bfloat16"
float32 = "float32"
Unknown = "?"


def from_str(precision: str) -> Precision:
if precision in ["torch.float16", "float16"]:
return Precision.float16
if precision in ["torch.float32", "float32"]:
return Precision.float32
if precision in ["torch.bfloat16", "bfloat16"]:
return Precision.bfloat16
return Precision.Unknown