Leaderboard #3

Merged: 11 commits, Jun 24, 2024
1 change: 1 addition & 0 deletions .github/workflows/main.yaml
@@ -13,6 +13,7 @@ jobs:
cache: "pip"
- name: "installation"
run: |
pip install --upgrade pip
pip install -r requirements.txt -r requirements.dev.txt
- name: "black"
run: black . --check --diff --color
92 changes: 80 additions & 12 deletions README.md
@@ -1,11 +1,25 @@
# Parser for tasks from "Sdamgia"
# GOAT

This parser is using scrapy lib.
This project consists of three subprojects:
- a parser for tasks from the Russian USE;
- a script for validating HF models on the GOAT dataset;
- a web app with a leaderboard of models validated on the GOAT dataset.


## Installation
First, install the necessary libraries by running the following command:

```bash
pip install -r requirements.txt
```

## Parser
This parser was used to gather tasks for the GOAT dataset. It is built on the Scrapy library.

Currently, the program parses tests for the Unified State Exam (EGE) and the Basic State Exam (OGE)
from the [sdamgia](https://sdamgia.ru/?redir=1) website.

## Structure
### Structure

The program takes the exam subject, exam type, test ID, and the desired output file
name as command-line arguments. The parsing result is stored in a JSONL file.
@@ -15,16 +29,22 @@
Additionally, in the *goat* folder, there is a script called **dataset_demonstration.py**.
After you run it (instructions on how to run it are provided below), it will display one task of each type
from the parsed test in the console.
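
Conceptually, the script might look something like the sketch below (the `-f` flag matches the usage shown later; the `task_type` field name is an assumption about the JSONL schema, not the project's actual code):

```python
import argparse
import json


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("-f", "--file", required=True, help="parser output .jsonl file")
    args = parser.parse_args()

    shown_types = set()
    with open(args.file, encoding="utf-8") as f:
        for line in f:
            task = json.loads(line)
            # "task_type" is an assumed field name for the task category.
            task_type = task.get("task_type")
            if task_type not in shown_types:
                shown_types.add(task_type)
                print(task)


if __name__ == "__main__":
    main()
```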

## Usage

First, you need to install the necessary libraries. To do this, run the following command from the root folder:
### Usage
First, install the parser dependencies by running the following command:

`pip install -r requirements.txt`
```bash
pip install -e ".[parser]"
```

To run the parser, navigate to the goat directory
and run the following command in the console:
To run the parser, run the following command from the goat/parser directory:

`scrapy crawl sdamgia -a subject='your exam subject' -a exam_type='your exam type' -a test_id='your test id' -O <output file>`
```bash
scrapy crawl sdamgia \
-a subject='your exam subject' \
-a exam_type='your exam type' \
-a test_id='your test id' \
-O <output file>
```

*your exam subject* indicates which subject the exam is in. Currently acceptable subject values are 'soc' and 'lit'.

@@ -34,8 +54,56 @@

*output file* is the name of the file that the parser will create or overwrite with the parsing output, for example ege_data.jsonl.
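
For orientation, here is a rough sketch of how a Scrapy spider receives these `-a` arguments (the spider name `sdamgia` matches the crawl command above; the URL layout and parsing logic are assumptions, not the project's actual code):

```python
import scrapy


class SdamgiaSpider(scrapy.Spider):
    name = "sdamgia"  # the name used in `scrapy crawl sdamgia`

    def __init__(self, subject=None, exam_type=None, test_id=None, **kwargs):
        # Arguments passed with `-a key=value` arrive as constructor keyword arguments.
        super().__init__(**kwargs)
        # Hypothetical URL layout; the real spider may build its URLs differently.
        self.start_urls = [f"https://{subject}-{exam_type}.sdamgia.ru/test?id={test_id}"]

    def parse(self, response):
        # Each yielded item becomes one line of the -O output file.
        for problem in response.css("div.problem_container"):
            yield {"task": problem.css("::text").getall()}
```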

To run the dataset_demonstration.py script, execute the following command in the root directory:
To run the dataset_demonstration.py script, execute the following command from the root directory:

`python .\goat\dataset_demonstration.py -f <parser output file name>`
`python goat/parser/dataset_demonstration.py -f <parser output file name>`

where *parser output file name* is the name of the JSONL file that the parser has generated.

## Leaderboard frontend

### Structure
The leaderboard follows a structure similar to the one used by the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
It is a Gradio web app hosted in a Hugging Face Space. Database connection settings are stored in environment variables.
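
For illustration, a minimal sketch of such environment-based configuration (the variable names `DB_HOST`, `DB_PORT`, `DB_NAME`, `DB_USER`, and `DB_PASSWORD` are assumptions; the names actually expected by `DatabaseHelper` may differ):

```python
import os

import psycopg2


def connect_from_env():
    # Hypothetical variable names; the real DatabaseHelper may expect different ones.
    return psycopg2.connect(
        host=os.environ["DB_HOST"],
        port=int(os.environ.get("DB_PORT", 5432)),
        dbname=os.environ["DB_NAME"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
    )
```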

### Usage
First, install the frontend dependencies by running the following command:

```bash
pip install -e ".[frontend]"
```

In this app you can submit a model validation request to the backend database; once validation
completes, the results for your model will appear in the leaderboard after reloading the app.
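
The submission path itself is not part of this diff (it lives in `goat/utils/database_helper.py`), but conceptually `add_eval_request` boils down to an insert along these lines (a hedged sketch assuming psycopg2; the real names and signature may differ):

```python
def add_eval_request(conn, model_name: str, precision: str) -> str:
    # Inserting a row into eval_requests fires the trigger from bd_init_script.sql,
    # which notifies the backend via pg_notify on the "id" channel.
    with conn.cursor() as cur:
        cur.execute(
            'INSERT INTO eval_requests (model_name, "precision") VALUES (%s, %s)',
            (model_name, precision),
        )
    conn.commit()
    return f"Validation request for {model_name} submitted."
```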

To run the leaderboard web app, execute this command from the root directory
(assuming you have set all environment variables needed for the database connection):

`python -m goat.frontend.app`

## Leaderboard backend

### Structure
After receiving a new validation request, the leaderboard backend validates the requested
model on the GOAT dataset using the modified
[LM Evaluation Harness benchmark](https://github.com/deepvk/lm-evaluation-harness/tree/goat) and [FastChat LLM-as-judge benchmark](https://github.com/deepvk/FastChat/tree/goat/fastchat/llm_judge) from the deepvk repositories.
After validation finishes, it adds the resulting scores to the leaderboard.

### Usage
First, install the backend dependencies by running the following commands:

```bash
pip install -e ".[backend]"
pip install -U wheel
pip install flash-attn==2.5.8
```

To run the leaderboard backend, execute this command from the root directory
(assuming you have set all environment variables needed for the database connection):

`python -m goat.backend.eval`

Once running, the script listens for new validation requests in the database.
When a request arrives, it validates the requested model on the GOAT dataset.
When validation completes, it adds the results to the leaderboard table in the database.
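
The listening mechanism presumably builds on the `pg_notify` trigger defined in `goat/database/bd_init_script.sql`. Below is a minimal sketch of such a loop with psycopg2; it is an assumption about how `DatabaseHelper.listen_to_new_requests` might work, not its actual implementation:

```python
import select

import psycopg2.extensions


def listen_for_requests(conn, handle_request) -> None:
    # The trigger in bd_init_script.sql publishes new eval_requests ids on the "id" channel.
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    with conn.cursor() as cur:
        cur.execute("LISTEN id;")
    while True:
        # Wait until the connection has something to read, then drain the notifications.
        if select.select([conn], [], [], 60) == ([], [], []):
            continue  # timed out, keep waiting
        conn.poll()
        while conn.notifies:
            note = conn.notifies.pop(0)
            with conn.cursor() as cur:
                cur.execute(
                    'SELECT model_name, "precision" FROM eval_requests WHERE id = %s',
                    (int(note.payload),),
                )
                row = cur.fetchone()
            if row is not None:
                handle_request(*row)  # e.g. eval(model_name, precision)
```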
Empty file added goat/backend/__init__.py
Empty file.
77 changes: 77 additions & 0 deletions goat/backend/add_results.py
@@ -0,0 +1,77 @@
import json

from datasets import get_dataset_config_names, load_dataset

from goat.utils.database_helper import DatabaseHelper, EvalResult


def get_datasets_len(tasks: list[str]) -> dict[str, int]:
    # Per-task sizes plus aggregate sizes for each task family.
    datasets_len = {"single_choice": 0, "multiple_choice": 0, "word_gen": 0}

for task in tasks:
dataset = load_dataset("deepvk/goat", task, split="test")
datasets_len[task] = len(dataset)
if "single_choice" in task:
datasets_len["single_choice"] += len(dataset)
elif "multiple_choice" in task:
datasets_len["multiple_choice"] += len(dataset)
elif "word_gen" in task:
datasets_len["word_gen"] += len(dataset)
return datasets_len


def get_metrics_values(
tasks: list[str], evaluation: dict[str, dict], datasets_len: dict[str, int]
) -> tuple[float, float, float]:
metrics = [
"multi_choice_em_unordered,get-answer",
"word_in_set,none",
"acc,none",
]

single_choice_score = 0.0
multiple_choice_score = 0.0
word_gen_score = 0.0

for task in tasks:
for metric in metrics:
if metric in evaluation[task].keys():
if "single_choice" in task:
single_choice_score += datasets_len[task] * evaluation[task][metric]
elif "multiple_choice" in task:
multiple_choice_score += datasets_len[task] * evaluation[task][metric]
elif "word_gen" in task:
word_gen_score += datasets_len[task] * evaluation[task][metric]
break

single_choice_score /= datasets_len["single_choice"]
multiple_choice_score /= datasets_len["multiple_choice"]
word_gen_score /= datasets_len["word_gen"]

return single_choice_score, multiple_choice_score, word_gen_score


def add_results(input_path: str) -> None:
with open(input_path, "r") as j:
contents = json.loads(j.read())
evaluation = contents["results"]
tasks = get_dataset_config_names("deepvk/goat")

datasets_len = get_datasets_len(tasks)
single_choice_score, multiple_choice_score, word_gen_score = get_metrics_values(tasks, evaluation, datasets_len)

model_name = contents["config"]["model"]

eval_result = EvalResult(
model=model_name,
single_choice=single_choice_score,
multiple_choice=multiple_choice_score,
word_gen=word_gen_score,
)

db = DatabaseHelper()
db.add_eval_result(eval_result)
db.end_connection()
826 changes: 826 additions & 0 deletions goat/backend/data/question.jsonl

Large diffs are not rendered by default.

49 changes: 49 additions & 0 deletions goat/backend/eval.py
@@ -0,0 +1,49 @@
import json
import os
from pathlib import Path

from fastchat.llm_judge.gen_model_answer import run_eval
from fastchat.utils import str_to_torch_dtype
from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM

from goat.backend.add_results import add_results
from goat.utils.database_helper import DatabaseHelper


def eval(model_name: str, precision: str) -> None:
    # Score the model on the GOAT task suite with the LM Evaluation Harness.
    lm = HFLM(pretrained=model_name, dtype=precision)
taskname = "goat"
results = evaluator.simple_evaluate(model=lm, tasks=[taskname])

model_id = model_name.replace("/", "__")
Path(f"goat/backend/results/{model_id}").mkdir(exist_ok=True, parents=True)
lm_eval_output_file = f"goat/backend/results/{model_id}/{model_id + '_lm_eval'}.json"
with open(lm_eval_output_file, "w", encoding="utf-8") as f:
json.dump(results, f, ensure_ascii=False)

fastchat_filename = os.path.join(f"goat/backend/results/{model_id}", model_id + "_fastchat.jsonl")
question_file = "goat/backend/data/question.jsonl"

    # Generate model answers for the FastChat LLM-as-judge benchmark.
    run_eval(
model_path=model_name,
model_id=model_id,
answer_file=fastchat_filename,
question_file=question_file,
question_begin=None,
question_end=None,
max_new_token=1024,
num_choices=1,
num_gpus_per_model=1,
num_gpus_total=1,
max_gpu_memory=None,
dtype=str_to_torch_dtype(precision),
revision="main",
)

    # Aggregate the lm-eval scores and write them to the leaderboard table.
    add_results(input_path=lm_eval_output_file)


if __name__ == "__main__":
    # Block and run `eval` for every new validation request that appears in the database.
    db_helper = DatabaseHelper()
    db_helper.listen_to_new_requests(eval)
38 changes: 38 additions & 0 deletions goat/database/bd_init_script.sql
@@ -0,0 +1,38 @@
create table if not exists public.leaderboard
(
model varchar not null
primary key,
single_choice double precision,
multiple_choice double precision,
word_gen double precision
);

alter table public.leaderboard
owner to habrpguser;

create table if not exists public.eval_requests
(
id serial
constraint eval_requests_pk
primary key,
model_name varchar not null,
precision varchar not null
);

alter table public.eval_requests
owner to habrpguser;

create or replace function notify_id_trigger()
returns trigger as $$
begin
perform pg_notify('id'::text, NEW."id"::text);
return new;
end;
$$ language plpgsql;

create trigger trigger1
after insert or update on public."eval_requests"
for each row execute procedure notify_id_trigger();
Empty file added goat/frontend/__init__.py
Empty file.
52 changes: 52 additions & 0 deletions goat/frontend/app.py
@@ -0,0 +1,52 @@
import gradio as gr

from goat.frontend.precision import Precision

from ..utils.database_helper import DatabaseHelper

TITLE = """<h1 style="text-align:left;float:left; id="space-title">🤗 GOAT Leaderboard</h1>"""
INTRODUCTION_TEXT = """
Generalized Occupational Aptitude Test (GOAT) Leaderboard. For project details, refer to the <a href="https://github.com/deepvk/goat" target="_blank" style="text-decoration: underline">GOAT GitHub repository</a>.
"""


db_helper = DatabaseHelper()

leaderboard_df = db_helper.get_leaderboard_df()


demo = gr.Blocks(css="src_display.css")
with demo:
gr.HTML(TITLE)
gr.Markdown(INTRODUCTION_TEXT, elem_classes="markdown-text")

with gr.Tabs(elem_classes="tab-buttons") as tabs:
with gr.TabItem("🏅 LLM Benchmark", elem_id="llm-benchmark-tab-table", id=1):
leaderboard_table = gr.components.Dataframe(
value=leaderboard_df,
headers=["Model", "GOAT"],
interactive=False,
)
with gr.TabItem("🚀 Submit ", elem_id="llm-benchmark-tab-table", id=5):
with gr.Row():
gr.Markdown("# ✉️✨ Submit your model here!", elem_classes="markdown-text")

with gr.Row():
with gr.Column():
model_name = gr.Textbox(label="Model name on HF")
model_precision = gr.Dropdown(
choices=[i.value for i in Precision if i != Precision.Unknown],
label="Precision",
multiselect=False,
value="float16",
interactive=True,
)
submit_button = gr.Button("Submit Eval")
submission_result = gr.Markdown()
submit_button.click(
db_helper.add_eval_request,
[model_name, model_precision],
submission_result,
)

demo.queue(default_concurrency_limit=40).launch()
18 changes: 18 additions & 0 deletions goat/frontend/precision.py
@@ -0,0 +1,18 @@
from enum import Enum


class Precision(Enum):
float16 = "float16"
bfloat16 = "bfloat16"
float32 = "float32"
Unknown = "?"


def from_str(precision: str) -> Precision:
if precision in ["torch.float16", "float16"]:
return Precision.float16
if precision in ["torch.float32", "float32"]:
return Precision.float32
if precision in ["torch.bfloat16", "bfloat16"]:
return Precision.bfloat16
return Precision.Unknown