KNOW-Based Job Recommendation Algorithm Competition

Introduction

This document describes the code behind the Private 2nd place solution for the KNOW-Based Job Recommendation Algorithm Competition and how to reproduce its score.

Prerequisites

This code uses separate virtual environments because the Korean embedding models described below require conflicting library versions.

  • base
    • numpy
    • omegaconf
    • pandas
    • pytorch_lightning
    • scikit_learn
    • torch==1.10.1
    • transformers
    • wandb
  • transformers_4_15_0
    • numpy
    • pandas
    • pytorch_lightning
    • scikit_learn
    • torch==1.10.1
    • transformers==4.15.0
    • tqdm
  • transformers_2_8_0
    • boto3
    • gluonnlp >= 0.6.0
    • mxnet >= 1.4.0
    • onnxruntime == 1.8.0
    • sentencepiece >= 0.1.6
    • torch >= 1.7.0
    • transformers == 2.8.0
    • tqdm

The transformers_4_15_0 and transformers_2_8_0 environments are explained in more detail in the Preprocessing Dataset section below. Preprocessed dataset files are included with the project, and if you use them you can skip setting up these two environments.

This project includes utility scripts that run a series of tasks while switching virtual environments through conda. Using conda is recommended so that these scripts work without modification.

Preprocessing Dataset

To process the natural-language fields in the survey data effectively, this project uses Korean embedding models.

  • SimCSE (https://github.com/BM-K/KoSimCSE-SKT)
  • Averaged BERT Input Embeddings

To generate sentence embeddings with the Korean KoSimCSE model, set up the environment with the following commands.

$ conda activate transformers_2_8_0
$ git clone https://github.com/BM-K/KoSimCSE.git
$ cd KoSimCSE
$ git clone https://github.com/SKTBrain/KoBERT.git
$ cd KoBERT
$ pip install -r requirements.txt
$ pip install .
$ cd ..
$ pip install -r requirements.txt

Once the commands have finished, install the libraries listed in the Prerequisites section. The transformers library must be version 2.8.0. If the installed version does not match, uninstalling the library and installing it again is recommended.
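The version requirement above can be checked from inside the transformers_2_8_0 environment with a short script. This is a minimal sketch; the helper names `installed_transformers_version` and `version_matches` are illustrative and not part of the project.

```python
REQUIRED_VERSION = "2.8.0"  # version required in the transformers_2_8_0 environment


def installed_transformers_version():
    """Return the installed transformers version, or None if it is not installed."""
    try:
        import transformers
    except ImportError:
        return None
    return transformers.__version__


def version_matches(installed, required=REQUIRED_VERSION):
    """True only when the installed version string equals the required one exactly."""
    return installed == required


if __name__ == "__main__":
    found = installed_transformers_version()
    if version_matches(found):
        print("transformers", found, "is installed; nothing to do.")
    else:
        print("Found", found, "but", REQUIRED_VERSION, "is required; reinstall transformers.")
```

Run it with the environment activated. A mismatch means the library should be removed and reinstalled as described above.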

Because the required transformers versions differ, KoBERT and KoSimCSE do not run correctly as cloned. Change line 28 of KoSimCSE/KoBERT/kobert/pytorch_bert.py from

bertmodel = BertModel.from_pretrained(model_path, return_dict=False)

to

bertmodel = BertModel.from_pretrained(model_path)

This removes the return_dict argument, which was added to transformers after version 2.8.0.
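The one-line change to pytorch_bert.py can also be applied with a small script instead of a manual edit. The sketch below assumes the call in the cloned file is written exactly as `BertModel.from_pretrained(model_path, return_dict=False)`; the function names are hypothetical, and the script touches any line containing that keyword argument rather than line 28 specifically.

```python
def drop_return_dict(line):
    """Remove the `return_dict=False` keyword argument from one source line."""
    return line.replace(", return_dict=False", "")


def patch_file(path):
    """Rewrite `path` in place and return how many lines were changed."""
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    patched = [drop_return_dict(line) for line in lines]
    changed = sum(1 for old, new in zip(lines, patched) if old != new)
    if changed:
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(patched)
    return changed
```

Calling `patch_file("KoSimCSE/KoBERT/kobert/pytorch_bert.py")` from the directory that contains the KoSimCSE folder should report one changed line. A result of 0 suggests the file differs from what this sketch assumes and should be edited by hand.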

Next, download the pretrained model by following the README of that repository. The downloaded file (nli_checkpoint.pt) must be placed inside the KoSimCSE folder.

Now download the competition dataset. Move both the downloaded files and the KoSimCSE folder under the res folder in the project directory, then run the following command.

$ bash utilities/preprocess.sh

If an environment error occurs, edit the command on the first line of utilities/preprocess.sh,

source ~/anaconda3/etc/profile.d/conda.sh

so that it points to the Anaconda installation path on your system.
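If the location of conda.sh on your system is unclear, conda can report its own base directory through `conda info --base`. The following sketch builds the path to use in the `source` line from that output; the function names are illustrative.

```python
import os
import subprocess


def conda_sh_path(conda_base):
    """Build the path of conda.sh under a given conda installation directory."""
    return os.path.join(conda_base, "etc", "profile.d", "conda.sh")


def find_conda_base():
    """Ask conda for its base directory; return None if conda cannot be run."""
    try:
        result = subprocess.run(
            ["conda", "info", "--base"],
            stdout=subprocess.PIPE,
            universal_newlines=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    base = find_conda_base()
    if base is None:
        print("conda could not be run; check that it is on PATH")
    else:
        # Print the line to place at the top of utilities/preprocess.sh.
        print("source", conda_sh_path(base))
```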

Once the script has finished, check that the following files were created. Moving all of them into the res folder completes dataset preprocessing.

  • KNOW_2017.pkl
  • KNOW_2018.pkl
  • KNOW_2019.pkl
  • KNOW_2020.pkl
  • KNOW_2017_test.pkl
  • KNOW_2018_test.pkl
  • KNOW_2019_test.pkl
  • KNOW_2020_test.pkl
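The check for the eight files listed above can be automated. This is a minimal sketch that only tests for file existence and does not validate file contents; the function names are illustrative.

```python
import os

YEARS = (2017, 2018, 2019, 2020)


def expected_files():
    """Names of the preprocessed train and test files listed above."""
    names = ["KNOW_{}.pkl".format(year) for year in YEARS]
    names += ["KNOW_{}_test.pkl".format(year) for year in YEARS]
    return names


def missing_files(directory):
    """Return the expected file names that are not present in `directory`."""
    return [
        name for name in expected_files()
        if not os.path.isfile(os.path.join(directory, name))
    ]


if __name__ == "__main__":
    missing = missing_files("res")
    if missing:
        print("Missing from res:", ", ".join(missing))
    else:
        print("All preprocessed files are in place.")
```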

Alternatively, you can use the attached, already preprocessed versions of the files above. They were generated by following exactly the process described here.

Train the Models

To train a model, run the following command.

$ python src/train.py config/sid-512d-1tb-1ab-18l.yaml data.filename=res/KNOW_2017.pkl data.fold_index=0 data.num_folds=5 train.random_seed=42

To change the survey year, the number of K-fold splits, or the random seed, adjust the corresponding values in the command above. Alternatively, to run 5-fold training on all of the data, use the following command.

$ bash utilities/train.sh data.random_seed=42

To obtain the same results as this project, train with each of the following random seeds. Keep in mind that random number generation differs between GPUs, so your results may still differ.

  • data.random_seed=0
  • data.random_seed=1
  • data.random_seed=2
  • data.random_seed=3
  • data.random_seed=4
  • data.random_seed=20
  • data.random_seed=24
  • data.random_seed=42
  • data.random_seed=777
  • data.random_seed=1111
  • data.random_seed=1234
  • data.random_seed=2022
  • data.random_seed=9876
  • data.random_seed=9999
  • data.random_seed=65535
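Running the 5-fold training command once per seed in the list above is easy to script. The sketch below only builds and prints the commands (a dry run); the function name is illustrative.

```python
# Random seeds listed above.
SEEDS = [0, 1, 2, 3, 4, 20, 24, 42, 777, 1111, 1234, 2022, 9876, 9999, 65535]


def train_commands(seeds=SEEDS):
    """Build one `bash utilities/train.sh` invocation per random seed."""
    return [
        ["bash", "utilities/train.sh", "data.random_seed={}".format(seed)]
        for seed in seeds
    ]


if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    for command in train_commands():
        print(" ".join(command))
```

To execute a command for real, pass it to `subprocess.run(command, check=True)`. Move the resulting weights out of the project root before starting the next seed so that runs do not mix.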

The model weights for each random seed are written to the top level of the project folder. After each seed finishes, move its trained models into a separate new folder (e.g. rs42).

Predict the KNOW Codes

Using the path of a folder that holds the trained models, or the bundled weight files, generate the prediction files for the test dataset as follows.

$ bash utilities/predict-test.sh ./rs42
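The prediction command has to be repeated for every seed folder. The sketch below collects the folders and builds one command per folder, assuming they are named `rs<seed>` after the rs42 example. That naming is an assumption, as are the function names. The pattern deliberately excludes the res data folder.

```python
import os
import re

# Assumed folder naming, following the rs42 example: "rs" followed by digits only.
SEED_FOLDER = re.compile(r"^rs\d+$")


def is_seed_folder(name):
    """True for names such as rs42, False for others such as res."""
    return SEED_FOLDER.match(name) is not None


def predict_commands(project_root="."):
    """Build one predict-test.sh invocation per seed folder under project_root."""
    folders = sorted(
        name for name in os.listdir(project_root)
        if is_seed_folder(name) and os.path.isdir(os.path.join(project_root, name))
    )
    return [["bash", "utilities/predict-test.sh", "./" + name] for name in folders]


if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    for command in predict_commands():
        print(" ".join(command))
```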

After running the command for every random seed, check that the predicted .csv files were created in each folder. Then run the following additional command to combine them into one file.

$ python utilities/create_submission.py **/*.csv --merge --ensemble

Once every step has finished, check that the submission-ensemble.csv file was created. This file is the submission that was used for the competition.