
한국어 | English

KoELECTRA

ELECTRA uses Replaced Token Detection: the generator replaces some input tokens, and the discriminator learns to determine whether each token is an original ("real") token or a replaced ("fake") one. This method trains on all input tokens, which yields results competitive with other pretrained language models such as BERT.
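As a minimal illustration of replaced token detection (the sentence and the corrupted token below are made up for this example), the released discriminator can score each token as original vs. replaced through Transformers' ElectraForPreTraining head:

```python
import torch
from transformers import ElectraForPreTraining, ElectraTokenizer

# Discriminator checkpoint with the replaced-token-detection head on top
model = ElectraForPreTraining.from_pretrained("monologg/koelectra-base-v3-discriminator")
tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")

# A sentence with one token deliberately out of place (illustrative input only)
inputs = tokenizer("나는 오늘 학교에 강아지.", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # one logit per token; higher means "more likely replaced"

print(torch.sigmoid(logits).round())  # 1 = predicted fake, 0 = predicted real
```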

KoELECTRA is trained on 34GB of Korean text, and I'm releasing both KoELECTRA-Base and KoELECTRA-Small.

KoELECTRA also uses WordPiece, and the models are uploaded to S3, so you only need to install the Transformers library and it is ready to use regardless of your OS.

Download Link

| Model | Discriminator | Generator | Tensorflow-v1 |
| ------------------ | ------------- | --------- | ------------- |
| KoELECTRA-Base-v1 | Discriminator | Generator | Tensorflow-v1 |
| KoELECTRA-Small-v1 | Discriminator | Generator | Tensorflow-v1 |
| KoELECTRA-Base-v2 | Discriminator | Generator | Tensorflow-v1 |
| KoELECTRA-Small-v2 | Discriminator | Generator | Tensorflow-v1 |
| KoELECTRA-Base-v3 | Discriminator | Generator | Tensorflow-v1 |
| KoELECTRA-Small-v3 | Discriminator | Generator | Tensorflow-v1 |

About KoELECTRA

| Model | | Layers | Embedding Size | Hidden Size | # heads |
| --------------- | ------------- | ------ | -------------- | ----------- | ------- |
| KoELECTRA-Base | Discriminator | 12 | 768 | 768 | 12 |
| | Generator | 12 | 768 | 256 | 4 |
| KoELECTRA-Small | Discriminator | 12 | 128 | 256 | 4 |
| | Generator | 12 | 128 | 256 | 4 |

Vocabulary

  • The main purpose of this project is to make the models immediately usable with the Transformers library, so instead of SentencePiece and Mecab, the WordPiece tokenizer used in the original ELECTRA paper and code was adopted.
  • For more detail, see [Wordpiece Vocabulary]. (A quick check of these settings is sketched after the table below.)

| Vocab | Len | do_lower_case |
| ----- | ----- | ------------- |
| v1 | 32200 | False |
| v2 | 32200 | False |
| v3 | 35000 | False |
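A quick way to verify these settings from the released tokenizers (assuming ElectraTokenizer exposes do_lower_case the same way BertTokenizer does):

```python
from transformers import ElectraTokenizer

# Load the v3 tokenizer and confirm the values in the table above
tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")
print(tokenizer.vocab_size)     # 35000 for v3
print(tokenizer.do_lower_case)  # False; Korean text is not lowercased
```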

Data

  • For v1 and v2, a 14G corpus (2.6B tokens) was used (News, Wiki, Namu Wiki).
  • For v3, a 20G corpus from Everyone's Corpus (newspaper, written, spoken, messenger, web) was additionally used.

Pretraining Details

| Model | Batch Size | Train Steps | LR | Max Seq Len | Generator Size | Train Time |
| ---------- | ---------- | ----------- | ---- | ----------- | -------------- | ---------- |
| Base v1,2 | 256 | 700K | 2e-4 | 512 | 0.33 | 7d |
| Base v3 | 256 | 1.5M | 2e-4 | 512 | 0.33 | 14d |
| Small v1,2 | 512 | 300K | 5e-4 | 512 | 1.0 | 3d |
| Small v3 | 512 | 800K | 5e-4 | 512 | 1.0 | 7d |

  • For the KoELECTRA-Small models, the same options as ELECTRA-Small++ in the original paper were used.

    • This is the same setting as the small model distributed with the official ELECTRA code.
    • Also, unlike KoELECTRA-Base, the Generator and Discriminator are the same size.
  • Except for the batch size and train steps, the other hyperparameters are the same as in the original paper. (A hedged example of how these settings map onto the official ELECTRA config is sketched after this list.)

    • I tried changing the other hyperparameters, but keeping them the same as the original paper performed best.
  • A TPU v3-8 was used for pretraining. For more details about using TPUs on GCP, see [Using TPU for Pretraining].
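For reference, here is a hedged sketch of how the Base-v3 row above could be expressed as pretraining hyperparameters. The names follow the official google-research/electra configure_pretraining.py (passed to run_pretraining.py via --hparams as a JSON file or string); they are assumptions for illustration, not taken from this repository:

```python
import json

# Hypothetical mapping of the KoELECTRA-Base-v3 settings onto ELECTRA hparam names
hparams = {
    "model_size": "base",
    "vocab_size": 35000,               # v3 WordPiece vocabulary
    "train_batch_size": 256,
    "num_train_steps": 1500000,        # 1.5M steps
    "learning_rate": 2e-4,
    "max_seq_length": 512,
    "generator_hidden_size": 0.33333,  # generator size relative to the discriminator
}

# Write to a JSON file that can be passed to the official pretraining script
with open("koelectra_base_v3_hparams.json", "w") as f:
    json.dump(hparams, f, indent=2)
```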

KoELECTRA on 🤗 Transformers 🤗

  • ElectraModel is officially supported from Transformers v2.8.0.

  • ElectraModel is similar to BertModel except that it does not return pooled_output.

  • ELECTRA uses the discriminator for finetuning.

1. Pytorch Model & Tokenizer

```python
from transformers import ElectraModel, ElectraTokenizer

model = ElectraModel.from_pretrained("monologg/koelectra-base-discriminator")  # KoELECTRA-Base
model = ElectraModel.from_pretrained("monologg/koelectra-small-discriminator")  # KoELECTRA-Small
model = ElectraModel.from_pretrained("monologg/koelectra-base-v2-discriminator")  # KoELECTRA-Base-v2
model = ElectraModel.from_pretrained("monologg/koelectra-small-v2-discriminator")  # KoELECTRA-Small-v2
model = ElectraModel.from_pretrained("monologg/koelectra-base-v3-discriminator")  # KoELECTRA-Base-v3
model = ElectraModel.from_pretrained("monologg/koelectra-small-v3-discriminator")  # KoELECTRA-Small-v3
```
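As a quick usage sketch (assuming a recent Transformers version that returns model-output objects), ElectraModel yields only per-token hidden states; since there is no pooled_output, a sentence-level vector such as [CLS] is taken from last_hidden_state directly:

```python
import torch
from transformers import ElectraModel, ElectraTokenizer

model = ElectraModel.from_pretrained("monologg/koelectra-base-v3-discriminator")
tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")

inputs = tokenizer("한국어 ELECTRA를 공유합니다.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)        # (1, seq_len, 768) for the Base model
cls_vector = outputs.last_hidden_state[:, 0]  # take the [CLS] position manually
```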

2. Tensorflow v2 Model

```python
from transformers import TFElectraModel

model = TFElectraModel.from_pretrained("monologg/koelectra-base-v3-discriminator", from_pt=True)
```
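Because only PyTorch weights are hosted, from_pt=True converts them on the fly (PyTorch must be installed). A minimal usage sketch, reusing the same tokenizer checkpoint:

```python
from transformers import ElectraTokenizer, TFElectraModel

tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")
model = TFElectraModel.from_pretrained("monologg/koelectra-base-v3-discriminator", from_pt=True)

inputs = tokenizer("한국어 ELECTRA를 공유합니다.", return_tensors="tf")
outputs = model(inputs)                 # TF models accept the BatchEncoding dict directly
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768)
```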

3. Tokenizer Example

```python
>>> from transformers import ElectraTokenizer
>>> tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")
>>> tokenizer.tokenize("[CLS] 한국어 ELECTRA를 공유합니다. [SEP]")
['[CLS]', '한국어', 'EL', '##EC', '##TRA', '##를', '공유', '##합니다', '.', '[SEP]']
>>> tokenizer.convert_tokens_to_ids(['[CLS]', '한국어', 'EL', '##EC', '##TRA', '##를', '공유', '##합니다', '.', '[SEP]'])
[2, 11229, 29173, 13352, 25541, 4110, 7824, 17788, 18, 3]
```

Result on Subtask

These are the results of running with the config as-is; additional hyperparameter tuning may yield better performance.

For the code and more details, see [Finetuning].
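As a rough sketch of how such a finetuning run can be set up with Transformers (the toy batch and labels below are placeholders for illustration, not the repository's actual finetuning code):

```python
import torch
from transformers import ElectraForSequenceClassification, ElectraTokenizer

# Classification head on top of the discriminator, e.g. 2 labels for NSMC sentiment
model = ElectraForSequenceClassification.from_pretrained(
    "monologg/koelectra-base-v3-discriminator", num_labels=2
)
tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")

# Toy batch; in practice the sentences and labels come from the subtask datasets below
batch = tokenizer(["영화 정말 재밌다", "완전 별로였다"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

outputs = model(**batch, labels=labels)
print(outputs.loss, outputs.logits.shape)  # the loss feeds a standard training loop
```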

Base Model

| Model | NSMC (acc) | Naver NER (F1) | PAWS (acc) | KorNLI (acc) | KorSTS (spearman) | Question Pair (acc) | KorQuaD (Dev) (EM/F1) | Korean-Hate-Speech (Dev) (F1) |
| ----- | ---------- | -------------- | ---------- | ------------ | ----------------- | ------------------- | --------------------- | ----------------------------- |
| KoBERT | 89.59 | 87.92 | 81.25 | 79.62 | 81.59 | 94.85 | 51.75 / 79.15 | 66.21 |
| XLM-Roberta-Base | 89.03 | 86.65 | 82.80 | 80.23 | 78.45 | 93.80 | 64.70 / 88.94 | 64.06 |
| HanBERT | 90.06 | 87.70 | 82.95 | 80.32 | 82.73 | 94.72 | 78.74 / 92.02 | 68.32 |
| KoELECTRA-Base | 90.33 | 87.18 | 81.70 | 80.64 | 82.00 | 93.54 | 60.86 / 89.28 | 66.09 |
| KoELECTRA-Base-v2 | 89.56 | 87.16 | 80.70 | 80.72 | 82.30 | 94.85 | 84.01 / 92.40 | 67.45 |
| KoELECTRA-Base-v3 | 90.63 | 88.11 | 84.45 | 82.24 | 85.53 | 95.25 | 84.83 / 93.45 | 67.61 |

Small Model

| Model | NSMC (acc) | Naver NER (F1) | PAWS (acc) | KorNLI (acc) | KorSTS (spearman) | Question Pair (acc) | KorQuaD (Dev) (EM/F1) | Korean-Hate-Speech (Dev) (F1) |
| ----- | ---------- | -------------- | ---------- | ------------ | ----------------- | ------------------- | --------------------- | ----------------------------- |
| DistilKoBERT | 88.60 | 84.65 | 60.50 | 72.00 | 72.59 | 92.48 | 54.40 / 77.97 | 60.72 |
| KoELECTRA-Small | 88.83 | 84.38 | 73.10 | 76.45 | 76.56 | 93.01 | 58.04 / 86.76 | 63.03 |
| KoELECTRA-Small-v2 | 88.83 | 85.00 | 72.35 | 78.14 | 77.84 | 93.27 | 81.43 / 90.46 | 60.14 |
| KoELECTRA-Small-v3 | 89.36 | 85.40 | 77.45 | 78.60 | 80.79 | 94.85 | 82.11 / 91.13 | 63.07 |

Updates

April 27, 2020

  • Added two subtasks (KorSTS, Question Pair) and updated the results for the existing 5 subtasks.

June 3, 2020

  • KoELECTRA-v2 was released for both the Base and Small models, trained with the new vocabulary used in the EnlipleAI PLM. Both models showed improved performance on KorQuaD.

October 9, 2020

  • KoELECTRA-v3 was produced by additionally using Everyone's Corpus. The vocabulary was also newly built using Mecab and WordPiece.
  • Given the official support for ElectraForSequenceClassification in Hugging Face Transformers, the existing subtask results have been updated, and results on Korean-Hate-Speech have been added.
```python
from transformers import ElectraModel, ElectraTokenizer

model = ElectraModel.from_pretrained("monologg/koelectra-base-v3-discriminator")
tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")
```

May 26, 2021

  • Fixed an issue where the model could not be loaded on torch<=1.4 (re-uploaded the model after the fix) (Related Issue)
  • Uploaded the TensorFlow v2 model (tf_model.h5) to the Hugging Face Hub

Oct 20, 2021

  • Removed tf_model.h5 from the Hugging Face Hub (use from_pt=True instead)

Acknowledgement

KoELECTRA was created with Cloud TPU support from the TensorFlow Research Cloud (TFRC) program. KoELECTRA-v3 was also produced with the help of Everyone's Corpus.

Citation

If you use this code for research, please cite:

```bibtex
@misc{park2020koelectra,
  author = {Park, Jangwon},
  title = {KoELECTRA: Pretrained ELECTRA Model for Korean},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/monologg/KoELECTRA}}
}
```

Reference