Implementation of ConGen: Unsupervised Control and Generalization Distillation For Sentence Representation (Findings of EMNLP 2022).
@inproceedings{limkonchotiwat-etal-2022-congen,
title = "{ConGen}: Unsupervised Control and Generalization Distillation For Sentence Representation",
author = "Limkonchotiwat, Peerat and
Ponwitayarat, Wuttikorn and
Lowphansirikul, Lalita and
Udomcharoenchaikit, Can and
Chuangsuwanich, Ekapol and
Nutanong, Sarana",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
year = "2022",
publisher = "Association for Computational Linguistics",
}
- We have a new version of ConGen: SCT (published in TACL 2023).
- SCT outperforms ConGen in distillation settings.
- SCT also lets a small model learn sentence embeddings without a teacher model!
git clone https://github.com/KornWtp/ConGen.git
cd ConGen
pip install -e .
- ConGen-BERT-Tiny
- ConGen-BERT-Mini
- ConGen-TinyBERT-L4
- ConGen-MiniLM-L3
- ConGen-MiniLM-L6
- ConGen-BERT-Small
- ConGen-MiniLM-L12
- ConGen-TinyBERT-L6
- ConGen-BERT-base
- ConGen-RoBERTa-base
- ConGen-Multilingual-DistilBERT
- ConGen-Multilingual-MiniLM-L12
We use the training data from BSL's paper: monolingual version and multilingual version.
We use the STS-B development set from Sentence Transformers.
The full model parameters:
Models | Teacher Temp | Student Temp | Queue Size | Learning Rate |
---|---|---|---|---|
BERT-Tiny | 0.05 | 0.05 | 16384 | 5e-4 |
BERT-Mini | 0.05 | 0.07 | 16384 | 3e-4 |
Tiny-BERT-L4 | 0.05 | 0.05 | 65536 | 1e-4 |
MiniLM-L3 | 0.05 | 0.07 | 16384 | 5e-4 |
MiniLM-L6 | 0.05 | 0.07 | 65536 | 3e-4 |
BERT-Small | 0.05 | 0.07 | 65536 | 3e-4 |
MiniLM-L12 | 0.05 | 0.07 | 16384 | 5e-5 |
Tiny-BERT-L6 | 0.05 | 0.07 | 65536 | 5e-5 |
BERT-base | 0.05 | 0.07 | 65536 | 5e-5 |
RoBERTa-base | 0.1 | 0.1 | 1024 | 5e-5 |
Multilingual-DistilBERT | 0.05 | 0.07 | 65536 | 3e-4 |
Multilingual-MiniLM-L12 | 0.05 | 0.07 | 65536 | 3e-4 |
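As a rough illustration of what these parameters control, here is a minimal sketch of a queue-based distillation loss. It is an assumption made for illustration, not the training code in this repository: the function names, the single shared queue, and the toy vectors are all hypothetical. The teacher and the student each score a sentence against every entry in a queue, each list of similarities is turned into a distribution with its own temperature, and the loss is the cross-entropy between the two distributions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def similarity_distribution(embedding, queue, temperature):
    """Softmax over temperature-scaled cosine similarities to every queue entry."""
    logits = [cosine(embedding, q) / temperature for q in queue]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_emb, student_emb, queue, teacher_temp, student_temp):
    """Cross-entropy between the teacher and student distributions over the queue."""
    p_teacher = similarity_distribution(teacher_emb, queue, teacher_temp)
    p_student = similarity_distribution(student_emb, queue, student_temp)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

# Toy data (hypothetical): a queue of 3 entries instead of 16384 or 65536.
queue = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
teacher_emb = [1.0, 0.2]
loss_close = distillation_loss(teacher_emb, [1.0, 0.2], queue, 0.05, 0.05)
loss_far = distillation_loss(teacher_emb, [0.1, 1.0], queue, 0.05, 0.05)
print(f"student matches teacher: {loss_close:.3f}")
print(f"student far from teacher: {loss_far:.3f}")
```

In this sketch, a lower temperature sharpens the corresponding distribution, and the queue size sets how many stored embeddings each sentence is compared against.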
Please set the model's parameters before training.
>> bash train_congen.sh
To tune the model parameters, use the following search space:
learning_rate_all=(3e-4 5e-4 1e-4 3e-5 5e-5 1e-5)
queue_sizes=(262144 131072 65536 16384 1024)
teacher_temps=(0.01 0.03 0.05 0.07 0.09 0.1)
student_temps=(0.01 0.03 0.05 0.07 0.09 0.1)
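To get a feel for the size of this search space, the sketch below enumerates every combination of the four lists in Python. Whether `train_congen.sh` actually sweeps the full Cartesian product is an assumption here; the dictionary keys and the `build_grid` helper are hypothetical names used only for illustration.

```python
from itertools import product

# Values copied from the shell arrays above.
learning_rates = [3e-4, 5e-4, 1e-4, 3e-5, 5e-5, 1e-5]
queue_sizes = [262144, 131072, 65536, 16384, 1024]
teacher_temps = [0.01, 0.03, 0.05, 0.07, 0.09, 0.1]
student_temps = [0.01, 0.03, 0.05, 0.07, 0.09, 0.1]

def build_grid():
    """Return one config dict per combination (assumes a full Cartesian sweep)."""
    return [
        {
            "learning_rate": lr,
            "queue_size": queue_size,
            "teacher_temp": teacher_temp,
            "student_temp": student_temp,
        }
        for lr, queue_size, teacher_temp, student_temp in product(
            learning_rates, queue_sizes, teacher_temps, student_temps
        )
    ]

grid = build_grid()
print(len(grid))  # 6 * 5 * 6 * 6 = 1080 combinations
```

A full sweep is 1080 training runs per student model, so a coarser search over a subset of these values may be more practical.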
Our evaluation code for sentence embeddings is based on a modified version of SentEval and SimCSE.
Before evaluation, please download the evaluation datasets by running:
cd SentEval/data/downstream/
bash download_dataset.sh
Please see https://github.com/KornWtp/ConGen/tree/main/notebook
Then return to the root directory. You can evaluate any Sentence Transformers
model using the SimCSE evaluation code. For example:
python evaluation.py \
--model_name_or_path "your-model-path" \
--task_set sts \
--mode test
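For readers unfamiliar with the STS metric, the sketch below shows how a score of this kind is commonly computed: take the cosine similarity of each sentence pair's embeddings, then report Spearman's rank correlation against the gold scores, multiplied by 100. This is a self-contained illustration, not the code in `evaluation.py`; the helper names and the toy embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def sts_score(embedding_pairs, gold_scores):
    """Spearman correlation (*100) between cosine similarities and gold scores."""
    predicted = [cosine(a, b) for a, b in embedding_pairs]
    return 100 * spearman(predicted, gold_scores)

# Toy example (hypothetical embeddings and gold scores).
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [1.0, 1.0]), ([1.0, 0.0], [0.0, 1.0])]
gold = [5.0, 3.0, 0.5]
print(f"{sts_score(pairs, gold):.2f}")
```

Because only ranks matter, the score does not depend on the scale of the gold labels, only on how well the model orders the sentence pairs.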
In our paper, we average the scores over three models. The Semantic Textual Similarity (STS) average scores are shown below:
Methods | BERT-Tiny | BERT-Mini | Tiny-BERT-L4 | MiniLM-L3 | MiniLM-L6 | BERT-Small | MiniLM-L12 | Tiny-BERT-L6 | BERT-base | RoBERTa-base |
---|---|---|---|---|---|---|---|---|---|---|
#Param (M) | 4 | 11 | 14 | 17 | 22 | 29 | 33 | 67 | 109 | 125 |
Finetuning-based | | | | | | | | | | |
Teacher | SimCSE-Unsup-RoBERTa-large: 78.90 | | | | | | | | | |
Sup-SimCSE | 72.35 | 76.52 | 78.19 | 76.49 | 78.86 | 78.59 | 80.48 | 81.23 | 81.57 | 82.52 |
Unsup-SimCSE | 64.47 | 65.94 | 67.91 | 55.10 | 59.15 | 69.13 | 67.90 | 73.67 | 76.25 | 77.10 |
Distillation-based | | | | | | | | | | |
L2 | 73.32 | 76.07 | 77.03 | 76.66 | 77.51 | 77.30 | 78.79 | 78.95 | 78.97 | 79.00 |
Making | 70.76 | 74.42 | 76.39 | 75.34 | 74.74 | 76.92 | 76.91 | 78.67 | 78.07 | 79.06 |
SKD | 68.83 | 72.02 | 73.05 | 72.66 | 73.59 | 75.06 | 74.58 | 77.62 | 78.05 | 77.44 |
CKD | 76.19 | 76.59 | 77.48 | 77.14 | 77.90 | 76.97 | 77.92 | 78.29 | 78.54 | 78.34 |
Our proposed method | | | | | | | | | | |
ConGen | 76.85 | 78.09 | 78.54 | 78.22 | 79.10 | 78.91 | 79.68 | 79.73 | 80.06 | 79.78 |
Models | STS-12 | STS-13 | STS-14 | STS-15 | STS-16 | STS-B | SICK-R | Avg. |
---|---|---|---|---|---|---|---|---|
BERT-Tiny | 72.18 | 81.12 | 75.45 | 83.22 | 77.89 | 79.03 | 69.05 | 76.85 |
BERT-Mini | 74.17 | 82.69 | 76.58 | 84.30 | 78.23 | 80.84 | 69.82 | 78.09 |
Tiny-BERT-L4 | 74.30 | 83.07 | 77.37 | 84.70 | 79.06 | 80.99 | 70.26 | 78.54 |
MiniLM-L3 | 74.00 | 82.93 | 76.58 | 84.35 | 78.57 | 81.00 | 70.09 | 78.22 |
MiniLM-L6 | 75.06 | 83.86 | 77.29 | 85.01 | 79.67 | 81.92 | 70.89 | 79.10 |
BERT-Small | 74.50 | 83.58 | 77.29 | 84.83 | 79.72 | 81.93 | 70.55 | 78.91 |
MiniLM-L12 | 75.25 | 84.61 | 78.27 | 85.51 | 80.52 | 82.32 | 71.32 | 79.68 |
Tiny-BERT-L6 | 75.53 | 84.76 | 78.33 | 85.72 | 80.42 | 82.25 | 71.12 | 79.73 |
BERT-base | 75.58 | 85.13 | 78.54 | 85.75 | 81.12 | 82.81 | 71.47 | 80.06 |
RoBERTa-base | 75.32 | 84.56 | 77.26 | 85.33 | 81.34 | 82.67 | 72.00 | 79.78 |
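The Avg. column appears to be the unweighted mean of the seven task scores in each row. As a quick sanity check, the sketch below recomputes it for three rows copied from the table above; the `rows` dictionary and `mean_score` helper are illustrative names, not part of this repository.

```python
# (seven task scores, reported Avg.) copied from the table above.
rows = {
    "BERT-Tiny": ([72.18, 81.12, 75.45, 83.22, 77.89, 79.03, 69.05], 76.85),
    "BERT-Mini": ([74.17, 82.69, 76.58, 84.30, 78.23, 80.84, 69.82], 78.09),
    "RoBERTa-base": ([75.32, 84.56, 77.26, 85.33, 81.34, 82.67, 72.00], 79.78),
}

def mean_score(scores):
    """Unweighted arithmetic mean of the task scores."""
    return sum(scores) / len(scores)

for name, (scores, reported) in rows.items():
    print(f"{name}: mean={mean_score(scores):.2f} reported={reported:.2f}")
```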
- Unsupervised learning: ConGen-simcse-model-roberta-base-thai. Teacher model: simcse-model-roberta-base-thai. Student model: WangchanBERTa
- Weakly supervised learning: ConGen-paraphrase-multilingual-mpnet-base-v2. Teacher model: paraphrase-multilingual-mpnet-base-v2. Student model: WangchanBERTa
- Training data: we back-translate (TH to EN to TH) using the scb_mt_enth_2020 model. The translation dataset: back-translated machine translation data from SCB
- We evaluate on two benchmarks: the Thai semantic textual similarity benchmark and the Thai transfer benchmark
Parameters | Models | Teacher Temp | Student Temp | Queue Size | Learning Rate |
---|---|---|---|---|---|
<30M | ConGen-WangchanBERT-Tiny | 0.01 | 0.01 | 65536 | 3e-4 |
<30M | ConGen-WangchanBERT-Small | 0.05 | 0.09 | 65536 | 5e-4 |
>100M | ConGen-simcse-model-roberta-base-thai | 0.05 | 0.03 | 65536 | 3e-4 |
>100M | ConGen-paraphrase-multilingual-mpnet-base-v2 | 0.05 | 0.05 | 262144 | 1e-4 |
Parameters | Models | Spearman's Correlation (*100) |
---|---|---|
<30M | ConGen-WangchanBERT-Tiny | 66.43 |
<30M | ConGen-WangchanBERT-Small | 70.65 |
>100M | ConGen-simcse-model-roberta-base-thai | 66.21 |
>100M | ConGen-paraphrase-multilingual-mpnet-base-v2 | 76.56 |
Parameters | Models | Acc (*100) | F1 (*100, weighted) |
---|---|---|---|
<30M | ConGen-WangchanBERT-Tiny | 61.55 | 62.19 |
<30M | ConGen-WangchanBERT-Small | 64.77 | 65.30 |
>100M | ConGen-simcse-model-roberta-base-thai | 65.07 | 65.28 |
>100M | ConGen-paraphrase-multilingual-mpnet-base-v2 | 67.84 | 68.31 |
Parameters | Models | Acc (*100) | F1 (*100, weighted) |
---|---|---|---|
<30M | ConGen-WangchanBERT-Tiny | 42.67 | 44.78 |
<30M | ConGen-WangchanBERT-Small | 43.38 | 45.99 |
>100M | ConGen-simcse-model-roberta-base-thai | 41.32 | 41.57 |
>100M | ConGen-paraphrase-multilingual-mpnet-base-v2 | 47.22 | 48.63 |
Parameters | Models | Acc (*100) | F1 (*100, weighted) |
---|---|---|---|
<30M | ConGen-WangchanBERT-Tiny | 54.26 | 52.69 |
<30M | ConGen-WangchanBERT-Small | 58.22 | 57.03 |
>100M | ConGen-simcse-model-roberta-base-thai | 49.81 | 47.94 |
>100M | ConGen-paraphrase-multilingual-mpnet-base-v2 | 58.00 | 56.80 |
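The transfer tables above report accuracy and weighted F1. As a reminder of how these two metrics relate, here is a minimal sketch under the usual definition of weighted F1 (per-class F1 averaged with weights equal to each class's share of the true labels). It is not the evaluation code used for these results, and the labels in the toy example are hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * count
    return total / len(y_true)

# Toy example (hypothetical labels).
y_true = ["pos", "pos", "neg", "neu"]
y_pred = ["pos", "neg", "neg", "neu"]
print(f"Acc (*100): {100 * accuracy(y_true, y_pred):.2f}")
print(f"F1 (*100, weighted): {100 * weighted_f1(y_true, y_pred):.2f}")
```

Weighted F1 can be higher or lower than accuracy depending on the class balance, which is why both are reported.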