New: Our paper has been accepted at ICSE 2022. A preprint is available on arXiv!
This repository contains code and pre-trained models for VarCLR, a contrastive-learning-based approach to learning semantic representations of variable names. VarCLR effectively captures variable similarity and achieves state-of-the-art results on the IdBench benchmark (ICSE 2021).
- VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning
pip install -e .
from varclr.models.model import Encoder
model = Encoder.from_pretrained("varclr-codebert")
emb = model.encode("squareslab")
print(emb.shape)
# torch.Size([1, 768])
emb = model.encode(["squareslab", "strudel"])
print(emb.shape)
# torch.Size([2, 768])
print(model.score("squareslab", "strudel"))
# [0.42812108993530273]
print(model.score(["squareslab", "average", "max", "max"], ["strudel", "mean", "min", "maximum"]))
# [0.42812108993530273, 0.8849745988845825, 0.8035818338394165, 0.889922022819519]
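The README does not state how `score` is computed, but the values are consistent with cosine similarity between the encoded vectors. As a sanity check (this is our assumption, not documented behavior), you can recompute the first score from `encode`:

import torch.nn.functional as F

# Hypothetical check: if score is cosine similarity of encode() outputs,
# this should print roughly 0.428, matching model.score above.
emb1 = model.encode("squareslab")  # shape [1, 768]
emb2 = model.encode("strudel")     # shape [1, 768]
print(F.cosine_similarity(emb1, emb2).item())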
variable_list = ["squareslab", "strudel", "neulab"]
print(model.cross_score("squareslab", variable_list))
# [[1.0000007152557373, 0.4281214475631714, 0.7207341194152832]]
print(model.cross_score(variable_list, variable_list))
# [[1.0000007152557373, 0.4281214475631714, 0.7207341194152832],
# [0.4281214475631714, 1.0000004768371582, 0.549992561340332],
# [0.7207341194152832, 0.549992561340332, 1.000000238418579]]
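Since `cross_score` returns a full score matrix, it is easy to rank a list of candidate names against a query. A minimal sketch using only the API shown above (the candidate list here is made up for illustration):

# Rank candidates by similarity to a query name (illustrative example).
query = "mean"
candidates = ["average", "minimum", "total", "squareslab"]
scores = model.cross_score(query, candidates)[0]  # one row: query vs. each candidate
ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
print(ranked)  # candidates ordered from most to least similar to "mean"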
from varclr.benchmarks import Benchmark
# Similarity on IdBench-Medium
b1 = Benchmark.build("idbench", variant="medium", metric="similarity")
# Relatedness on IdBench-Large
b2 = Benchmark.build("idbench", variant="large", metric="relatedness")
id1_list, id2_list = b1.get_inputs()
predicted = model.score(id1_list, id2_list)
print(b1.evaluate(predicted))
# {'spearmanr': 0.5248567181503295, 'pearsonr': 0.5249843473193132}
print(b2.evaluate(model.score(*b2.get_inputs())))
# {'spearmanr': 0.8012168379981921, 'pearsonr': 0.8021791703187449}
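If you want numbers for every IdBench split at once, the `Benchmark` API shown above makes the loop straightforward. A small convenience sketch (the variant and metric names are the ones used above; we have not verified that all six combinations exist):

# Evaluate the model on all IdBench variants and both metrics.
for variant in ["small", "medium", "large"]:
    for metric in ["similarity", "relatedness"]:
        b = Benchmark.build("idbench", variant=variant, metric=metric)
        results = b.evaluate(model.score(*b.get_inputs()))
        print(variant, metric, results)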
Let's compare with the original CodeBERT:
codebert = Encoder.from_pretrained("codebert")
print(b1.evaluate(codebert.score(*b1.get_inputs())))
# {'spearmanr': 0.2056582946575104, 'pearsonr': 0.1995058696927054}
print(b2.evaluate(codebert.score(*b2.get_inputs())))
# {'spearmanr': 0.3909218857993804, 'pearsonr': 0.3378219622284688}
You can pretrain the same VarCLR model variants yourself with the following commands:
python -m varclr.pretrain --model avg --name varclr-avg
python -m varclr.pretrain --model lstm --name varclr-lstm
python -m varclr.pretrain --model bert --name varclr-codebert --sp-model split --last-n-layer-output 4 --batch-size 64 --lr 1e-5 --epochs 1
Training progress and test results are logged to the wandb dashboard, where you can also inspect the training curves for reference.
Results on IdBench benchmarks

Similarity:

Method | Small | Medium | Large |
---|---|---|---|
FT-SG | 0.30 | 0.29 | 0.28 |
LV | 0.32 | 0.30 | 0.30 |
FT-cbow | 0.35 | 0.38 | 0.38 |
VarCLR-Avg | 0.47 | 0.45 | 0.44 |
VarCLR-LSTM | 0.50 | 0.49 | 0.49 |
VarCLR-CodeBERT | 0.53 | 0.53 | 0.51 |
Combined-IdBench | 0.48 | 0.59 | 0.57 |
Combined-VarCLR | 0.66 | 0.65 | 0.62 |
Relatedness:

Method | Small | Medium | Large |
---|---|---|---|
LV | 0.48 | 0.47 | 0.48 |
FT-SG | 0.70 | 0.71 | 0.68 |
FT-cbow | 0.72 | 0.74 | 0.73 |
VarCLR-Avg | 0.67 | 0.66 | 0.66 |
VarCLR-LSTM | 0.71 | 0.70 | 0.69 |
VarCLR-CodeBERT | 0.79 | 0.79 | 0.80 |
Combined-IdBench | 0.71 | 0.78 | 0.79 |
Combined-VarCLR | 0.79 | 0.81 | 0.85 |
If you find VarCLR useful in your research, please cite our ICSE 2022 paper:
@inproceedings{ChenVarCLR2022,
  author = {Chen, Qibin and Lacomis, Jeremy and Schwartz, Edward J. and Neubig, Graham and Vasilescu, Bogdan and {Le~Goues}, Claire},
  title = {{VarCLR}: {Variable} Semantic Representation Pre-training via Contrastive Learning},
  booktitle = {International Conference on Software Engineering},
  year = {2022},
  series = {ICSE '22}
}