This folder contains the full evaluation pipeline for measuring correlation with functional correctness.
cd evaluation
LANG=java # cpp, python, js
MODEL_LANG=java # if LANG is js, use javascript
LAYER=7
We construct a multilingual HumanEval dataset from MultiPL-E and HumanEval-X:
python process_data.py \
--lang $LANG \
--config davinci-0.8-keep
This script takes
- the generation results provided by MultiPL-E (example)
- the reference code in the corresponding language from HumanEval-X (example)

and constructs text files of source, reference, and target (example).
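As a rough illustration of the step described above, the sketch below writes three line-aligned text files with one line per generated candidate. The record keys (`prompt`, `reference`, `generations`), the `.src`/`.ref`/`.tgt` suffixes, and the newline escaping are all assumptions made for this example; they are not necessarily the layout that `process_data.py` produces.

```python
import os


def escape(code: str) -> str:
    """Keep each example on one line by escaping backslashes and newlines."""
    return code.replace("\\", "\\\\").replace("\n", "\\n")


def write_aligned(records, folder, prefix):
    """Write <prefix>.src, <prefix>.ref and <prefix>.tgt, line-aligned.

    `records` is a hypothetical list of dicts with the keys
    "prompt", "reference" and "generations" (a list of strings).
    Returns the file paths and the number of lines written per file.
    """
    os.makedirs(folder, exist_ok=True)
    paths = {k: os.path.join(folder, f"{prefix}.{k}") for k in ("src", "ref", "tgt")}
    n = 0
    with open(paths["src"], "w", encoding="utf-8") as fs, \
         open(paths["ref"], "w", encoding="utf-8") as fr, \
         open(paths["tgt"], "w", encoding="utf-8") as ft:
        for rec in records:
            # Repeat the source and reference once per candidate so that
            # line i of every file refers to the same example.
            for gen in rec["generations"]:
                fs.write(escape(rec["prompt"]) + "\n")
                fr.write(escape(rec["reference"]) + "\n")
                ft.write(escape(gen) + "\n")
                n += 1
    return paths, n
```

Keeping the three files line-aligned means a scorer can read them in parallel and pair line i of each file without any extra index.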
python run_score.py \
--lang $LANG \
--model neulab/codebert-$MODEL_LANG \
--device cuda:0 \
--d_folder data/humaneval_${LANG}_davinci-0.8-keep \
--d_prefix humaneval \
--idf_path data/idf/${LANG}_idf.pkl \
--layer $LAYER
The detailed configurations for each language are provided here.
python calculate_correlation.py \
--lang $LANG \
--d_folder data/humaneval_${LANG}_davinci-0.8-keep \
--d_prefix humaneval \
--result_file humaneval_codebert-${MODEL_LANG}_L${LAYER}_idf.score.json
It will output the Kendall-Tau, Spearman, and Pearson correlations with functional correctness.
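For readers unfamiliar with the three statistics, here is a minimal pure-Python sketch of how each can be computed between metric scores and pass/fail labels. It is an illustration only and not the code in `calculate_correlation.py`; the function names are hypothetical, and the Kendall variant shown is tau-b, chosen here because binary correctness labels contain many ties.

```python
import math


def pearson(x, y):
    """Pearson linear correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(vx * vy)
    return cov / denom if denom else float("nan")


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b via an O(n^2) pair count, with the tie correction."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from both factors
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom if denom else float("nan")


# Made-up example: metric scores for five candidates and whether each passed.
scores = [0.91, 0.40, 0.75, 0.62, 0.30]
passed = [1, 0, 1, 0, 0]
print(kendall_tau_b(scores, passed), spearman(scores, passed), pearson(scores, passed))
```

Pearson measures linear association, while Spearman and Kendall depend only on the ordering of the scores, which is why rank-based statistics are often reported alongside Pearson when the labels are binary.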