Evaluation of CodeBertScore

This folder contains the full pipeline for evaluating how well CodeBertScore correlates with functional correctness.

cd evaluation
LANG=java # cpp, python, js
MODEL_LANG=java # if LANG is js, use javascript
LAYER=7
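
The commands in the steps that follow combine these variables into model names, folder names, and file names. As a rough illustration of the intended expansions, here is a minimal Python sketch. The names `pipeline_paths` and `MODEL_LANG_OVERRIDES` are hypothetical and not part of this repository, and the sketch assumes that `js` is the only language whose model suffix differs from `LANG`.

```python
from pathlib import PurePosixPath

# Assumption: only "js" needs a different model suffix ("javascript");
# every other language reuses LANG as MODEL_LANG.
MODEL_LANG_OVERRIDES = {"js": "javascript"}


def pipeline_paths(lang: str, layer: int, config: str = "davinci-0.8-keep") -> dict:
    """Return the names the pipeline commands are expected to use for one language.

    Hypothetical helper for illustration only; the patterns mirror the
    command-line arguments shown in this README.
    """
    model_lang = MODEL_LANG_OVERRIDES.get(lang, lang)
    return {
        "model": f"neulab/codebert-{model_lang}",
        "d_folder": str(PurePosixPath("data") / f"humaneval_{lang}_{config}"),
        "idf_path": str(PurePosixPath("data") / "idf" / f"{lang}_idf.pkl"),
        "result_file": f"humaneval_codebert-{model_lang}_L{layer}_idf.score.json",
    }


if __name__ == "__main__":
    # Print the expanded names for the JavaScript configuration.
    for key, value in pipeline_paths("js", 7).items():
        print(f"{key}: {value}")
```

Printing the expanded names before a run is a cheap way to check that each variable landed where it was meant to.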

1. Data preparation

We construct a multilingual HumanEval dataset from MultiPL-E and HumanEval-X.

python process_data.py \
    --lang $LANG \
    --config davinci-0.8-keep

This script takes

- generation results provided by MultiPL-E (example)
- reference code in the corresponding language from HumanEval-X (example)

and constructs a text file of source, reference and target (example).

2. Calculate CodeBertScore

python run_score.py \
    --lang $LANG \
    --model neulab/codebert-$MODEL_LANG \
    --device cuda:0 \
    --d_folder data/humaneval_${LANG}_davinci-0.8-keep \
    --d_prefix humaneval \
    --idf_path data/idf/${LANG}_idf.pkl \
    --layer $LAYER

The detailed configurations for each language are provided here

3. Calculate correlation with functional correctness

python calculate_correlation.py \
    --lang $LANG \
    --d_folder data/humaneval_${LANG}_davinci-0.8-keep \
    --d_prefix humaneval \
    --result_file humaneval_codebert-${MODEL_LANG}_L${LAYER}_idf.score.json

It will output the Kendall-Tau, Spearman, and Pearson correlations with functional correctness.
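
To make the three statistics concrete, here is a minimal pure-Python sketch on toy data. It is not the code in `calculate_correlation.py`. The function names, the toy scores, and the binary pass/fail encoding of functional correctness are all assumptions made for illustration, as is the choice of the tau-b variant of Kendall's tau. In practice, `scipy.stats.kendalltau`, `scipy.stats.spearmanr`, and `scipy.stats.pearsonr` compute the same quantities.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b: pairwise agreement in ordering, corrected for ties."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: counted in neither term
        elif dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom


if __name__ == "__main__":
    # Toy data (made up): one metric score per generated program, and
    # whether that program passed its tests (1) or not (0).
    scores = [0.91, 0.62, 0.78, 0.55, 0.85]
    passed = [1, 0, 1, 0, 1]
    print("kendall tau-b:", kendall_tau_b(scores, passed))
    print("spearman:", spearman(scores, passed))
    print("pearson:", pearson(scores, passed))
```

Because a pass/fail label takes only two values, many pairs are tied on correctness, which is why the sketch uses a tie-aware variant of Kendall's tau and average ranks for Spearman.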
