# Offline Plots

This page shows all plots related to offline evaluation. Use the section headings below to navigate to the plots.

## Metric Explanation

### Exact Match

The Exact Match metric assesses whether a prediction and its ground truth align perfectly, disregarding any leading or trailing whitespace. This yields a binary result per sample, which is averaged over the test set and expressed as a percentage.
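
As a minimal sketch (the helper name, example data, and the averaging loop are assumptions for illustration, not taken from this page), the metric can be computed as follows:

```python
def exact_match(prediction: str, ground_truth: str) -> bool:
    # Compare after stripping leading and trailing whitespace only.
    return prediction.strip() == ground_truth.strip()

# Hypothetical (prediction, ground truth) pairs, reported as a percentage.
pairs = [("return x ", "return x"), ("return x + 1", "return x")]
score = 100 * sum(exact_match(p, g) for p, g in pairs) / len(pairs)
print(f"Exact Match: {score:.1f}%")  # 50.0%
```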

### Edit Similarity

The Edit Similarity is based on the Levenshtein distance: the number of single-character edits required to turn the prediction into an exact match of the ground truth. Three edit operations are allowed, each with a penalty of 1: replacement for a wrong character, deletion for a superfluous character, and insertion for a missing character. The distance is normalized by dividing it by the length of the longer of the two strings, which yields a score between 0 and 1.
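
A pure-Python sketch of the computation is shown below. The final step of subtracting the normalized distance from 1 (so that 1 means an exact match) follows the common convention for an edit *similarity* and is an assumption here, as are the function names and example strings:

```python
def levenshtein(a: str, b: str) -> int:
    # Dynamic-programming edit distance with a cost of 1 per
    # insertion, deletion, or replacement.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete from a
                            curr[j - 1] + 1,             # insert into a
                            prev[j - 1] + (ca != cb)))   # replace if different
        prev = curr
    return prev[-1]

def edit_similarity(prediction: str, ground_truth: str) -> float:
    # Normalize by the longer string; 1.0 indicates an exact match.
    longest = max(len(prediction), len(ground_truth)) or 1
    return 1.0 - levenshtein(prediction, ground_truth) / longest

print(edit_similarity("return x + 1", "return x"))  # ~0.67
```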

### BLEU-4

BLEU-4, a variant of BLEU (BiLingual Evaluation Understudy), is an automated metric originally proposed for assessing machine translation. BLEU-4 compares the tokenized ground truth with the prediction, taking a weighted geometric mean of the 1- to 4-gram precisions with equal weights assigned to each n-gram order. However, the score collapses to zero whenever any n-gram order has no matches, an issue that arises frequently because completions typically span only a single line. To avoid this, we use smoothing. Chen and Cherry proposed and evaluated seven smoothing techniques to better reflect human judgment [1]; of these, we use method 2, which Shi et al. [2] found to better represent the similarity between predictions and ground truth and which is commonly used in the literature [3][4].
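
A sketch using NLTK's sentence-level BLEU with smoothing method 2; the whitespace tokenization and example strings are simplifying assumptions:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu4(prediction: str, ground_truth: str) -> float:
    # Equal weights for 1- to 4-grams; smoothing method 2 (Chen & Cherry)
    # keeps short, single-line completions from scoring zero.
    reference = ground_truth.split()   # naive whitespace tokenization
    candidate = prediction.split()
    return sentence_bleu([reference], candidate,
                         weights=(0.25, 0.25, 0.25, 0.25),
                         smoothing_function=SmoothingFunction().method2)

print(bleu4("return x + 1", "return x + 2"))
```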

### METEOR

The METEOR (Metric for Evaluation of Translation with Explicit ORdering) score employs both precision and recall to evaluate the alignment of unigrams in the ground truth and prediction. Precision is the fraction of unigrams in the prediction that also appear in the ground truth, while recall is the fraction of unigrams in the ground truth that appear in the prediction. METEOR weights recall roughly nine times more heavily than precision in its harmonic mean. Banerjee and Lavie [5] argue that METEOR emulates human judgment more accurately than BLEU, and Lavie et al. [6] demonstrated that recall correlates more closely with human judgment than precision.
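
A sketch using NLTK's METEOR implementation. Recent NLTK versions expect pre-tokenized input, and the implementation relies on WordNet for synonym matching, so the corpus may need to be downloaded first; the whitespace tokenization is again a simplifying assumption:

```python
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet")  # needed for METEOR's synonym matching

def meteor(prediction: str, ground_truth: str) -> float:
    # Recent NLTK versions expect pre-tokenized references and hypothesis.
    return meteor_score([ground_truth.split()], prediction.split())

print(meteor("return x + 1", "return x"))
```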

### ROUGE-L

ROUGE-L is a variant of ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [7] that compares the tokenized ground truth with the prediction based on their longest common subsequence, where each code token acts as a unigram. ROUGE-L computes both precision and recall over this subsequence and combines them into an F1 score.
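
A self-contained sketch of the LCS-based F1 computation; the function name, whitespace tokenization, and example strings are assumptions for illustration:

```python
def rouge_l(prediction: str, ground_truth: str) -> float:
    # Token-level longest common subsequence (LCS), combined into an F1
    # score over LCS-based precision and recall.
    pred, ref = prediction.split(), ground_truth.split()
    # Dynamic-programming table for the LCS length.
    dp = [[0] * (len(ref) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred, start=1):
        for j, r in enumerate(ref, start=1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if p == r else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)   # fraction of prediction tokens in the LCS
    recall = lcs / len(ref)       # fraction of ground-truth tokens in the LCS
    return 2 * precision * recall / (precision + recall)

print(rouge_l("return x + 1", "return x"))  # ~0.67
```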

## Performance per Language

### Exact Match

#### Random Test Set

*Plot: Exact Match Performance per Language (Random)*

#### Trigger Point Test Set

*Plot: Exact Match Performance per Language (Trigger Point)*

### Edit Similarity

#### Random Test Set

*Plot: Edit Similarity Performance per Language (Random)*

#### Trigger Point Test Set

*Plot: Edit Similarity Performance per Language (Trigger Point)*

### BLEU-4

#### Random Test Set

*Plot: BLEU-4 Performance per Language (Random)*

#### Trigger Point Test Set

*Plot: BLEU-4 Performance per Language (Trigger Point)*

### METEOR

#### Random Test Set

*Plot: METEOR Performance per Language (Random)*

#### Trigger Point Test Set

*Plot: METEOR Performance per Language (Trigger Point)*

### ROUGE-L

#### Random Test Set

*Plot: ROUGE-L Performance per Language (Random)*

#### Trigger Point Test Set

*Plot: ROUGE-L Performance per Language (Trigger Point)*

## Performance per Trigger Point

### Exact Match

*Plot: Exact Match Performance per Trigger Point*

### Edit Similarity

*Plot: Edit Similarity Performance per Trigger Point*

### BLEU-4

*Plot: BLEU-4 Performance per Trigger Point*

### METEOR

*Plot: METEOR Performance per Trigger Point*

### ROUGE-L

*Plot: ROUGE-L Performance per Trigger Point*

## Footnotes

1. B. Chen and C. Cherry, "A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU".
2. E. Shi et al., "On the Evaluation of Neural Code Summarization".
3. S. Lu et al., "CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation".
4. Y. Wang, W. Wang, S. Joty, and S. C. Hoi, "CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation".
5. S. Banerjee and A. Lavie, "METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments".
6. A. Lavie, K. Sagae, and S. Jayaraman, "The Significance of Recall in Automatic Metrics for MT Evaluation".
7. C.-Y. Lin, "ROUGE: A Package for Automatic Evaluation of Summaries".