CoTran: An LLM-based Code Translator using Reinforcement Learning with Feedback from Compiler and Symbolic Execution

Accepted for publication at ECAI-2024 (27th European Conference on Artificial Intelligence), 19-24 October 2024, Santiago de Compostela, Spain

Pre-print and supplementary available at: https://arxiv.org/abs/2306.06755

Abstract

In this paper, we present an LLM-based code translation method and an associated tool called CoTran that translates whole programs from one high-level programming language to another. Existing LLM-based code translation methods lack training to ensure that the translated code reliably compiles or bears substantial functional equivalence to the input code. In our work, we fine-tune an LLM using reinforcement learning, incorporating compiler feedback and symbolic execution (symexec)-based testing feedback to assess functional equivalence between the input and output programs. The idea is to guide an LLM during fine-tuning, via compiler and symexec-based testing feedback, helping it judge how far it is from producing perfect translations. We conduct extensive experiments comparing CoTran with 14 other code translation tools, including human-written transpilers, LLM-based translation tools, and ChatGPT. Using a benchmark of over 57,000 code pairs in Java and Python, we demonstrate that CoTran outperforms the other tools on relevant metrics such as compilation accuracy (CompAcc) and functional equivalence accuracy (FEqAcc). For example, in Python-to-Java translation, CoTran achieves 48.68% FEqAcc and 76.98% CompAcc, whereas the nearest competing tool (PLBART-base) gets 38.26% and 75.77%, respectively. Additionally, built upon CodeT5, CoTran improves FEqAcc by +12.94% and +14.89%, and CompAcc by +4.30% and +8.14%, for Java-to-Python and Python-to-Java translation, respectively.
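
As an informal illustration of the feedback signals described above, the sketch below shows how compiler feedback and symexec/test-based feedback could be folded into a single scalar reward for RL fine-tuning. This is a minimal sketch for exposition only: the helper names, the weighting, and the assumption that the translated Java program declares a public class Main are ours, not CoTran's exact formulation.

    # Illustrative sketch only, not CoTran's actual reward: it combines
    # (a) compiler feedback (does the translated Java program compile?) and
    # (b) test feedback (on what fraction of test cases do source and
    # translation agree?) into one scalar for RL fine-tuning.
    import os
    import subprocess
    import tempfile


    def compiler_reward(java_code: str) -> float:
        """1.0 if the translation compiles with javac, else 0.0.
        Assumes javac is on PATH and the public class is named Main."""
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "Main.java")
            with open(path, "w") as f:
                f.write(java_code)
            result = subprocess.run(["javac", path], capture_output=True)
            return 1.0 if result.returncode == 0 else 0.0


    def test_agreement_reward(passed: int, total: int) -> float:
        """Fraction of test cases on which the translation's output matches
        the source program's output."""
        return passed / total if total else 0.0


    def combined_reward(java_code: str, passed: int, total: int,
                        w_compile: float = 0.5, w_tests: float = 0.5) -> float:
        # The weights are placeholders; the paper's feedback signal differs.
        return (w_compile * compiler_reward(java_code)
                + w_tests * test_agreement_reward(passed, total))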

Files

This repository contains the main paper (CoTran_main.pdf), the appendix (CoTran_appendix.pdf), the AVATAR-TC dataset, and all of our code. We have made a significant effort to make the code easy to run. This README details the folder structure, library dependencies, and steps for running the code. The repository also includes the P2J and J2P translations generated by the SoTA methods and the CoTran variants.

Dependencies

The repository is developed in Python; the following tool versions were used:

  • java/14.0.2
  • mono/6.12.0.122
  • python/3.10
  • arrow/13.0.0
  • rust/1.70.0

For the Python libraries, refer to the requirements files provided in the repository. You can also directly set up the virtual environments by following the setup steps below.

Setup

  • For CoTran (baseline), CoTran+, CoTranx:

    virtualenv --no-download --clear ~/cotran
    source ~/cotran/bin/activate
    pip install -r requirements_cotran.txt
    deactivate
    
  • For CoTran+CF, CoTran+CF+SF:

    virtualenv --no-download --clear ~/cotranRL
    source ~/cotranRL/bin/activate
    pip install -r requirements_cotranRL.txt
    deactivate
    

Training & Evaluating the CoTran (baseline), CoTran+, CoTranx Models

All the code for CoTran (baseline) is implemented in PyTorch Lightning and supports multi-GPU as well as multi-node training. During our experiments, we trained on two compute nodes, each comprising four NVIDIA V100 GPUs with 32 GB memory and six CPU cores per GPU. With a per-GPU batch size of 8, this setup fits an effective batch size of 8 GPUs x 8 = 64 during training.
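
As a rough illustration of the hardware setup above, the snippet below shows how a 2-node x 4-GPU configuration might be expressed with a PyTorch Lightning Trainer. It is a sketch only: the actual entry point and hyperparameters live in ./CoTranBaseline_run_script_TRAIN.sh, and TranslationModule / make_dataloader are hypothetical placeholders.

    # Minimal sketch of the 2-node x 4-GPU V100 setup; not the repository's
    # actual training entry point.
    import pytorch_lightning as pl

    trainer = pl.Trainer(
        num_nodes=2,         # two compute nodes
        devices=4,           # four V100 GPUs per node
        accelerator="gpu",
        strategy="ddp",      # distributed data parallel across the 8 GPUs
        precision=16,        # mixed precision helps fit the model in 32 GB
        max_epochs=10,       # placeholder; the real value is set in the run script
    )

    # With a per-GPU batch size of 8, the effective batch size is 2 * 4 * 8 = 64.
    # trainer.fit(TranslationModule(), make_dataloader(batch_size=8))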

  1. Set the appropriate hyperparameters, the number of compute nodes, the number of GPUs, the batch size, the source language, the target language, etc. in the shell script ./CoTranBaseline_run_script_TRAIN.sh.
  2. Run ./CoTranBaseline_run_script_TRAIN.sh to train the CoTran (baseline) Java-to-Python (J2P) and Python-to-Java (P2J) LLMs.
  3. Run ./CoTranBaseline_run_script_TEST.sh to evaluate the models, after editing the working-directory folder in the script.
  4. The working directory will contain the best J2P and P2J models, along with the corresponding tokenizer folders.
  5. Move the best J2P and P2J models to ./FINETUNED_MODELS/java2python/bestModel.ckpt and ./FINETUNED_MODELS/python2java/bestModel.ckpt, respectively.
  6. Move the respective tokenizers to ./FINETUNED_MODELS/java2python/tokenizer and ./FINETUNED_MODELS/python2java/tokenizer, respectively (a small sketch of steps 5-6 is given after this list).
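
Steps 5-6 can be scripted, for instance, as follows. This is a hypothetical helper: the working-directory layout (best.ckpt, tokenizer) is an assumption about the training output, not a guarantee of the run scripts, and should be adapted accordingly.

    # Hypothetical helper for steps 5-6; adjust "best.ckpt" / "tokenizer" to
    # the names the run scripts actually produce in the working directory.
    import shutil
    from pathlib import Path

    def install_best_model(workdir: str, direction: str) -> None:
        """Copy the best checkpoint and tokenizer into ./FINETUNED_MODELS/<direction>/."""
        src = Path(workdir)
        dst = Path("FINETUNED_MODELS") / direction
        dst.mkdir(parents=True, exist_ok=True)
        shutil.copy(src / "best.ckpt", dst / "bestModel.ckpt")
        shutil.copytree(src / "tokenizer", dst / "tokenizer", dirs_exist_ok=True)

    install_best_model("workdir_j2p", "java2python")
    install_best_model("workdir_p2j", "python2java")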

Training & Evaluating the CoTran+CF and CoTran+CF+SF Models

  1. To train CoTran + CF (RL only), run ./language_translation_RLonly_CF_runSCRIPT.sh.
  2. To train CoTran + CF (RL + SFT interleaved training), run ./language_translation_RLSFT_CF_runSCRIPT.sh.
  3. To train CoTran + CF + SF (RL + SFT interleaved training), i.e. back-to-back LLMs, run ./language_translation_RLSFT_CFSF_runSCRIPT.sh.
  4. To train CoTran + CF + SF (RL only), edit ./language_translation_RLSFT_CFSF.py to comment out the SFT function call, and then run ./language_translation_RLSFT_CFSF_runSCRIPT.sh.

Each of these writes its models to the working directory, and training can be monitored via Weights & Biases (WandB).
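
As a conceptual, hedged sketch of what "RL + SFT interleaved training" means here (the real logic lives in the language_translation_RLSFT_*.py scripts), the loop below alternates a reward-driven update with a supervised step on the parallel data. All functions are hypothetical stand-ins.

    # Conceptual sketch of RL + SFT interleaving; not the repository's code.

    def rl_step(model, batch, reward_fn):
        """One policy update: sample translations, score them with the
        compiler/test reward, and push the model towards high-reward outputs."""
        translations = [model(src) for src in batch["source"]]
        rewards = [reward_fn(t) for t in translations]
        return sum(rewards) / len(rewards)  # stand-in for the actual update

    def sft_step(model, batch):
        """One supervised fine-tuning step on ground-truth (source, target) pairs."""
        return 0.0  # stand-in for the cross-entropy loss

    def train_interleaved(model, batches, reward_fn, sft_every=1):
        for i, batch in enumerate(batches):
            avg_reward = rl_step(model, batch, reward_fn)
            if sft_every and i % sft_every == 0:
                sft_step(model, batch)  # skip/comment this out for "RL only"
            print(f"step {i}: avg reward {avg_reward:.3f}")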

Pre-computed Results

  1. For the J2P and P2J translation results obtained by three human-written transpilers (java2python, TSS Code Converter, py2java), refer to the folder ./transpilers/. This folder also contains the scripts to run these transpilers on the AVATAR-TC dataset.
  2. For the J2P and P2J translation results obtained by the recent state-of-the-art methods, refer to the folder ./SoTA-results/. It contains the result files and metric-calculation logs for (a) three SoTA LLM-based unsupervised translation tools, i.e. TransCoder, TransCoder-DOBF, and TransCoder-ST (trained on function pairs from 2.5M open-source repositories of the GitHub dataset from Google BigQuery Public Datasets), (b) ChatGPT, and (c) seven LLM-based supervised translation tools, i.e. CodeBERT, GraphCodeBERT, CodeGPT, CodeGPT-adapted, PLBART-base, CodeT5-base, and PPOCoder.
  3. For the J2P and P2J translation results obtained by the proposed method, CoTran, and its variants, refer to the folder ./proposed-results/. It contains the result files and metric-calculation logs for all our proposed CoTran variants (a sketch of how the headline metrics are computed appears below).
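
For orientation, the two headline metrics from the abstract reduce to simple ratios. The sketch below assumes per-example boolean outcomes are available; this is our assumption, and the result files may store the information differently.

    # Illustrative only: CompAcc and FEqAcc as percentages over per-example
    # outcomes (compiled?, matched the source's output on all test cases?).
    from typing import List

    def comp_acc(compiled: List[bool]) -> float:
        """Percentage of translations that compile."""
        return 100.0 * sum(compiled) / len(compiled)

    def feq_acc(passed_all_tests: List[bool]) -> float:
        """Percentage of translations whose output matches the source program
        on every test case."""
        return 100.0 * sum(passed_all_tests) / len(passed_all_tests)

    # Toy example:
    print(comp_acc([True, True, False, True]))       # 75.0
    print(feq_acc([True, False, False, True]))       # 50.0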

AVATAR-TC Dataset

The paper introduces a new dataset, AVATAR-TC (built on top of AVATAR), that has pairs of whole programs in Java and Python (a statically-typed and a dynamically-typed language, respectively, with different syntactic styles), each accompanied by human-written test cases (TCs). To the best of our knowledge, AVATAR-TC is the first such large-scale dataset where code compilability (syntactic correctness) is ensured and code pairs have undergone thorough testing w.r.t. human-written TCs.

We use a collection of programs written in Java and Python from five contest websites (Aizu, AtCoder, Codeforces, Google-CodeJam, LeetCode) and two coding platforms (GeeksForGeeks, ProjectEuler).

The programs are parsed into language-specific tokens using javalang (for Java) and Python's tokenize module (for Python). Additionally, we collected test cases for each of the problems by web-crawling the data sources. Any code that did not match the expected output when supplied with the test-case inputs was manually corrected for minor faults, while those with major issues were discarded. Output matching is case-insensitive, ignores whitespace, disregards punctuation (only when it is a minor portion of the output), and normalizes numeric and floating-point values to a common representation.
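
As a minimal sketch of this tokenization step (assuming the javalang package is installed; the dataset's actual preprocessing may apply further filtering or normalization), Java code can be tokenized with javalang and Python code with the standard tokenize module:

    # Minimal tokenization sketch using javalang and Python's tokenize module.
    import io
    import tokenize as py_tokenize

    import javalang  # third-party package: pip install javalang

    java_src = 'public class Main { public static void main(String[] a) { System.out.println("hi"); } }'
    python_src = 'print("hi")\n'

    # Java: javalang's lexer yields typed tokens; we keep their surface values.
    java_tokens = [tok.value for tok in javalang.tokenizer.tokenize(java_src)]

    # Python: the standard-library tokenize module works on a readline callable.
    python_tokens = [
        tok.string
        for tok in py_tokenize.generate_tokens(io.StringIO(python_src).readline)
        if tok.string.strip()
    ]

    print(java_tokens[:6])
    print(python_tokens)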

The AVATAR-TC dataset (with all Java-Python code pairs and test-cases) is available at: ./AVATAR-TC/

For each partition (train/validation/test), there is a file for the Java programs, a file for the Python programs, and a file for the problem IDs. The folder structure for the code pairs is as follows (a small loading sketch follows the listing):

./AVATAR-TC/
├── test.java-python.id
├── test.java-python.java
├── test.java-python.python
├── train.java-python.id
├── train.java-python.java
├── train.java-python.python
├── valid.java-python.id
├── valid.java-python.java
└── valid.java-python.python
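
The sketch below shows one way these parallel files could be read into (problem ID, Java, Python) triples, under the assumption that the three files of a partition are line-aligned with one example per line; if the programs are stored differently, the loader must be adapted.

    # Assumes the .id/.java/.python files of a partition are line-aligned;
    # this is an assumption about the file format, not a documented guarantee.
    from pathlib import Path

    def load_partition(root: str, split: str):
        base = Path(root) / f"{split}.java-python"
        with open(f"{base}.id") as f_id, \
             open(f"{base}.java") as f_java, \
             open(f"{base}.python") as f_py:
            for pid, java_code, py_code in zip(f_id, f_java, f_py):
                yield pid.strip(), java_code.rstrip("\n"), py_code.rstrip("\n")

    # Usage:
    for pid, java_code, py_code in load_partition("AVATAR-TC", "valid"):
        print(pid)
        break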

Additionally, there is a .json file for each sub-dataset, containing the human-written test cases (TCs), i.e. the input-output pairs for each Java-Python code pair.

./AVATAR-TC/
├── io_testcases_aizu.json
├── io_testcases_atcoder.json
├── io_testcases_codeforces.json
├── io_testcases_codejam.json
├── io_testcases_geeksforgeeks.json
├── io_testcases_leetcode.json
└── io_testcases_projecteuler.json

Citation

If you find the paper or this repository useful, please cite it with:

@inproceedings{jana2024cotran,
  title = {{CoTran: An LLM-based Code Translator using Reinforcement Learning with Feedback from Compiler and Symbolic Execution}},
  author = {Jana, Prithwish and Jha, Piyush and Ju, Haoyang and Kishore, Gautham and Mahajan, Aryan and Ganesh, Vijay},
  booktitle = {Proceedings of the 27th European Conference on Artificial Intelligence (ECAI-2024)},
  year = {2024},
  location = {Santiago de Compostela, Spain},
}