This anonymous repository was created for blind review and is no longer active. Please visit https://github.com/csebuetnlp/CoDesc for the updated code and dataset.
A large dataset of 4.2M Java source code and description pairs, collected from code search and code summarization studies.
This is the public release of the code and data for our paper titled "CoDesc: Large Code-Description Parallel Dataset", submitted to ACL 2021.
# clone this repository
git clone https://github.com/code-desc/CoDesc.git
# change permission of scripts
sudo chmod -R +x CoDesc
cd CoDesc/
# setup
./Setup/setup.sh
CoDesc is a large, noise-removed parallel dataset of source code and corresponding natural language descriptions. It is compiled from several similar but noisy datasets, including CodeSearchNet, FunCom, DeepCom, and CONCODE. We have developed and released the noise removal and preprocessing source code along with the dataset. We also demonstrate the usefulness of the CoDesc dataset in two popular tasks: natural language code search and source code summarization.
After the initial setup described in Quickstart, our dataset is downloaded to the data/ folder, along with preprocessed data for the code search and code summarization tasks. We also provide the source datasets here. The following are the links to and descriptions of the dataset and preprocessed data.
- CoDesc: This file contains our 4.2M dataset. The details of this dataset are given in our paper as well as on the Dataset Description page.
- Original_data: This file contains the source data from which we collected and preprocessed our 4.2M dataset.
- CSN_preprocessed_data: This file contains the preprocessed data for the CodeSearchNet challenge. Here, the test and validation sets are the preprocessed datapoints from the original CodeSearchNet test and validation sets.
- CSN_preprocessed_data_balanced_partition: This file contains the preprocessed data for the CodeSearchNet networks. Here, the train, test, and validation sets come from the balanced partition described in our paper.
- NCS_preprocessed_data: This file contains the preprocessed data for neural code summarization networks.
- BPE_Tokenized_NCS_preprocessed_data: This file contains the preprocessed data for neural code summarization networks with BPE tokenization.
We have created a forked repository of Transcoder that facilitates parallel translation of source code and speeds up the process by a factor of 16. Instructions for using Transcoder can be found in that repository. The original work is published under the title "Unsupervised Translation of Programming Languages".
As mentioned above, the original data from the source datasets is provided in the data/original_data/ folder. To create the 4.2M CoDesc dataset from the original data, use the following command.
python Dataset_Preparation/Merge_Datasets.py
The following command preprocesses the CoDesc dataset for the CodeSearchNet challenge. It also preprocesses the CodeSearchNet validation and test sets using the filters defined in our paper.
python Dataset_Preparation/Preprocess_CSN.py
To create a balanced train-valid-test split for the CodeSearchNet networks, use the following command.
python Dataset_Preparation/Preprocess_CSN_Balanced_Partition.py
The following command preprocesses the CoDesc dataset for the NeuralCodeSum networks.
python Dataset_Preparation/Preprocess_NCS.py
To train the BPE tokenizer and create the tokenized files, use the following command.
python Tokenizer/huggingface_bpe.py
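The preparation commands above can also be chained from one script. The following is only a minimal sketch, not part of the released code: it assumes the scripts are run from the repository root in the order they are listed above, and the names STEPS, missing_scripts, and run_pipeline are hypothetical helpers introduced for illustration.

```python
import subprocess
import sys
from pathlib import Path

# Script paths copied from the commands above, in the order they are listed.
# Running them in this order is an assumption, not a documented requirement.
STEPS = [
    "Dataset_Preparation/Merge_Datasets.py",
    "Dataset_Preparation/Preprocess_CSN.py",
    "Dataset_Preparation/Preprocess_CSN_Balanced_Partition.py",
    "Dataset_Preparation/Preprocess_NCS.py",
    "Tokenizer/huggingface_bpe.py",
]


def missing_scripts(repo_root="."):
    """Return the scripts from STEPS that are not present under repo_root."""
    root = Path(repo_root)
    return [step for step in STEPS if not (root / step).is_file()]


def run_pipeline(repo_root=".", run=subprocess.run):
    """Run every step in order, stopping at the first failure."""
    missing = missing_scripts(repo_root)
    if missing:
        raise FileNotFoundError(f"Scripts not found: {missing}")
    for step in STEPS:
        # check=True raises CalledProcessError if a step exits with an error.
        run([sys.executable, step], cwd=repo_root, check=True)
    return list(STEPS)
```

Calling run_pipeline() from the CoDesc/ folder would run all five steps; the run parameter exists only so the function can be exercised without launching the real scripts.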
The tokenizers for source code and natural language descriptions are provided in the Tokenizer/ directory. To use the tokenizers in Python, import the code_filter and nl_filter functions from Tokenizer/CodePreprocess_final.py and Tokenizer/NLPreprocess_final.py. In addition, two JSON files named code_filter_flag.json and nl_filter_flag.json, which contain the options for preprocessing code and description data, must be present in the working directory. These two files must follow the formats given in the Tokenizer/ folder. The flag options are also briefly described in those JSON files.
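Because the two flag files must be in the working directory, it can help to check for them before importing the tokenizers. The sketch below is a hedged illustration only: load_filter_flags is a hypothetical helper, not part of the repository, and it assumes nothing about the keys inside the JSON files.

```python
import json
from pathlib import Path

# File names taken from the paragraph above.
FLAG_FILES = ("code_filter_flag.json", "nl_filter_flag.json")


def load_filter_flags(workdir="."):
    """Load both flag files from workdir, failing early if one is missing."""
    workdir = Path(workdir)
    missing = [name for name in FLAG_FILES if not (workdir / name).is_file()]
    if missing:
        raise FileNotFoundError(
            f"Missing flag file(s) in {workdir.resolve()}: {', '.join(missing)}"
        )
    flags = {}
    for name in FLAG_FILES:
        with open(workdir / name, encoding="utf-8") as handle:
            flags[name] = json.load(handle)
    return flags


# Once the check passes, the filters can be imported. The import form below
# is an assumption: it only works if Tokenizer/ is on sys.path.
# from CodePreprocess_final import code_filter
# from NLPreprocess_final import nl_filter
```

The call signatures of code_filter and nl_filter are not described here, so the sketch stops at the import; consult the two source files for the expected arguments.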
The code for BPE tokenization is given in Tokenizer/huggingface_bpe.py.
During the initial setup described in Quickstart, a forked version of CodeSearchNet is cloned into the working directory, and the preprocessed CoDesc data is copied to the CodeSearchNet/resources/data/ directory. To use the preprocessed dataset of the balanced partition instead, clear that folder and copy the contents of data/csn_preprocessed_data_balanced_partition/ into it.
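The clear-and-copy step for the balanced partition could be scripted roughly as follows. This is a sketch under assumptions: replace_preprocessed_data is a hypothetical helper, and it deletes everything in the destination folder, so back up anything you want to keep first.

```python
import shutil
from pathlib import Path


def replace_preprocessed_data(src, dst):
    """Clear dst, then copy everything inside src into it."""
    src, dst = Path(src), Path(dst)
    if not src.is_dir():
        raise NotADirectoryError(f"Source folder not found: {src}")
    # Remove the old folder with all of its contents, then recreate it empty.
    if dst.exists():
        shutil.rmtree(dst)
    dst.mkdir(parents=True)
    for item in src.iterdir():
        target = dst / item.name
        if item.is_dir():
            shutil.copytree(item, target)
        else:
            shutil.copy2(item, target)
    # Return the names now present in dst so the caller can verify the copy.
    return sorted(p.name for p in dst.iterdir())
```

Run from the CoDesc/ folder, the call would be replace_preprocessed_data("data/csn_preprocessed_data_balanced_partition", "CodeSearchNet/resources/data").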
Then the following commands will train and test code search networks:
cd CodeSearchNet/
script/console
wandb login
python train.py --model neuralbowmodel --run-name nbow_CoDesc
python train.py --model rnnmodel --run-name rnn_CoDesc
python train.py --model selfattentionmodel --run-name attn_CoDesc
python train.py --model convolutionalmodel --run-name conv_CoDesc
python train.py --model convselfattentionmodel --run-name convattn_CoDesc
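The five training runs above differ only in the model name and the run name, so they can be generated from a small table. The following is a minimal sketch, assuming it is launched from the same shell and directory in which the train.py commands above are run; RUNS, build_train_command, and train_all are hypothetical names, not part of CodeSearchNet.

```python
import subprocess

# Model names and run names copied from the commands above.
RUNS = [
    ("neuralbowmodel", "nbow_CoDesc"),
    ("rnnmodel", "rnn_CoDesc"),
    ("selfattentionmodel", "attn_CoDesc"),
    ("convolutionalmodel", "conv_CoDesc"),
    ("convselfattentionmodel", "convattn_CoDesc"),
]


def build_train_command(model, run_name):
    """Build the argument list for one training run."""
    return ["python", "train.py", "--model", model, "--run-name", run_name]


def train_all(runs=RUNS, dry_run=True):
    """Print the commands (dry run) or execute them one after another."""
    commands = [build_train_command(model, name) for model, name in runs]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops the sequence if a training run fails.
            subprocess.run(command, check=True)
    return commands
```

With the default dry_run=True the function only prints the commands; pass dry_run=False to actually start training.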
We used the original NeuralCodeSum implementation for code summarization. Please refer to this guide for instructions on how to train the code summarization network.
The code, datasets, and models from CodeSearchNet and NeuralCodeSum are used under the licenses provided in their respective repositories.
Our code, dataset, and preprocessed data are released under the MIT license.