Fine-tuning a GPT-2 model for code generation/completion. This work is part of the Advanced Engineering School (AES) at Innopolis University. The data and model are not open source; the prototyping is for learning purposes.

Training a GPT-2 Model for Code Generation

This Python script trains a GPT-2 language model for code generation. It uses Hugging Face's Transformers library for the GPT-2 model and tokenizer, and the Datasets library for handling the dataset.

Steps:

  • The script starts by training a ByteLevelBPETokenizer on the provided text data file and saving the tokenizer model to disk. CUDA device order and visible devices are set according to your environment configuration.
  • The trained tokenizer is then loaded into a GPT2Tokenizer, and the special tokens are added.
  • A GPT-2 model is initialized with the GPT2Config class, using the vocabulary size and special tokens from the tokenizer.
  • The dataset is loaded from the provided paths and tokenized.
  • The data collator is a DataCollatorForLanguageModeling from the Transformers library with masked language modeling disabled (mlm=False), which gives the causal language-modeling objective GPT-2 is trained with.
  • Training arguments are set with the TrainingArguments class from Transformers, including the output directory, number of epochs, batch size, save steps, etc.
  • Finally, a Trainer is initialized with the model, training arguments, data collator, and dataset, and training can be launched. A sketch of the full pipeline is shown below.
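
A minimal sketch of this pipeline. The file paths, special tokens, vocabulary size, output directory, and hyperparameters (train.txt, 52,000 tokens, gpt2-code, batch size 8, etc.) are illustrative assumptions, not the exact values used in the original script:

```python
import os

from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer
from transformers import (
    DataCollatorForLanguageModeling,
    GPT2Config,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    Trainer,
    TrainingArguments,
)

# Set CUDA device order and visible devices for your environment.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# 1. Train a byte-level BPE tokenizer on the raw code corpus and save it.
bpe = ByteLevelBPETokenizer()
bpe.train(
    files=["train.txt"],
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
os.makedirs("tokenizer", exist_ok=True)
bpe.save_model("tokenizer")

# 2. Load the trained tokenizer into GPT2Tokenizer and register special tokens.
tokenizer = GPT2Tokenizer.from_pretrained("tokenizer")
tokenizer.add_special_tokens({
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
    "mask_token": "<mask>",
})

# 3. Initialize a GPT-2 model from scratch using the tokenizer's vocabulary.
config = GPT2Config(
    vocab_size=len(tokenizer),
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
model = GPT2LMHeadModel(config)

# 4. Load the text dataset and tokenize it.
dataset = load_dataset("text", data_files={"train": "train.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# 5. Causal LM data collator (mlm=False) pads batches and builds labels.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# 6. Training arguments and Trainer.
args = TrainingArguments(
    output_dir="gpt2-code",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    save_steps=1_000,
)
trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=tokenized["train"],
)
trainer.train()
```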

Note

This script uses PyTorch through the Transformers library. You need to have a compatible CUDA version installed if you wish to train the model on a GPU.

Check for an NVIDIA GPU and driver:

nvidia-smi

My version:

+-------------------+
|CUDA Version: 12.0 |
+-------------------+
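
To confirm that PyTorch itself can see the GPU before launching training, a quick check (a minimal snippet, assuming PyTorch is installed):

```python
import torch

# True if PyTorch was built with CUDA support and a compatible driver is present
print(torch.cuda.is_available())
# Number of GPUs visible to PyTorch (respects CUDA_VISIBLE_DEVICES)
print(torch.cuda.device_count())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```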

Example of input

[image: example prompt given to the model]

Output for code suggestion/completion

[image: completion suggested by the model]

Interpretations

The newline token in the generated output signifies a line break. Given an import statement as a prompt, the model suggested the most viable imports it learned during training.
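
A minimal sketch of how such a suggestion can be generated from the fine-tuned model; the checkpoint path gpt2-code, the prompt, and the sampling parameters are illustrative assumptions:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# "gpt2-code" is a hypothetical path to a saved fine-tuned checkpoint.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-code")
model = GPT2LMHeadModel.from_pretrained("gpt2-code")

prompt = "import numpy as np\nimport"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```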

Key note

This model serves as a proof of concept. It was trained in multiple runs from checkpoints, on multiple GPUs, and was evaluated periodically before being scaled up; the same approach is advised. Fine-tuning GPT-2 itself was chosen in order to prototype with a smaller model, but as the examples show, the results are good and the model improved over the course of training.
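
Resuming from checkpoints is supported directly by the Trainer API; a small sketch, assuming the hypothetical output directory gpt2-code from above:

```python
# Continue training from the latest checkpoint written to output_dir
# (checkpoint frequency is controlled by save_steps in TrainingArguments).
trainer.train(resume_from_checkpoint=True)

# Or resume from a specific checkpoint directory (hypothetical path):
# trainer.train(resume_from_checkpoint="gpt2-code/checkpoint-5000")
```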

Data and model

The dataset was provided and scraped by the Advanced Engineering School (AES) of Innopolis University, with which the project is associated. Hence, the data and the model/checkpoints are not open source. One can still scrape data from GitHub to prototype for learning purposes; the data.py file is provided as an example of how to do so.
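
As an illustration only (this is not the repository's data.py), a small sketch of scraping Python files from a public GitHub repository via the REST API; the target repository and output file are assumptions, and unauthenticated requests are rate limited:

```python
import requests

def fetch_python_files(owner: str, repo: str, path: str = "") -> list[str]:
    """Recursively collect the contents of .py files from a public repository."""
    url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
    items = requests.get(url, timeout=30).json()
    sources = []
    for item in items:
        if item["type"] == "dir":
            sources.extend(fetch_python_files(owner, repo, item["path"]))
        elif item["type"] == "file" and item["name"].endswith(".py"):
            sources.append(requests.get(item["download_url"], timeout=30).text)
    return sources

# Example: write the scraped code into a single training text file.
if __name__ == "__main__":
    files = fetch_python_files("psf", "requests")  # hypothetical target repository
    with open("train.txt", "w", encoding="utf-8") as f:
        f.write("\n\n".join(files))
```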
