Type migration with large language models for code. Migrates JavaScript to TypeScript by predicting type annotations and generating type definitions.
This is the code repository for the dissertation *Predicting TypeScript Type Annotations and Definitions With Machine Learning*, specifically Chapter 5.
The training dataset is on Hugging Face. Parts of the code may refer to it as `ts-training-get4`. This is a preprocessed version of `ts-training`, revision `v1.1p1`.
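As a minimal sketch, the dataset can be loaded with the `datasets` library. The `nuprl/ts-training` repository id is an assumption (it is not spelled out here); the revision `v1.1p1` is the one named above, and gated datasets require `huggingface-cli login` first:

```python
# Minimal sketch: load the training dataset with Hugging Face `datasets`.
# Assumes the dataset is hosted at nuprl/ts-training (an assumption; the
# repository id is not stated in this README) and that you are logged in.
from datasets import load_dataset

dataset = load_dataset("nuprl/ts-training", revision="v1.1p1", split="train")
print(dataset)
```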
The final StenoType model is on Hugging Face. You will need to accept the agreement to access the model. The code and results may refer to this model as `stenotype-7b-a6d445d-ckpt1000`, as it was fine-tuned based on commit `a6d445d`.
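Once the agreement is accepted, the model can be loaded with `transformers`; a sketch is below. The `nuprl/stenotype` repository id matches the clone command in the setup steps, but loading in bfloat16 with `device_map="auto"` is an assumption about your hardware:

```python
# Minimal sketch: load StenoType with Hugging Face `transformers`.
# Requires accepting the model agreement and `huggingface-cli login`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nuprl/stenotype")
model = AutoModelForCausalLM.from_pretrained(
    "nuprl/stenotype",
    torch_dtype=torch.bfloat16,  # assumption: GPU with bfloat16 support
    device_map="auto",           # assumption: accelerate is installed
)
```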
There are two evaluation datasets: `stenotype-eval-ts` (also called `stenotype-eval-dataset-subset` in the code and TS-Sourced in the dissertation) and `stenotype-eval-js` (also called `typeweaver-bundle-filtered-subset` in the code and JS-Sourced in the dissertation). To type check the `stenotype-eval-js` dataset, you will also need to download the tarball from Hugging Face.
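As a sketch, the tarball can be unpacked with Python's standard `tarfile` module. The filename and destination directory below are placeholders, not names from this repository:

```python
# Minimal sketch: unpack the downloaded tarball needed to type check the
# JS-Sourced dataset. The filename and destination are placeholders; use
# the actual tarball name from Hugging Face and a directory of your choice.
import tarfile

with tarfile.open("stenotype-eval-js.tar.gz") as tar:
    tar.extractall(path="datasets")
```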
Figures and result summaries are in the `results/` directory. Full results are on Hugging Face.
- Clone the repository:

  ```bash
  git clone git@github.com:nuprl/StenoType.git
  cd StenoType
  git submodule update --init --recursive
  ```
- Follow the instructions to set up Miniconda.
- Create a conda environment with Python 3.11 and install dependencies:

  ```bash
  conda create -n gpu python=3.11
  conda activate gpu
  pip install -r requirements.txt
  conda install -c conda-forge nodejs=20.8.1
  npm install -g --no-save typescript@5.2.2
  ```
- Download the StarCoderBase-7b and StenoType models:

  a. Ensure that you have a Hugging Face account.
  b. Accept the agreements for StarCoderBase-7b and StenoType.
  c. On the command line, log into Hugging Face with `huggingface-cli login`.
  d. In a directory of your choosing, e.g. `../models`, run
     `git clone git@hf.co:bigcode/starcoderbase-7b` and
     `git clone git@hf.co:nuprl/stenotype`.
  e. To save space, you can delete the `.git` directory (and possibly
     `pytorch_model*.bin` if `model*.safetensors` already exists).
- Accept the agreement for the `ts-eval` evaluation dataset.
- Now you can run the experiments:

  ```bash
  # See what configurations can be run:
  python src/main.py --show_configs

  # To run inference on config 0 (this is very slow):
  python src/main.py --infer 0

  # To evaluate (this is CPU-bound):
  python src/main.py --evaluate

  # To generate dataset-level summaries (this is pretty fast):
  python src/main.py --summarize
  ```
- To browse the results, you can use the viewer (or read the parquet files directly; see the sketch below). Type "help" for help.

  ```bash
  python src/viewer.py --dataset path/to/results/dataset.parquet
  ```
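If you prefer to inspect the results programmatically, the parquet files can also be loaded with pandas. A minimal sketch: the path below is a placeholder, as in the viewer command above, and the column names depend on the results schema:

```python
# Minimal sketch: inspect a results dataset with pandas instead of the
# interactive viewer. The path is a placeholder, as in the viewer command.
import pandas as pd

df = pd.read_parquet("path/to/results/dataset.parquet")
print(df.columns)
print(df.head())
```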
- git
- Python 3 (using Conda or a virtual environment is recommended)