This repo is deprecated. Development has continued here: https://github.com/kubeflow/code-intelligence/tree/master/Issue-Embeddings
Motivation: Issue Label Bot predicts 3 generic issue labels: `bug`, `feature request`, and `question`. However, it would be nice to predict personalized issue labels instead of generic ones. To accomplish this, we can use the issues that are already labeled in a repository as training data for a model that predicts personalized issue labels. One challenge with this approach is that there are often only a small number of labeled issues in each repository. To mitigate this, we use transfer learning: we train a language model over 16 million GitHub Issues and fine-tune it to predict issue labels.
The manifest files in /deployment define a service that returns 2,400-dimensional embeddings given the text of an issue. The API endpoints are hosted at https://gh-issue-labeler.com/. All routes expect `POST` requests with a header containing a `Token` field. Below is a list of endpoints:
- `https://gh-issue-labeler.com/text`: expects a JSON payload with `title` and `body` fields, and returns a single 2,400-dimensional vector that represents latent features of the text. For example, this is how you would interact with this endpoint from Python (a sanity check on the decoded response is sketched after this list):

  ```python
  import requests
  import numpy as np
  from passlib.apps import custom_app_context as pwd_context

  API_ENDPOINT = 'https://gh-issue-labeler.com/text'
  API_KEY = 'YOUR_API_KEY'  # Contact the maintainers to get this

  # A toy example of a GitHub Issue title and body
  data = {'title': 'Fix the issue',
          'body': 'I am encountering an error\n when trying to push the button.'}

  # Send the POST request and save the response
  r = requests.post(url=API_ENDPOINT,
                    headers={'Token': pwd_context.hash(API_KEY)},
                    json=data)

  # Convert the returned bytes back into a numpy array
  embeddings = np.frombuffer(r.content, dtype='<f4')
  ```
- `https://gh-issue-labeler.com/all_issues/<owner>/<repo>` 🚧: will return a numpy array of shape (# of labeled issues in repo, 2400), as well as a list of the labels for each issue. This endpoint is still under construction.
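As a quick sanity check on the `/text` response (a sketch, continuing from the example above), the decoded buffer should contain exactly 2,400 float32 values:

```python
# Continuing from the /text example: the response body is the raw bytes
# of 2,400 little-endian float32 values (2400 * 4 = 9,600 bytes).
assert r.status_code == 200
assert len(r.content) == 2400 * 4
assert embeddings.shape == (2400,)
```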
The language model is built with the fastai library. The /notebooks folder contains a tutorial of the steps needed to build the language model:
- 01_AcquireData.ipynb: Describes how to acquire and pre-process the data using mdparse, which parses and annotates markdown files.
- 02_fastai_DataBunch.ipynb: The fastai library wraps PyTorch's DataLoader class in an object called a `DataBunch` to encapsulate additional metadata and functionality. This notebook walks through the steps of preparing this data structure, which the model uses for training.
- 03_Create_Model.ipynb: Walks through the process of instantiating the fastai language model, along with callbacks for early stopping, logging, and saving of artifacts. Additionally, this notebook illustrates how to train the model.
- 04_Inference.ipynb: Shows how to use the language model to perform inference, extracting latent features in the form of a 2,400-dimensional vector from GitHub Issue text. This notebook shows how to load the DataBunch and model and save only the model for inference. /flask_app/inference.py contains utilities that make the inference process easier. A minimal end-to-end sketch of these steps appears below.
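For orientation, here is a minimal sketch of the pipeline these notebooks cover, using the fastai v1 text API. The CSV paths, column name, and hyper-parameter values are placeholders, and the exact pooling used to produce the 2,400-dimensional vectors lives in /flask_app/inference.py:

```python
import pandas as pd
from fastai.text import TextLMDataBunch, language_model_learner, AWD_LSTM
from fastai.basic_train import load_learner
from fastai.callbacks import EarlyStoppingCallback, SaveModelCallback

# 02: wrap the pre-processed issue text in a DataBunch
# (the csv paths and 'text' column name are placeholders).
train_df = pd.read_csv('train_issues.csv')
valid_df = pd.read_csv('valid_issues.csv')
data_lm = TextLMDataBunch.from_df('data', train_df, valid_df, text_cols='text')

# 03: instantiate an AWD-LSTM language model, with callbacks for
# early stopping and checkpointing the best weights.
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)
learn.fit_one_cycle(
    10, 1e-2,
    callbacks=[EarlyStoppingCallback(learn, monitor='valid_loss', patience=2),
               SaveModelCallback(learn, every='improvement', name='best')])

# 04: export only the model (no data) for inference, then reload it.
learn.export('trained_model.pkl')
inference_learn = load_learner('data', 'trained_model.pkl')
```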
The hyperparam_sweep folder contains lm_tune.py, a script used to train the model. Most importantly, we use this script in conjunction with hyper-parameter sweeps in Weights & Biases. We were able to try 538 different hyper-parameter combinations, using Bayesian and random grid search concurrently, to choose the best model.
The hyperparameter tuning process is described in greater detail in the hyperparam_sweep folder.
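For context, a Weights & Biases sweep is driven by a configuration like the following sketch; the actual search space, metric names, and parameter ranges used for the 538 runs are defined in the hyperparam_sweep folder, and the ones below are placeholders:

```python
import wandb

# Sketch of a W&B sweep configuration; the real search space is in
# /hyperparam_sweep. 'bayes' and 'random' are the search methods
# mentioned above.
sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'valid_loss', 'goal': 'minimize'},
    'parameters': {
        'drop_mult': {'min': 0.3, 'max': 0.9},  # placeholder hyper-parameters
        'lr': {'min': 1e-4, 'max': 1e-2},
    },
}

sweep_id = wandb.sweep(sweep_config, project='issues_lang_model')

# Each agent pulls hyper-parameter combinations from the sweep server
# and calls the training entry point (lm_tune.py in this repo), e.g.:
# wandb.agent(sweep_id, function=train_fn)
```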
- /notebooks: contains notebooks on how to gather and clean the data and train the language model.
- /hyperparam_sweep: this folder contains instructions on doing a hyper-parameter sweep with Weights & Biases.
- /flask_app: code for the Flask app that serves the API and listens for POST requests.
- /script: this directory contains the entry point for running the REST API server that end users will interface with.
- /deployment: This directory contains files that are helpful in deploying the app.
- Dockerfile: the definition of the container used to run the flask app. The build for this container is hosted on DockerHub at hamelsmu/issuefeatures-api-cpu.
- *.yaml: these files relate to a Kubernetes deployment.
Trained model artifacts are available for download (a loading sketch follows this list):

- model for inference (965 MB): https://storage.googleapis.com/issue_label_bot/model/lang_model/models_22zkdqlr/trained_model_22zkdqlr.pkl
- encoder (for fine-tuning w/ a classifier) (965 MB): https://storage.googleapis.com/issue_label_bot/model/lang_model/models_22zkdqlr/trained_model_encoder_22zkdqlr.pth
- fastai.databunch (27.1 GB): https://storage.googleapis.com/issue_label_bot/model/lang_model/data_save.pkl
- checkpointed model (2.29 GB): https://storage.googleapis.com/issue_label_bot/model/lang_model/models_22zkdqlr/best_22zkdqlr.pth
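For example, the inference model can be downloaded and loaded like this (a sketch; it assumes the .pkl was produced by fastai's `Learner.export`, and the local paths are placeholders):

```python
from pathlib import Path
import urllib.request
from fastai.basic_train import load_learner

# Download the inference model (~965 MB); the local directory is a placeholder.
url = ('https://storage.googleapis.com/issue_label_bot/model/lang_model/'
       'models_22zkdqlr/trained_model_22zkdqlr.pkl')
model_dir = Path('model')
model_dir.mkdir(exist_ok=True)
urllib.request.urlretrieve(url, model_dir / 'trained_model_22zkdqlr.pkl')

# Load the exported Learner (assumes the pickle was created with
# Learner.export); see /flask_app/inference.py for the utilities the
# deployed API actually uses.
learn = load_learner(model_dir, 'trained_model_22zkdqlr.pkl')
```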
The Weights & Biases run corresponding to these artifacts: https://app.wandb.ai/github/issues_lang_model/runs/22zkdqlr/overview