In this project, we will apply the skills acquired in the Deploying a Scalable ML Pipeline in Production course to develop a classification model on publicly available Census Bureau data.
We will create unit tests to monitor the model performance on various slices of the data. Then, we will deploy the final model using the FastAPI package and create API tests. Both the slice-validation and the API tests will be incorporated into a CI/CD framework using GitHub Actions. DVC will be used to manage the project dependencies (model, data, and so on) and allow a reproducible pipeline.
Create a conda environment with `environment.yml`:
conda env create --file environment.yml
To remove an environment in your terminal window run:
conda remove --name myenv --all
To list all available environments run:
conda env list
We will use DVC to manage and version the data processes that produce our final artifact. This mechanism allows you to organize the project better and reproduce your workflow/pipeline and results more quickly. The following stages are considered: a) data, b) eda, c) preprocess, d) check data, e) segregate, f) train, g) evaluate, and h) check model.
It is possible to generate a big picture of the data pipeline using DVC. After all stages have been created, run:
dvc dag --outs
which produces the following output:
An interactive dag can be visualized from dagshub: https://dagshub.com/ivanovitchm/mlops_nd_c3
It is assumed the data has already been fetched and stored at `pipeline/data`.
Before starting the next pipeline stage (EDA), clone the project (git clone) or create a new git repository (git init). Right after, run:
dvc init
and track and version the file `census.csv` using:
dvc add pipeline/data/census.csv
git add pipeline/data/.gitignore pipeline/data/census.csv.dvc
It is possible to track data remotely with DVC. In this project we will use an S3 bucket as the remote storage. Some additional steps are necessary to set up the CLI environment:
- Install the AWS CLI tool.
- Sign in to AWS using your user and password.
- From the Services drop-down select `S3` and then click `Create bucket`.
- Give your bucket a name; the rest of the options can remain at their defaults.
- To use your new `S3 bucket` from the AWS CLI you will need to create an `IAM user` with the appropriate permissions.
- Sign in to the IAM console, or reach it from the Services drop-down on the upper navigation bar.
- In the left navigation bar select `Users`, then choose `Add user`.
- Give the user a name and select `Programmatic access`.
- In the permissions selector, search for S3 and give it `AmazonS3FullAccess`. `Tags` are optional and can be skipped.
- After reviewing your choices, click `Create user`.
- Configure your AWS CLI to use the Access key ID and Secret Access key:
aws configure
Right after this setup, add the S3 remote bucket using:
dvc remote add name_of_remote_repo s3://name_of_s3_bucket
In our case, the configuration was:
dvc remote add s3remote s3://incomes3
To visualize the configuration run:
dvc remote list
Then push the tracked files using:
dvc push --remote s3remote
For now, the artifacts generated by EDA are not tracked; EDA is only used to understand the big picture of the problem.
To create pipelines, use `dvc run` to create stages. In each stage you can define dependencies, parameters, inputs, outputs, and specify the command that is run. In order to create the preprocess stage, run:
dvc run -n preprocess \
-d pipeline/preprocess/run.py \
-d pipeline/data/census.csv \
-o pipeline/data/preprocessing_data.csv \
python pipeline/preprocess/run.py --input_artifact_name pipeline/data/census.csv \
--output_artifact_name pipeline/data/preprocessing_data.csv
To track the changes with git, run:
git add dvc.yaml pipeline/data/.gitignore dvc.lock
If everything runs successfully, new DVC files and a new artifact are generated in the repository. `dvc.lock` and `dvc.yaml` are used to manage the pipeline, whereas the clean data, `preprocessing_data.csv`, is placed at `pipeline/data`.
Now that the data and pipeline are up to date, it is time to upload the local files to the remote repository. Please run:
dvc push --remote s3remote
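For illustration, a minimal sketch of what `pipeline/preprocess/run.py` could look like is shown below. The argument names follow the `dvc run` command above, but the actual cleaning steps in the repository are assumptions.

```python
# Hypothetical sketch of pipeline/preprocess/run.py; the real cleaning logic may differ.
import argparse

import pandas as pd


def go(args):
    # skipinitialspace removes the leading blanks present in the raw census file
    df = pd.read_csv(args.input_artifact_name, skipinitialspace=True)

    # Assumed basic cleaning: drop duplicates and strip whitespace from text columns
    df = df.drop_duplicates().reset_index(drop=True)
    text_columns = df.select_dtypes(include="object").columns
    df[text_columns] = df[text_columns].apply(lambda col: col.str.strip())

    df.to_csv(args.output_artifact_name, index=False)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Preprocess the census data")
    parser.add_argument("--input_artifact_name", type=str, required=True)
    parser.add_argument("--output_artifact_name", type=str, required=True)
    go(parser.parse_args())
```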
In this stage of the pipeline we will apply deterministic and non-deterministic tests to `preprocessing_data.csv`.
For most of the non-deterministic tests you need a reference dataset and a dataset to be tested. This is useful when retraining, to make sure that the new training dataset has a similar distribution to the original dataset, and therefore the method that was used originally is expected to work well.
We will use the Kolmogorov-Smirnov test for goodness of fit. Remember that the 2 sample KS test is used to test whether two vectors come from the same distribution (null hypothesis), or from two different distributions (alternative hypothesis), and it is non-parametric.
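As an illustration, a check of this kind could be written with scipy roughly as below; the column list and the fixture names are assumptions, not necessarily those used in `pipeline/check_data/test_data.py`.

```python
# Illustrative two-sample KS check; column list and fixture names are assumptions.
import scipy.stats


def test_kolmogorov_smirnov(data, ks_alpha):
    reference, sample = data  # reference dataset and dataset under test

    numerical_columns = ["age", "fnlwgt", "education-num", "hours-per-week"]

    # Bonferroni correction, since we run one KS test per column
    alpha_prime = 1 - (1 - ks_alpha) ** (1 / len(numerical_columns))

    for col in numerical_columns:
        _, p_value = scipy.stats.ks_2samp(reference[col], sample[col])
        # A p-value below alpha_prime rejects the null hypothesis that the
        # two samples come from the same distribution
        assert p_value > alpha_prime
```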
All tests will be implemented using `pytest`. An important aspect when using `pytest` is understanding how a fixture's `scope` works. The `scope` of a fixture can take a few legal values, described here. We are going to consider only `session`, but it is also possible to use `function`. In the former, the fixture is executed only once in a pytest session and the value it returns is used for all the tests that need it; with the latter, every test function gets a fresh copy of the data. This is useful if the tests modify the input in a way that makes other tests fail, for example. Let's see this more closely by running this DVC pipeline stage (a sketch of the corresponding conftest.py is given after the commands below):
dvc run -n datacheck \
-p data.reference_dataset,data.ks_alpha \
-d pipeline/check_data/conftest.py \
-d pipeline/check_data/test_data.py \
-d pipeline/data/preprocessing_data.csv \
pytest pipeline/check_data -s -vv --sample_artifact pipeline/data/preprocessing_data.csv \
--param params.yaml
git add dvc.yaml dvc.lock
To update the remote repository, please run:
dvc push --remote s3remote
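For reference, the session-scoped fixtures described above could be implemented roughly as follows in `pipeline/check_data/conftest.py`. The option names mirror the flags passed in the `dvc run` command, but the details are an assumption.

```python
# Hypothetical sketch of pipeline/check_data/conftest.py; details may differ from the repository.
import pandas as pd
import pytest
import yaml


def pytest_addoption(parser):
    parser.addoption("--sample_artifact", action="store")
    parser.addoption("--param", action="store")


def _load_params(request):
    with open(request.config.getoption("--param")) as fp:
        return yaml.safe_load(fp)


@pytest.fixture(scope="session")
def data(request):
    # Loaded only once per pytest session and shared by every test that requests it
    params = _load_params(request)
    reference = pd.read_csv(params["data"]["reference_dataset"], skipinitialspace=True)
    sample = pd.read_csv(request.config.getoption("--sample_artifact"), skipinitialspace=True)
    return reference, sample


@pytest.fixture(scope="session")
def ks_alpha(request):
    return float(_load_params(request)["data"]["ks_alpha"])
```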
This stage of the pipeline splits the dataset into train/test, with the test set accounting for 30% of the original dataset. The split is stratified according to the target to keep the same label balance. The test size, stratify column, and random seed used for reproducibility can be changed in the params.yaml file.
dvc run -n segregate \
-p data.test_size,data.stratify,main.random_seed \
-d pipeline/data/preprocessing_data.csv \
-d pipeline/segregate/run.py \
-o pipeline/data/train_data.csv \
-o pipeline/data/test_data.csv \
python pipeline/segregate/run.py --input_artifact pipeline/data/preprocessing_data.csv \
--param params.yaml
git add dvc.yaml dvc.lock
Now that the data and pipeline are up to date, it is time to upload the local files to the remote repository. Please run:
dvc push --remote s3remote
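A minimal sketch of the segregation script described above is shown below; it assumes the parameter names listed in the `-p` flag and simplifies the actual implementation (output paths are assumptions).

```python
# Hypothetical sketch of pipeline/segregate/run.py; output paths and details are assumptions.
import argparse

import pandas as pd
import yaml
from sklearn.model_selection import train_test_split

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Split the clean data into train and test sets")
    parser.add_argument("--input_artifact", type=str, required=True)
    parser.add_argument("--param", type=str, required=True)
    args = parser.parse_args()

    with open(args.param) as fp:
        params = yaml.safe_load(fp)

    df = pd.read_csv(args.input_artifact)
    train, test = train_test_split(
        df,
        test_size=params["data"]["test_size"],        # 0.30 in this project
        random_state=params["main"]["random_seed"],   # reproducibility
        stratify=df[params["data"]["stratify"]],      # keep the label balance
    )
    train.to_csv("pipeline/data/train_data.csv", index=False)
    test.to_csv("pipeline/data/test_data.csv", index=False)
```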
This stage of the pipeline works on the training of the model. For the sake of understanding, a simple Decision Tree model was used as a proof of concept; the focus is on the development of a complete workflow. Hyperparameters are configured using the `params.yaml` file. The workflow is based on Scikit-Learn Pipelines. Overall, we have three pipelines: one to process the numerical features, another to transform the categorical features, and a third that joins the previous two. The final pipeline and the encoder used to encode the target variable are saved using the joblib library.
dvc run -n train \
-p train.export_artifact,data.val_size,data.stratify,main.random_seed \
-M pipeline/data/train_scores.json \
-d pipeline/data/train_data.csv \
-d pipeline/train/run.py \
-d pipeline/train/helper.py \
-d pipeline/train/transformer_feature.py \
-o pipeline/data/model_export \
-o pipeline/data/encoder_export \
python pipeline/train/run.py --train_data pipeline/data/train_data.csv \
--param params.yaml --score_file pipeline/data/train_scores.json
git add dvc.yaml dvc.lock
Now that the data and pipeline are up to date, it is time to upload the local files to the remote repository. Please run:
dvc push --remote s3remote
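To make the description above concrete, a simplified sketch of the pipeline construction is shown below. Column names, hyperparameters, the "salary" target name, and the export paths are assumptions, and the real code is split across `run.py`, `helper.py`, and `transformer_feature.py`.

```python
# Simplified sketch of the training stage; names and parameters are assumptions.
import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("pipeline/data/train_data.csv")
target_encoder = LabelEncoder()
y = target_encoder.fit_transform(df["salary"])   # "salary" is an assumed target column
X = df.drop(columns=["salary"])

numerical = X.select_dtypes(exclude="object").columns.tolist()
categorical = X.select_dtypes(include="object").columns.tolist()

# Pipeline 1: numerical features
numerical_pipe = Pipeline([("imputer", SimpleImputer(strategy="median")),
                           ("scaler", StandardScaler())])

# Pipeline 2: categorical features
categorical_pipe = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                             ("onehot", OneHotEncoder(handle_unknown="ignore"))])

# Pipeline 3: join the previous ones and append the classifier
preprocessor = ColumnTransformer([("num", numerical_pipe, numerical),
                                  ("cat", categorical_pipe, categorical)])
model = Pipeline([("preprocessor", preprocessor),
                  ("classifier", DecisionTreeClassifier(random_state=42))])

model.fit(X, y)

# Export the final pipeline and the target encoder with joblib
joblib.dump(model, "pipeline/data/model_export")
joblib.dump(target_encoder, "pipeline/data/encoder_export")
```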
In this stage we will build a component that fetches the model and the target encoder and tests them on the test dataset.
dvc run -n evaluate \
-M pipeline/data/test_scores.json \
-d pipeline/data/test_data.csv \
-d pipeline/evaluate/run.py \
-d pipeline/data/model_export \
-d pipeline/train/helper.py \
-d pipeline/train/transformer_feature.py \
-d pipeline/data/encoder_export \
-o pipeline/data/slice_output.json \
python pipeline/evaluate/run.py --test_data pipeline/data/test_data.csv \
--model pipeline/data/model_export \
--encoder pipeline/data/encoder_export \
--score_file pipeline/data/test_scores.json \
--slice_file pipeline/data/slice_output.json
git add dvc.yaml dvc.lock
Now that the data and pipeline are up to date, it is time to upload the local files to the remote repository. Please run:
dvc push --remote s3remote
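As an illustration of the slice evaluation, the component could compute metrics per category of each categorical feature, roughly as below; the column names, metric choices, and target name are assumptions.

```python
# Illustrative slice evaluation; column names, metrics and the "salary" target are assumptions.
import json

import joblib
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

model = joblib.load("pipeline/data/model_export")
encoder = joblib.load("pipeline/data/encoder_export")

df = pd.read_csv("pipeline/data/test_data.csv")
y = encoder.transform(df["salary"])
X = df.drop(columns=["salary"])

slice_metrics = {}
for feature in X.select_dtypes(include="object").columns:
    for value in X[feature].unique():
        # Evaluate the model only on the rows belonging to this slice
        mask = (X[feature] == value).values
        preds = model.predict(X[mask])
        slice_metrics[f"{feature}={value}"] = {
            "accuracy": accuracy_score(y[mask], preds),
            "f1": f1_score(y[mask], preds, zero_division=0),
        }

with open("pipeline/data/slice_output.json", "w") as fp:
    json.dump(slice_metrics, fp, indent=2)
```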
In this stage we will write some unit tests to evaluate whether the ML functions return the expected types.
dvc run -n check_model \
-d pipeline/check_model/conftest.py \
-d pipeline/check_model/test_model.py \
-d pipeline/train/transformer_feature.py \
-d pipeline/data/test_data.csv \
-d pipeline/data/model_export \
-d pipeline/data/encoder_export \
pytest pipeline/check_model -s -vv --test_data pipeline/data/test_data.csv \
--model pipeline/data/model_export \
--encoder pipeline/data/encoder_export
git add dvc.yaml dvc.lock
Now that the data and pipeline are up to date, it is time to upload the local files to the remote repository. Please run:
dvc push --remote s3remote
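An example of such a unit test is sketched below; the fixtures (`model`, `encoder`, `test_data`) are assumed to be provided by `pipeline/check_model/conftest.py` from the `--test_data`, `--model`, and `--encoder` options, and the target column name is an assumption.

```python
# Illustrative type-check test in the spirit of pipeline/check_model/test_model.py;
# the fixture names and the "salary" target column are assumptions.
import numpy as np


def test_model_returns_expected_types(model, encoder, test_data):
    X = test_data.drop(columns=["salary"])

    predictions = model.predict(X)

    # The pipeline should return one label per row as a numpy array
    assert isinstance(predictions, np.ndarray)
    assert len(predictions) == len(X)

    # The target encoder should map predictions back to readable strings
    decoded = encoder.inverse_transform(predictions)
    assert all(isinstance(label, str) for label in decoded)
```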
GitHub Actions is a new way to do continuous integration and continuous deployment right from Git. It comes with a handful of ready-made workflows, such as running a linter or checking for flake8 errors. You can configure GitHub Actions to trigger a deployment pipeline on push, giving you a fully automated Continuous Integration pipeline that helps keep your team in sync with the latest features.
In the context of this project, a simple workflow was set up from the GitHub web interface, as shown in the figure below.
All Actions are configured in the file `main_actions.yml`. Details can be found here.
- Trigger of the workflow: `on push`
- The type of runner that the job will run on: `macos-latest`
- Allow the job to access the repository: `actions/checkout@v2`
- Set up DVC: `iterative/setup-dvc@v1`
- Configure AWS credentials: `aws-actions/configure-aws-credentials@v1`
  - Note that it is necessary to configure the repository secrets at `settings/secrets/actions`
- Set up the Python version: `actions/setup-python@v3.0.0`
- Run several commands to install dependencies (on the VM used by GitHub Actions), lint (flake8), run pytest (data, model and API tests), and download artifacts (dvc pull).
A successful build of the proposed workflow is shown below, with the DVC action highlighted, where it is possible to see all tracked artifacts.
FastAPI is a modern API framework that allows you to write code quickly without sacrificing flexibility or extensibility. FastAPI will be used in this project in order to conclude the CI/CD stages. After we build our API locally and test it, we will deploy it to Heroku and test it again once live.
A RESTful API was created and implemented using FastAPI, containing the following features:
- Pydantic body with all columns and type hints of an instance.
- A `schema extra` describing a typical example of an instance, following the FastAPI documentation.
- GET on the root giving a welcome message.
- POST on the predict endpoint to run model inference.
- Three unit tests for the API: one for the GET and two for the POST (high income >50k and low income <=50k).
The API is implemented in `api/main.py`, whereas the tests are in `api/test_main.py`.
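A trimmed sketch of the API is shown below: only a few census columns are kept to illustrate the Pydantic model (v1 style), the `schema extra` example, and the two endpoints; the class name, field subset, and artifact paths are assumptions, and the full `api/main.py` includes every column.

```python
# Trimmed sketch of api/main.py; only a few census columns are shown and the
# model/encoder paths are assumptions, so treat this as an illustration only.
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI()


class Person(BaseModel):
    age: int
    workclass: str
    education: str
    hours_per_week: int = Field(alias="hours-per-week")

    class Config:
        # Typical example rendered in the interactive docs (values are illustrative)
        schema_extra = {
            "example": {
                "age": 39,
                "workclass": "State-gov",
                "education": "Bachelors",
                "hours-per-week": 40,
            }
        }


@app.get("/")
async def root():
    return {"message": "Welcome to the income census API!"}


@app.post("/predict")
async def predict(person: Person):
    # Load the DVC-tracked artifacts and run inference on a one-row DataFrame
    model = joblib.load("pipeline/data/model_export")
    encoder = joblib.load("pipeline/data/encoder_export")

    sample = pd.DataFrame([person.dict(by_alias=True)])
    prediction = encoder.inverse_transform(model.predict(sample))[0]
    return {"prediction": prediction}
```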
For the sake of understanding, and during development, the API was constantly tested using:
uvicorn api.main:app --reload
and using these addresses:
http://127.0.0.1:8000/
http://127.0.0.1:8000/docs
The screenshot below shows a view of the API docs.
To test the API, please run:
pytest api/test_main.py -vv
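The tests use FastAPI's TestClient; a sketch of one GET test and one POST test is shown below. The payload values and the expected label strings are illustrative, not copied from the repository.

```python
# Illustrative excerpt in the spirit of api/test_main.py; payloads and labels are illustrative.
from fastapi.testclient import TestClient

from api.main import app

client = TestClient(app)


def test_get_root():
    response = client.get("/")
    assert response.status_code == 200
    assert "message" in response.json()


def test_post_predict():
    payload = {
        "age": 39,
        "workclass": "State-gov",
        "education": "Bachelors",
        "hours-per-week": 40,
    }
    response = client.post("/predict", json=payload)
    assert response.status_code == 200
    # Label strings are assumptions about how the target encoder names the classes
    assert response.json()["prediction"] in ("<=50K", ">50K")
```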
- The first step is to create a new app in Heroku. It is very important to connect the app to our GitHub repository and enable automatic deploys. The figure below shows a screenshot of this configuration.
- Install the Heroku CLI following the instructions.
- In the root folder of the project, check the Heroku apps already created:
heroku apps
- Check that the buildpack is correct:
heroku buildpacks --app income-census-project
- Update the buildpack if necessary:
heroku buildpacks:set heroku/python --app income-census-project
- Set up access to AWS on Heroku, if using the CLI:
heroku config:set AWS_ACCESS_KEY_ID=xxx AWS_SECRET_ACCESS_KEY=yyy --app income-census-project
- We need to give Heroku the ability to pull in data from DVC upon app start-up. We will install a buildpack that allows the installation of apt files and then define an Aptfile that contains a path to DVC. That is, in the CLI run:
heroku buildpacks:add --index 1 heroku-community/apt --app income-census-project
- Then, in your root project folder, create a file called `Aptfile` that specifies the release of DVC you want installed, e.g. https://github.com/iterative/dvc/releases/download/2.0.18/dvc_2.0.18_amd64.deb
- Add the following code block to your `api/main.py`:
import os

# On Heroku (the DYNO environment variable is set), pull the DVC-tracked artifacts at start-up
if "DYNO" in os.environ and os.path.isdir(".dvc"):
    os.system("dvc config core.no_scm true")
    if os.system("dvc pull -r s3remote") != 0:
        exit("dvc pull failed")
    os.system("rm -r .dvc .apt/usr/lib/dvc")
- Configure the remote repository for Heroku:
heroku git:remote --app income-census-project
- Push all files to the remote repository in Heroku. The command below will install all packages indicated in `requirements.txt` in the Heroku VM. Note that the free tier supports a slug of up to 500 MB; our slug size was 488 MB!
git push heroku main
- To check the remote files, run:
heroku run bash --app income-census-project
- If all previous steps were completed successfully, you will see the message below after opening: https://income-census-project.herokuapp.com/
- At the end, we can query the live API (a sketch of this script is given at the end of this section) by running:
python api/query_live_api.py
The result of this run can be better visualized in the figure below.
- In case you prefer, the live API can also be tested using the FastAPI docs. In a browser, open:
https://income-census-project.herokuapp.com/docs
Right after, click on `Try it out`. Note that, in this case, the example is different from the one used in step 14!
The response of the live API query can also be visualized in the browser, as shown in the figure below:
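For completeness, `api/query_live_api.py`, mentioned in the earlier step, essentially sends a POST request to the live endpoint; a minimal sketch is shown below, with illustrative payload values rather than the exact ones used in the repository.

```python
# Hypothetical sketch of api/query_live_api.py; payload values are illustrative.
import requests

URL = "https://income-census-project.herokuapp.com/predict"

payload = {
    "age": 39,
    "workclass": "State-gov",
    "education": "Bachelors",
    "hours-per-week": 40,
}

response = requests.post(URL, json=payload)

print(f"Status code: {response.status_code}")
print(f"Response body: {response.json()}")
```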