conda create -n env_name python=3.9
conda activate env_name
pip install -r requirements.txt
config/
├── args.json - arguments
└── config.py - configuration setup
Inside config.py, we'll add the code to define key directory locations (we'll add more configurations in later lessons as they're needed)
and inside args.json, we'll add the parameters that are relevant to data processing and model training.
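A minimal sketch of what config.py might look like at this point (DATA_DIR and STORES_DIR are used later in this section; BASE_DIR and CONFIG_DIR are assumed names):
# config/config.py (minimal sketch)
from pathlib import Path

# Key directory locations
BASE_DIR = Path(__file__).parent.parent.absolute()
CONFIG_DIR = Path(BASE_DIR, "config")
DATA_DIR = Path(BASE_DIR, "data")
STORES_DIR = Path(BASE_DIR, "stores")

# Create local directories if they don't already exist
DATA_DIR.mkdir(parents=True, exist_ok=True)
STORES_DIR.mkdir(parents=True, exist_ok=True)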
text_tagging/
└── main.py - training/optimization pipelines
We'll define these core operations inside main.py as we move code from notebooks to the appropriate scripts below (a skeleton sketch follows this list):
elt_data: extract, load and transform data.
optimize: tune hyperparameters to optimize for objective.
train_model: train a model using best parameters from optimization study.
load_artifacts: load trained artifacts from a given run.
predict_tag: predict a tag for a given input.
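A rough skeleton of main.py with these operations stubbed out (the parameter names mirror the CLI calls at the end of this section and are otherwise assumptions):
# text_tagging/main.py (skeleton sketch; bodies get filled in as code is migrated)
def elt_data():
    """Extract, load and transform our data assets."""
    ...

def optimize(args_fp, study_name, num_trials):
    """Tune hyperparameters to optimize for our objective."""
    ...

def train_model(args_fp, experiment_name, run_name):
    """Train a model using the best parameters from the optimization study."""
    ...

def load_artifacts(run_id):
    """Load trained artifacts from a given run."""
    ...

def predict_tag(text, run_id=None):
    """Predict a tag for a given input text."""
    ...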
text_tagging/
├── main.py - training/optimization pipelines
└── utils.py - supplementary utilities
When it comes to migrating our code from notebooks to scripts, it's best to organize based on utility. For example, we can create scripts for the various stages of ML development such as data processing, training, evaluation, prediction, etc.:
text_tagging/
├── data.py - data processing utilities
├── evaluate.py - evaluation components
├── main.py - training/optimization pipelines
├── predict.py - inference utilities
├── train.py - training utilities
└── utils.py - supplementary utilities
- Add the required documentation packages to setup.py:
docs_packages = [
    "mkdocs==1.3.0",
    "mkdocstrings==0.18.1",
]
Define our package:
setup(
    ...
    install_requires=required_packages,
    extras_require={
        "dev": docs_packages,
        "docs": docs_packages,
    },
)
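The snippet above references required_packages, which isn't shown in this section; a minimal sketch of how it could be loaded from requirements.txt (an assumption about the rest of setup.py):
# setup.py (sketch)
from pathlib import Path
from setuptools import setup

# Runtime dependencies come straight from requirements.txt
with open(Path("requirements.txt")) as f:
    required_packages = [ln.strip() for ln in f.readlines() if ln.strip()]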
Now we can install this package with:
python3 -m pip install -e ".[docs]"
python3 -m pip install -e ".[dev]"
- Initialize mkdocs
python3 -m mkdocs new .
.
├─ docs/
│ └─ index.md
└─ mkdocs.yml
- Next we'll create documentation files for each script in our text_tagging directory:
mkdir docs/text_tagging
cd docs/text_tagging
touch main.md utils.md data.md train.md evaluate.md predict.md
cd ../../
- Next we'll add ::: text_tagging.<SCRIPT_NAME> to each file under docs/text_tagging so that mkdocstrings can render each script's docstrings. For example:
# docs/text_tagging/data.md
::: text_tagging.data
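mkdocstrings renders the docstrings it finds in each script, so functions documented like the hypothetical example below will appear on the generated page (the function and its arguments are illustrative, not taken from the actual data.py):
# text_tagging/data.py (illustrative docstring only)
def clean_text(text: str, lower: bool = True) -> str:
    """Clean raw input text.

    Args:
        text (str): raw text to clean.
        lower (bool): whether to lowercase the text.

    Returns:
        str: the cleaned text.
    """
    return text.lower() if lower else text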
- Finally, we'll add some configurations to our mkdocs.yml file that mkdocs automatically created:
# mkdocs.yml
site_name: TEXT TAGGING
nav:
  - Home: index.md
  - workflows:
      - main: text_tagging/main.md
  - text_tagging:
      - data: text_tagging/data.md
      - evaluate: text_tagging/evaluate.md
      - predict: text_tagging/predict.md
      - train: text_tagging/train.md
      - utils: text_tagging/utils.md
theme: readthedocs
plugins:
  - mkdocstrings
watch:
  - .  # reload docs for any file changes
- Serve our documentation locally:
python3 -m mkdocs serve
Style and formatting conventions to keep our code looking consistent.
1. Tools
Black: an in-place reformatter that (mostly) adheres to PEP8.
isort: sorts and formats import statements inside Python scripts.
flake8: a code linter with stylistic conventions that adhere to PEP8.
# setup.py
style_packages = [
    "black==22.3.0",
    "flake8==3.9.2",
    "isort==5.10.1",
]
# Define our package
setup(
    ...
    extras_require={
        "dev": docs_packages + style_packages,
        "docs": docs_packages,
    },
)
2. Configuration
touch pyproject.toml
# Black formatting
[tool.black]
line-length = 100
include = '\.pyi?$'
exclude = '''
/(
.eggs # exclude a few common directories in the
| .git # root of the project
| .hg
| .mypy_cache
| .tox
| venv
| _build
| buck-out
| build
| dist
)/
'''
# iSort
[tool.isort]
profile = "black"
line_length = 79
multi_line_output = 3
include_trailing_comma = true
virtual_env = "venv"
For flake8:
touch .flake8
[flake8]
exclude = venv
ignore = E501, W503, E226
max-line-length = 79
# E501: Line too long
# W503: Line break occurred before binary operator
# E226: Missing white space around arithmetic operator
3. Usage
black .
flake8
isort .
An automation tool that organizes commands for our application's processes.
touch Makefile
# Styling
.PHONY: style
style:
	black .
	flake8
	isort .

# Cleaning
.PHONY: clean
clean: style
	find . -type f -name "*.DS_Store" -ls -delete
	find . | grep -E "(__pycache__|\.pyc|\.pyo)" | xargs rm -rf
	find . | grep -E ".pytest_cache" | xargs rm -rf
	find . | grep -E ".ipynb_checkpoints" | xargs rm -rf
	find . | grep -E ".trash" | xargs rm -rf
	rm -f .coverage

# Environment
.ONESHELL:
venv:
	python3 -m venv venv
	source venv/bin/activate
	python3 -m pip install pip setuptools wheel
	python3 -m pip install -e .

.PHONY: help
help:
	@echo "Commands:"
	@echo "venv    : creates a virtual environment."
	@echo "style   : executes style formatting."
	@echo "clean   : cleans all unnecessary files."
make venv
make style
make clean
app/
├── api.py - FastAPI app
├── gunicorn.py - WSGI script
└── schemas.py - API model schemas
api.py: the main script that will include our API initialization and endpoints.
gunicorn.py: script for defining API worker configurations.
schemas.py: definitions for the different objects we'll use in our resource endpoints.
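A minimal sketch of what schemas.py and api.py could contain at this stage (the endpoint, schema and return values are illustrative assumptions, not the final API):
# app/schemas.py (illustrative)
from pydantic import BaseModel

class PredictPayload(BaseModel):
    text: str

# app/api.py (illustrative)
from http import HTTPStatus

from fastapi import FastAPI

from app.schemas import PredictPayload

app = FastAPI(title="text_tagging", version="0.1")

@app.get("/")
def _index():
    """Health check."""
    return {"message": HTTPStatus.OK.phrase, "status-code": HTTPStatus.OK}

@app.post("/predict")
def _predict(payload: PredictPayload):
    """Predict a tag for the given input text (stub)."""
    return {"text": payload.text, "predicted_tag": None}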
uvicorn app.api:app --host 0.0.0.0 --port 8000 --reload --reload-dir text_tagging --reload-dir app # dev
gunicorn -c app/gunicorn.py -k uvicorn.workers.UvicornWorker app.api:app # prod
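The -c app/gunicorn.py flag points gunicorn at a Python configuration module; a minimal sketch of what it might define (the specific values are assumptions):
# app/gunicorn.py (illustrative worker configuration)
import multiprocessing

bind = "0.0.0.0:8000"
workers = multiprocessing.cpu_count()  # number of worker processes
timeout = 60  # seconds before an unresponsive worker is restarted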
1. Test Code
mkdir tests
cd tests
mkdir code
touch <SCRIPTS>
cd ../
Our tests directory will look like this:
tests/
└── code/
    ├── test_data.py
    ├── test_evaluate.py
    ├── test_main.py
    ├── test_predict.py
    └── test_utils.py
Add the test packages to setup.py:
# setup.py
test_packages = [
    "pytest==7.1.2",
    "pytest-cov==2.10.1",
]
# Define our package
setup(
    ...
    extras_require={
        "dev": docs_packages + style_packages + test_packages,
        "docs": docs_packages,
        "test": test_packages,
    },
)
Configuration
# Pytest
[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = "test_*.py"
addopts = "--strict-markers --disable-pytest-warnings"
markers = [
    "training: tests that involve training",
]
# Pytest coverage Exclusions
[tool.coverage.run]
omit = ["app/gunicorn.py"]
Execution
All tests:
python3 -m pytest
Tests under a directory:
python3 -m pytest tests/code
Coverage:
python3 -m pytest --cov text_tagging --cov-report html
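To make the marker configuration concrete, here's a hypothetical test file with a parametrized test and a test tagged with the training marker (data.clean_text is illustrative, not from the actual codebase):
# tests/code/test_data.py (illustrative sketch)
import pytest

from text_tagging import data

@pytest.mark.parametrize(
    "text, lower, expected",
    [
        ("Hello World", True, "hello world"),
        ("Hello World", False, "Hello World"),
    ],
)
def test_clean_text(text, lower, expected):
    assert data.clean_text(text=text, lower=lower) == expected

@pytest.mark.training
def test_training_artifacts():
    ...  # excluded when running pytest -m "not training"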
2. Test Data
# setup.py
test_packages = [
    "pytest==7.1.2",
    "pytest-cov==2.10.1",
    "great-expectations==0.15.15",
]
- Projects
cd tests
great_expectations init
This will set up a tests/great_expectations directory with the following structure:
tests/great_expectations/
├── checkpoints/
├── expectations/
├── plugins/
├── uncommitted/
├── .gitignore
└── great_expectations.yml
- Data source
The first step is to establish our datasource which tells Great Expectations where our data lives:
great_expectations datasource new
What data would you like Great Expectations to connect to?
1. Files on a filesystem (for processing with Pandas or Spark) 👈
2. Relational database (SQL)
What are you processing your files with?
1. Pandas 👈
2. PySpark
Enter the path of the root directory where the data files are stored: ../data
- Suites
Create expectations manually, interactively or automatically and save them as suites (a set of expectations for a particular data asset).
great_expectations suite new
How would you like to create your Expectation Suite?
1. Manually, without interacting with a sample batch of data (default)
2. Interactively, with a sample batch of data 👈
3. Automatically, using a profiler
Which data asset (accessible by data connector "default_inferred_data_connector_name") would you like to use?
1. labeled_projects.csv
2. projects.csv 👈
3. tags.csv
Name the new Expectation Suite [projects.csv.warning]: projects
Define the expectations in the Jupyter notebook that this command launches:
Table expectations:
# Presence of features
validator.expect_table_columns_to_match_ordered_list(column_list=["id", "tag"])
Column expectations:
# id
validator.expect_column_values_to_be_unique(column="id")
# tag
validator.expect_column_values_to_not_be_null(column="tag")
validator.expect_column_values_to_be_of_type(column="tag", type_="str")
To edit a suite, we can execute the following CLI command:
great_expectations suite edit <SUITE_NAME>
- Checkpoints
great_expectations checkpoint new projects
great_expectations checkpoint new tags
great_expectations checkpoint new labeled_projects
We have to change the lines for data_asset_name (which data asset to run the checkpoint suite on) and expectation_suite_name (name of the suite to use). For example, the projects checkpoint would use the projects.csv data asset and the projects suite.
great_expectations checkpoint run projects
great_expectations checkpoint run tags
great_expectations checkpoint run labeled_projects
- Documentation
great_expectations docs build
3. Test Model
The final aspect of testing ML systems involves how to test machine learning models during training, evaluation, inference and deployment. [Behavioral testing]
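As a flavor of what behavioral testing can look like, here is a hypothetical invariance test built on the predict_tag operation (this assumes predict_tag returns the predicted tag for a given text):
# tests/model/test_behavioral.py (illustrative sketch)
import pytest

from text_tagging import main

@pytest.mark.parametrize(
    "text_a, text_b",
    [
        # Invariance: small rewordings should not change the predicted tag
        (
            "Transfer learning with transformers for text classification.",
            "Transfer learning with transformer models for classifying text.",
        ),
    ],
)
def test_invariance(text_a, text_b):
    assert main.predict_tag(text=text_a) == main.predict_tag(text=text_b)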
Let's create a target in our Makefile that will allow us to execute all of our tests with one call:
# Test
.PHONY: test
test:
	pytest -m "not training"
	cd tests && great_expectations checkpoint run projects
	cd tests && great_expectations checkpoint run tags
	cd tests && great_expectations checkpoint run labeled_projects
make test
Using pre-commit git hooks to run checks before committing. To help us manage all of these important steps, pre-commit hooks are automatically triggered whenever we try to make a commit.
- Install
# setup.py
setup(
    ...
    extras_require={
        "dev": docs_packages + style_packages + test_packages + ["pre-commit==2.19.0"],
        "docs": docs_packages,
        "test": test_packages,
    },
)
- Config
# Simple config
pre-commit sample-config > .pre-commit-config.yaml
cat .pre-commit-config.yaml
- Hooks
- Built-in
# Inside .pre-commit-config.yaml
...
-   id: check-added-large-files
    args: ['--maxkb=1000']
    exclude: "notebooks/text_tagging.ipynb"
...
- Custom
# Inside .pre-commit-config.yaml
...
-   repo: https://github.com/psf/black
    rev: 20.8b1
    hooks:
    -   id: black
        args: []
        files: .
...
- Local
# Inside .pre-commit-config.yaml
...
-   repo: local
    hooks:
    -   id: clean
        name: clean
        entry: make
        args: ["clean"]
        language: system
        pass_filenames: false
- Commit
git add .
git commit -m <MESSAGE>
Versioning our data assets with DVC to keep our datasets reproducible alongside our code.
- Set up
# Initialization
pip install dvc==2.10.2
dvc init
- Remote storage
After initializing DVC, we can establish where our remote storage will be. We'll be creating and using the stores/blob directory as our remote storage but in a production setting this would be something like S3.
# Inside config/config.py
BLOB_STORE = Path(STORES_DIR, "blob")
BLOB_STORE.mkdir(parents=True, exist_ok=True)
We need to notify DVC about this storage location so it knows where to save the data assets:
dvc remote add -d storage stores/blob
- Add data
Now we're ready to add our data to our remote storage. This will automatically add the respective data assets to a .gitignore file (a new one will be created inside the data directory) and create pointer files which point to where the data assets are actually stored (our remote storage).
# Add artifacts
dvc add data/projects.csv
dvc add data/tags.csv
dvc add data/labeled_projects.csv
Here are all the pointer files that were created for each data artifact we added:
data
├── .gitignore
├── labeled_projects.csv
├── labeled_projects.csv.dvc
├── projects.csv
├── projects.csv.dvc
├── tags.csv
└── tags.csv.dvc
Each pointer file will contain the md5 hash, size and the location (with respect to the data directory) which we'll be checking into our git repository.
# data/projects.csv.dvc
outs:
- md5: b103754da50e2e3e969894aa94a490ee
  size: 266992
  path: projects.csv
- Push
Now we're ready to push our artifacts to our blob store:
dvc push
If we inspect our storage (stores/blob), we can see that the data is efficiently stored:
# Remote storage
stores
└── blob
├── 3e
│ └── 173e183b81085ff2d2dc3f137020ba
├── 72
│ └── 2d428f0e7add4b359d287ec15d54ec
...
- Pull
When someone else wants to pull our data assets, we can use the pull command to fetch from our remote storage to our local directories:
dvc pull
# Makefile
.PHONY: dvc
dvc:
	dvc add data/projects.csv
	dvc add data/tags.csv
	dvc add data/labeled_projects.csv
	dvc push
Packaging our application into reproducible and scalable containers.
- Dockerfile
touch Dockerfile
- Build images
docker build -t text_tagging:latest -f Dockerfile .
- Run containers
docker run -p 8000:8000 --name text_tagging text_tagging:latest
docker stop <CONTAINER_ID> # stop a running container
docker rm <CONTAINER_ID> # remove a container
docker stop $(docker ps -a -q) # stop all containers
docker rm $(docker ps -a -q) # remove all containers
Creating an interactive dashboard to visually inspect our application using Streamlit.
- Set up
# Setup
pip install streamlit==1.10.0
mkdir streamlit
touch streamlit/app.py
streamlit run streamlit/app.py
This will automatically open up the streamlit dashboard for us on http://localhost:8501.
To see these changes on our dashboard, we can refresh our dashboard page (press R) or set it Always rerun (press A).
- Sections
We'll start by outlining the sections we want to have in our dashboard by editing our streamlit/app.py script:
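A minimal outline sketch (the title and section names are illustrative; the imports anticipate the caching snippet below):
# streamlit/app.py (outline sketch)
from pathlib import Path

import pandas as pd
import streamlit as st

from config import config

# Title
st.title("Text Tagging")

# Sections
st.header("Data")
st.header("Performance")
st.header("Inference")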
- Caching
@st.cache()
def load_data():
    projects_fp = Path(config.DATA_DIR, "labeled_projects.csv")
    df = pd.read_csv(projects_fp)
    return df
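The cached loader can then back the Data section, for example (illustrative usage):
df = load_data()
st.text(f"Projects (count: {len(df)})")
st.write(df)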
Using workflows to establish continuous integration and delivery pipelines to reliably iterate on our application.
Workflows
With GitHub Actions, we are creating automatic workflows to do something for us. We'll start by creating a .github/workflows directory to organize all of our workflows.
mkdir -p .github/workflows
touch .github/workflows/testing.yml
touch .github/workflows/documentation.yml
- Events
Workflows are triggered by an event, which can be something that occurs on a schedule (cron), webhook or manually. In our application, we'll be using the push and pull request webhook events to run the testing workflow when someone directly pushes or submits a PR to the main branch.
# .github/workflows/testing.yml
on:
  push:
    branches:
      - main
      - master
  pull_request:
    branches:
      - main
      - master
- Jobs
Once the event is triggered, a set of jobs run on a runner, which is the application that runs the job using a specific operating system. Our first (and only) job is test-code which runs on the latest version of ubuntu.
# .github/workflows/testing.yml
jobs:
  test-code:
    runs-on: ubuntu-latest
- Steps
Each job contains a series of steps which are executed in order. Each step has a name, as well as actions to use from the GitHub Action marketplace or commands we want to run. For the test-code job, the steps are to checkout the repo, install the necessary dependencies and run tests.
# .github/workflows/testing.yml
jobs:
  test-code:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repo
        uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.7.13
      - name: Caching
        uses: actions/cache@v2
        with:
          path: ${{ env.pythonLocation }}
          key: ${{ env.pythonLocation }}-${{ hashFiles('setup.py') }}-${{ hashFiles('requirements.txt') }}
      - name: Install dependencies
        run: |
          python3 -m pip install -e ".[test]" --no-cache-dir
      - name: Execute tests
        run: pytest tests/code --ignore tests/code/test_main.py --ignore tests/code/test_data.py
Finally, here are the CLI calls for the main.py operations we defined earlier:
python main.py elt-data
python main.py optimize --args-fp="config/args.json" --study-name="optimization" --num-trials=10
python main.py train-model --args-fp="config/args.json" --experiment-name="baselines" --run-name="sgd"
python main.py predict-tag --text="Transfer learning with transformers for text classification."