SETUP NEW MODEL:
STEP 0 - CHANGE DIRECTORY:
Set your working directory to training:
cd training
STEP 1 - CHANGE CONFIGS:
- In training/pipeline_commands.sh, make sure to change CUSTOM_TFX_IMAGE and PIPELINE_NAME (see the sketch after this list).
  - CUSTOM_TFX_IMAGE must also match the image in training/build.yaml and IMAGE in training/main/src/pipelines/configs.py.
  - PIPELINE_NAME must also match the value in training/main/src/pipelines/configs.py.
- GCS_BUCKET_NAME in training/main/src/pipelines/configs.py needs to be changed to whichever bucket you want to save the model to.
- Pretty much everything high level is controlled in the config, so change values to your heart's content (e.g. training length, input token length, etc.).
- MOST IMPORTANT, MAYBE: make sure to change MODEL_NAME in training/main/src/pipelines/configs.py to a descriptive model name, as this is what is used for evaluation and for querying a model's metrics.
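As a rough illustration only (assuming these are plain environment-variable assignments; the project, image path and pipeline name below are made-up placeholders, so check your own pipeline_commands.sh), the edits might look something like:
# Hypothetical values - replace with your own project, image and pipeline name
export CUSTOM_TFX_IMAGE=gcr.io/my-project/my-model-tfx   # must match training/build.yaml and IMAGE in configs.py
export PIPELINE_NAME=my-model-pipeline                   # must match PIPELINE_NAME in configs.py
Remember that GCS_BUCKET_NAME and MODEL_NAME are changed in training/main/src/pipelines/configs.py, not in pipeline_commands.sh.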
STEP 2 - SOURCE COMMANDS:
source pipeline_commands.sh
STEP 3 - SETUP PIPELINE:
you can now use the commands
build_pipeline
update_pipeline
run_pipeline
to build, update, and launch your Kubeflow pipeline respectively.
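For example, a first end-to-end run might look like the following (a usage sketch, assuming these shell functions are defined by pipeline_commands.sh and that you are already authenticated against your GCP project):
cd training
source pipeline_commands.sh
build_pipeline   # build the pipeline (and custom TFX image)
run_pipeline     # launch a run on Kubeflow

# After changing code or configs, update and rerun:
update_pipeline
run_pipeline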
NOTE: VERSIONS USED
tfx=0.28
skaffold=v1.17.0 (should work with v2, just change the build.yaml)
tensorflow=2.4.1
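If you want to sanity-check your local tool versions against the ones above, these standard CLI/Python invocations (not part of the original setup) will print them:
skaffold version
python -c "import tfx, tensorflow; print(tfx.__version__, tensorflow.__version__)"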
This codebase is split into two main folders: training and serving.
Each of these folders should have its own README, which explains how to run the pipeline/service locally or in the cloud, along with any setup instructions.
The purpose of this README is to hold any information that is required for both the training and serving folders. Add anything here that you feel is relevant to both.
We use pyenv to manage our Python version, and this is specified in a .python-version file in the serving and training directories.
To get started, cd into the training or serving directory, and make sure you have the correct Python version installed with pyenv:
pyenv install `cat .python-version`
pyenv local `cat .python-version` # Activate the correct python version
We currently use Poetry for Python package management. We prefer Poetry over Pipenv or similar as it seems to be simpler and faster.
Poetry is used to create a virtual environment and install all Python packages inside it. There are separate pyproject.toml and poetry.lock files for the training and serving folders.
To install poetry just run:
curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python -
Once installed, you can install all the required packages (including dev packages) with the following:
poetry install
from either the serving dir or the training dir.
To enter the virtualenv and run commands with the installed packages, use
poetry shell
which will activate the virtualenv for you.
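Alternatively (standard Poetry usage rather than anything specific to this repo), you can run a single command inside the virtualenv without activating it:
poetry run python --version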
To add a new package, you can run:
poetry add <package>
See more advanced usage at https://python-poetry.org/docs/cli/#add. Make sure to commit the new poetry.lock file to git if you add any new packages.
This project has an automatic linter setup which runs both Black and Flake8. A good writeup of this solution is here.
To setup precommit:
# Install pre-commit
pip install pre-commit
# Setup pre-commit hooks
pre-commit install
To run the precommit on all files:
make pre-commit
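If you don't have make available, the equivalent direct invocation (standard pre-commit usage, assuming the make target simply wraps it) is:
pre-commit run --all-files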
In addition to being run on every commit, this is enforced by the linter stage in bibcd, which builds a Docker image and runs pre-commit on all files.
You can use the MLCLI tool to build and run your pipelines.
Please look at the README for instructions on how to get started.