
alt text

PDK - Pachyderm | Determined | KServe

Deployment and Setup Guide

Date/Revision: February 23, 2024

This page contains step-by-step guides for installing the infrastructure and all necessary components for the PDK environment, covering different Kubernetes platforms.

The first step is to provision an environment with the PDK components. If you don't have an environment available, follow the links below for deployment information. If you already have access to an environment, you can go directly to the Creating PDK Dogs-and-Cats Assets section for PDK-specific instructions.

 

Deploying the PDK Components

Click on your platform of choice to access the specific deployment guide for that platform.

alt text alt text alt text

 

Regardless of the selected platform, do not proceed until you have a fully functioning cluster, with MLDM, MLDE and KServe.

 

Creating PDK Dogs-and-Cats Assets

The diagram below illustrates how the PDK flow will work:

alt text

  • A new project with 2 pipelines will be created in MLDM
    • Data (a collection of files) will be uploaded to the MLDM repository
    • This repository will be the input for the 'Train' pipeline, which will start automatically and create a new Experiment in MLDE
    • To generate a new Experiment, the pipeline will need to download the assets (configuration + code) from GitHub
      • Technically speaking, these assets can be stored anywhere, but GitHub is the easiest way to maintain the code
    • Once the experiment is completed, MLDM will register the top checkpoint with MLDE and create a configuration file with information about the model, which will be used to deploy it
    • This configuration file will be stored in the repository that serves as the input for the 'Deploy' pipeline, which will download the checkpoint from MLDE and deploy the model to KServe, using the configuration file generated by the 'Train' pipeline.
  • Each pipeline will pull a specific container image from a registry. You will find instructions in this repository about how you can create your own images and push them to your own registry.
  • The container images have the logic to initiate the MLDE experiment (Train pipeline) and deploy the model to KServe (Deploy pipeline). You can study the code by looking through the example folders.
  • Sensitive data, like server URLs, passwords, etc., will be stored in a secret (that you created as part of the platform setup) and mapped to environment variables at runtime; the MLDM pipeline is then able to pass those values forward to MLDE. A sketch of such a secret is shown below.
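The exact secret name and key names depend on your platform setup and on what your pipeline definitions expect; the following is only an illustrative sketch of how such a secret could be created with kubectl (all names below are hypothetical).

# Illustrative only: secret name, namespace variable and key names must match your own setup
kubectl create secret generic pipeline-secret \
  --namespace ${MLDM_NAMESPACE} \
  --from-literal=det_master="${MLDE_HOST}:${MLDE_PORT}" \
  --from-literal=det_user="admin" \
  --from-literal=det_password="" \
  --from-literal=pac_token=""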

 

This repository includes an Examples folder with a number of sample PDK projects. Each PDK example will have 3 main components:

  • MLDE Experiment: includes the code and other assets that will be needed to train the model inside MLDE. This code will be pushed to GitHub, where it will be downloaded by the MLDM pipeline.

  • Docker Images: the 'Train' and 'Deploy' images described above. Since the same training image can be used with all models, it is located in a separate folder. As part of this document, we will walk through the steps of building and pushing the images to the registry. Optionally, you can use the hosted images from the provided example (if you don't want to build and push your own container images).

  • Pipeline definitions: these are JSON files that will create the Train and Deploy pipelines, assigning the docker images that will be used by each.

In this guide, we will deploy one of the example projects (Dogs and Cats), to ensure that all PDK components are working properly. For each example, you will find a brief description of how to set it up and run the PDK flow, as well as sample data to test the inference service.

If you are planning on creating your own images or changing the experiment settings, the easiest way is to fork the repository, clone it locally, and make the changes:

git clone https://github.com/determined-ai/pdk.git .

 

Once you clone the repository, go to the examples/dog-cat folder, which contains all the necessary assets:

alt text

 

Set Environment Variables from Config Map

If you've followed the setup instructions provided in this repository, you now have a working cluster with a config map that contains a number of environment variables. Use the commands below to load them. If you did not follow those instructions to create your environment, some of these variables will still be required to set up PDK; make sure to assign the proper values to them.
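To confirm the config map is present (and to see all available keys) before exporting, you can run:

kubectl get configmap pdk-config -o yaml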

 

export AZ_REGION=$(kubectl get cm pdk-config -o=jsonpath='{.data.region}') && echo $AZ_REGION

export MLDM_BUCKET_NAME=$(kubectl get cm pdk-config -o=jsonpath='{.data.mldm_bucket_name}') && echo $MLDM_BUCKET_NAME

export MLDM_HOST=$(kubectl get cm pdk-config -o=jsonpath='{.data.mldm_host}') && echo $MLDM_HOST

export MLDM_PORT=$(kubectl get cm pdk-config -o=jsonpath='{.data.mldm_port}') && echo $MLDM_PORT

export MLDM_URL=$(kubectl get cm pdk-config -o=jsonpath='{.data.mldm_url}') && echo $MLDM_URL

export MLDE_BUCKET_NAME=$(kubectl get cm pdk-config -o=jsonpath='{.data.mlde_bucket_name}') && echo $MLDE_BUCKET_NAME

export MLDE_HOST=$(kubectl get cm pdk-config -o=jsonpath='{.data.mlde_host}') && echo $MLDE_HOST

export MLDE_PORT=$(kubectl get cm pdk-config -o=jsonpath='{.data.mlde_port}') && echo $MLDE_PORT

export MLDE_URL=$(kubectl get cm pdk-config -o=jsonpath='{.data.mlde_url}') && echo $MLDE_URL

export MODEL_ASSETS_BUCKET_NAME=$(kubectl get cm pdk-config -o=jsonpath='{.data.model_assets_bucket_name}') && echo $MODEL_ASSETS_BUCKET_NAME

export KSERVE_MODELS_NAMESPACE=$(kubectl get cm pdk-config -o=jsonpath='{.data.kserve_model_namespace}') && echo $KSERVE_MODELS_NAMESPACE

export INGRESS_HOST=$(kubectl get cm pdk-config -o=jsonpath='{.data.kserve_ingress_host}') && echo $INGRESS_HOST

export INGRESS_PORT=$(kubectl get cm pdk-config -o=jsonpath='{.data.kserve_ingress_port}') && echo $INGRESS_PORT

export DB_CONNECTION_URL=$(kubectl get cm pdk-config -o=jsonpath='{.data.db_connection_string}') && echo $DB_CONNECTION_URL

export REGISTRY_URI=$(kubectl get cm pdk-config -o=jsonpath='{.data.registry_uri}') && echo $REGISTRY_URI

export NAME=$(kubectl get cm pdk-config -o=jsonpath='{.data.pdk_name}') && echo $NAME

 

Create folders in the Storage Bucket for Dogs-and-Cats PDK

Create the following folders in the storage bucket:

  • dogs-and-cats
  • dogs-and-cats/config
  • dogs-and-cats/model-store

Check the Useful Commands section in the AWS and GCP deployment pages for help.
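As a sketch, the folders can also be created from the command line, assuming the target is the model assets bucket (${MODEL_ASSETS_BUCKET_NAME}); adjust the bucket name if your environment uses a different one.

# AWS S3 (folders are just key prefixes):
aws s3api put-object --bucket ${MODEL_ASSETS_BUCKET_NAME} --key dogs-and-cats/config/
aws s3api put-object --bucket ${MODEL_ASSETS_BUCKET_NAME} --key dogs-and-cats/model-store/

# GCP GCS (an empty placeholder object makes the prefix visible):
echo -n "" | gsutil cp - gs://${MODEL_ASSETS_BUCKET_NAME}/dogs-and-cats/config/.keep
echo -n "" | gsutil cp - gs://${MODEL_ASSETS_BUCKET_NAME}/dogs-and-cats/model-store/.keep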

PS: This step can be skipped if you are not using storage buckets in your PDK environment.

 

MLDE Experiment files

In the dog-cat folder, go to the experiment folder.

We'll take a look at some of the files, but the only one you should (optionally) change is the const.yaml file, which contains the configuration for the MLDE experiment.

In this file, we can see the MLDM parameters that will be used by MLDE. They have empty values because they will be assigned at runtime (through kubernetes secrets mapped as environment variables). Keep in mind that the pipeline is running in one container, which has direct access to the input images, while the MLDE experiment will run in a different container (inside the GPU node), which does not have direct access. For that reason, the training code will connect to MLDM and download the input images so it can train the model.

Also, a Workspace and Project were configured for this experiment. You can change the name of both:

alt text

 

Don't forget to create a Workspace and a Project in MLDE with the same name as configured in the file; otherwise, the experiment will fail to run. This can be done in the Workspaces page in the UI.

alt text

 

The workspace and project can also be created through the command line:

export DET_MASTER=${MLDE_HOST}:${MLDE_PORT}

det u login admin

det w create "PDK Demos"

det p create "PDK Demos" pdk-dogs-and-cats

 

A brief description of the Experiment files:

  • data.py: this file contains logic to retrieve and structure the training images from the MLDM repository. Study the download_pach_repo function to understand how the client pulls the files. The unique ID of each commit is sent through the environment variables (and can be seen in the logs).

  • model_def.py: this is the script that controls model training. It uses the PyTorchTrial API to provide out-of-the-box capabilities like distributed training, checkpointing, hyperparameter search, etc., without the need for additional coding.

  • startup_hook.sh: this file will be executed for every experiment, before the Python script. It's a good place to run any routines required to prepare the container for executing the Python code.


 

The experiment files don't need to be modified, except for the Workspace and Project name in the const.yaml file. Do keep in mind that, at runtime, the pipeline will pull this code from GitHub: any changes to any of the files need to be pushed to your repository.

 

MLDM Images

In this step, we'll set up the Train and Deploy images. There's no need to change any of the code, though we will review some key parts of it.

In the examples/training_container folder, you will find the files for the Train image. If you wish to test this flow as-is, there will be no need to rebuild or push new images to the repository. However, assuming that you want to make changes to it (or adapt this code to a different type of model), we'll review the necessary steps.

Taking a closer look at the train.py file, we can see that a number of input arguments are being parsed:

...
def parse_args():
    parser = argparse.ArgumentParser(
        description="Determined AI Experiment Runner"
    )

    parser.add_argument(
        "--config",
        type=str,
        help="Determined's experiment configuration file",
    )

    parser.add_argument(
        "--git-url",
        type=str,
        help="Git URL of the repository containing the model code",
    )

    parser.add_argument(
        "--git-ref",
        type=str,
        help="Git Commit/Tag/Branch to use",
    )
...

These arguments are configured in the pipeline definition. Depending on how your PDK environment is set up, you will need to configure additional attributes.
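For reference, the Train pipeline ends up invoking the script roughly like this (a sketch; the exact command is defined in the pipeline JSON shown later in this guide):

python train.py \
  --git-url https://git@github.com:/determined-ai/pdk.git \
  --git-ref main \
  --sub-dir examples/dog-cat/experiment \
  --config const.yaml \
  --repo dogs-and-cats-data \
  --model dogs-and-cats \
  --project pdk-dogs-and-cats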

Then, in a different function, the MLDM information is mapped to variables that will be sent to MLDE:

def setup_config(config_file, repo, pipeline, job_id, project):
    config = read_config(config_file)
    config["data"]["pachyderm"]["host"] = os.getenv("PACHD_LB_SERVICE_HOST")
    config["data"]["pachyderm"]["port"] = os.getenv("PACHD_LB_SERVICE_PORT")
    config["data"]["pachyderm"]["repo"] = repo
    config["data"]["pachyderm"]["branch"] = job_id
    config["data"]["pachyderm"]["token"] = os.getenv("PAC_TOKEN")
    config["data"]["pachyderm"]["project"] = project

    config["labels"] = [repo, job_id, pipeline]

    return config

The environment variables will be mapped from Kubernetes secrets. We will see this mapping in the pipeline definition file.
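If you want to double-check the values the pipeline will receive, you can inspect the secret created during platform setup. A sketch, assuming the secret is named pipeline-secret and lives in the MLDM namespace (adjust both names and the key to your own setup):

kubectl -n ${MLDM_NAMESPACE} get secret pipeline-secret -o yaml

# Decode a single key (key name is illustrative):
kubectl -n ${MLDM_NAMESPACE} get secret pipeline-secret -o jsonpath='{.data.det_master}' | base64 -d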

You should, of course, study the entire code. The goal here is to show how data in Kubernetes secrets can be mapped as environment variables and used inside MLDM pipelines, which then pass those values on to MLDE for model training.

 

Build and push the Train image

If you are not planning on building your own images, you can skip this section. The pipelines are configured by default with public images you can use for testing.

Before continuing, make sure Docker Desktop is running.

The first step will be to build and push the Train image. There's no need to make changes to any files.

PS: If you're running this on macOS, additional settings are needed to build the image for Linux (otherwise it will fail to run). They are included below.

Go to the /examples/training_container folder and run the commands below to build, tag and push the Train image. Don't forget to rename the images (replace <your_name> with your own identifier).

export DOCKER_DEFAULT_PLATFORM=linux/amd64

docker buildx build --pull --no-cache --platform linux/amd64 -t ${REGISTRY_URI}/<your_name>_cats_dogs_train:1.0 .

# IF YOU ARE USING ECR, YOU MUST CREATE THE REPOSITORY FIRST
## Execute these commands only for AWS ECR
export REGISTRY_URL=<the value of REGISTRY_URI without the repository name>
aws ecr get-login-password --region ${AZ_REGION} | docker login --username AWS --password-stdin ${REGISTRY_URL}
aws ecr create-repository --repository-name=${NAME}/<your_name>_cats_dogs_train --region ${AZ_REGION}
##

docker push ${REGISTRY_URI}/<your_name>_cats_dogs_train:1.0

 

The build process can take several minutes. PS: If you do need to rebuild this image for whatever reason, make sure to change the version number (and update the pipeline JSON file with the new version number). This will force the container to pull the new version of the image, instead of using the cached one.

Check your registry to make sure the image was pushed successfully. Review the command output for EOF or other error messages and retry as needed.
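On AWS ECR, for example, you can confirm the pushed tag with the CLI (a sketch assuming the repository name used above):

aws ecr describe-images --repository-name ${NAME}/<your_name>_cats_dogs_train --region ${AZ_REGION}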

 

Build and push the Deploy image

Go to the examples/dog-cat/container/deploy folder. The code for the Deploy image is more complex, since it also involves KServe. Study the code to understand how the process is handled (the common.py file contains utility functions).

Run these commands to build, tag and push the Deploy image:

cd ../deploy

docker buildx build --pull --no-cache --platform linux/amd64 -t ${REGISTRY_URI}/<your_name>_cats_dogs_deploy:1.0 .

# IF YOU ARE USING ECR, YOU MUST CREATE THE REPOSITORY FIRST
## Execute these commands only for AWS ECR
aws ecr get-login-password --region ${AZ_REGION} | docker login --username AWS --password-stdin ${REGISTRY_URL}
aws ecr create-repository --repository-name=${NAME}/<your_name>_cats_dogs_deploy --region ${AZ_REGION}
##

docker push ${REGISTRY_URI}/<your_name>_cats_dogs_deploy:1.0

This can take a long time, because of the dependencies needed to build the image.

 

Commit code to Github

If you made any changes to any of the files, make sure to push them to your Github repo.

PS: if you're using a Mac, delete the .DS_Store files before committing (or add them to .gitignore, as shown after the commands below).

find . -name '.DS_Store' -type f -delete

git add .

git status

git commit -m 'changed experiment files'

git remote add origin https://github.com/YOUR_GIT_USERNAME/pdk.git

git push -u origin main
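Alternatively, you can keep .DS_Store files out of the repository for good by adding the pattern to .gitignore before committing:

echo ".DS_Store" >> .gitignore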

 

Preparing the Train pipeline

First, create the project and repo in MLDM:

pachctl connect ${MLDM_URL}

pachctl config set active-context ${MLDM_URL}

pachctl create project pdk-dogs-and-cats

pachctl config update context --project pdk-dogs-and-cats

pachctl create repo dogs-and-cats-data

pachctl list repo

Next, go to the pipelines folder.

In this folder, there are 2 sets of pipeline definition files: one for on-prem environments (shared folders) and another for environments that use cloud buckets. The differences between them are:

  • On-prem environments must mount the shared folder into the containers where the pipelines will run, so the code has access to the files there. This can be configured through the pod_patch parameter, which is applied as a JSON Patch. Within this parameter, set the path to your shared folders (in our deployment example, we use the /mnt/efs/shared_fs path). More information about the pod_patch configuration can be found in the Documentation page.
  • In on-prem environments, the pipeline containers must run as root to avoid permission errors in the shared folder.
  • In on-prem environments, a service account parameter must be set, to allow the deployment code to access the MLDM repository through the S3 interface. For environments that use cloud storage, these permissions are granted through service account permission mapping.

If you have a cloud-based environment, use the training-pipeline.json and deployment-pipeline.json files.

If you have an on-prem environment with shared folders, use the _onprem_training-pipeline.json and _onprem_deployment-pipeline.json files.

In the Training pipeline file, change the command line to point to your GitHub repo (if you want to run your own code), and the image name to match the image you just pushed. You can leave the default values if you did not build an image or make any changes to the experiment code.

"stdin": [
      "python train.py --git-url https://git@github.com:/determined-ai/pdk.git --git-ref main --sub-dir examples/dog-cat/experiment --config const.yaml --repo dogs-and-cats-data --model dogs-and-cats --project pdk-dogs-and-cats"
    ],
    "image": "pachyderm/pdk:train-v0.0.3",

Now we're ready to create the pipelines.

 

Step 2.3: Creating the Pipelines

Go back to the pipelines folder and create the pipeline (make sure to use the right file for your environment):

pachctl create pipeline -f training-pipeline.json

pachctl list pipelines

 

The MLDM UI will show the new Project, the repository and the pipeline:

alt text

 

Each new pipeline will create a pod in the ${MLDM_NAMESPACE} namespace. With the cluster defaults in place, the pod will be deleted if there are no active workloads to process. Check the status of the pod before continuing. An ImagePullBackOff status means the cluster was unable to pull the image from your registry. Other errors might indicate lack of permissions, etc.
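To check the pipeline pod status and diagnose image pull or permission problems, you can use kubectl (assuming ${MLDM_NAMESPACE} holds your MLDM namespace):

kubectl -n ${MLDM_NAMESPACE} get pods

kubectl -n ${MLDM_NAMESPACE} describe pod <pipeline-pod-name>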

Next, create the deployment pipeline:

The deployment pipeline will deploy the trained model to KServe for inference. To recap, it completes the following steps:

  1. Gets triggered when a new checkpoint is stored in the dogs-and-cats-model repo.
  2. Pulls the checkpoint from MLDE and loads the trial/model.
  3. Saves the model as a ScriptModule.
  4. Creates a .mar file from the ScriptModule and the custom TorchServe handler.
  5. Creates the config.properties file for the model.
  6. Uploads the .mar file and the config.properties file to the storage bucket (a verification sketch follows this list).
  7. Connects to the Kubernetes cluster and creates the InferenceService. On the first run it creates a brand-new InferenceService; if an older version already exists, it performs a rolling update using the updated model.
  8. Waits for the InferenceService to be available and provides the URL.
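After a successful run, you could verify the uploaded model artifacts in the bucket; a sketch assuming the model assets bucket created earlier (use the AWS or GCP command, depending on your environment):

aws s3 ls s3://${MODEL_ASSETS_BUCKET_NAME}/dogs-and-cats/model-store/

gsutil ls gs://${MODEL_ASSETS_BUCKET_NAME}/dogs-and-cats/model-store/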

 

As before, select the correct JSON file based on your environment and update the image name and the arguments.

If you have a cloud environment, make sure to set the following parameters in the command line:

  • cloud-model-host: aws or gcp
  • cloud-model-bucket: the bucket used to store models for KServe (${MODEL_ASSETS_BUCKET_NAME})

For on-prem, these attributes are not necessary, and the service account configured in the file is correct.

Also, replace the path to your image, or use the default value.

 "stdin": [
      "python deploy.py --deployment-name dog-cat --cloud-model-host gcp --cloud-model-bucket pdk-repo-models --resource-requests cpu=2,memory=8Gi --resource-limits cpu=10,memory=8Gi"
    ],
    "image": "pachyderm/pdk:dog-cat-deploy-v0.0.3",

 

Create the deploy pipeline:

pachctl create pipeline -f deployment-pipeline.json

pachctl list pipelines

 

The MLDM UI should now display both pipelines, connected (since the output of the Train pipeline is the input for the Deploy pipeline)

alt text

 

It will take a few minutes for the pipeline to pull the image from the registry. The status will change to 'Success' in the UI once the pipeline is up and running.

Our environment should now be ready to receive and process data.

 

Step 3: Running the Pipeline

As mentioned before, the pipelines run automatically when new data is committed to the input repository dogs-and-cats-data.

Some sample images of dogs and cats can be found in the sample-data folder. Unzip the dataset-dog-cat.zip file to obtain a sample dataset that can be used to train the model.

With the command below, you can commit all images in the dog-cat directory of your machine to the folder data1 in the MLDM repository dogs-and-cats-data. The folder data1 will be created as part of the commit process; make sure to increment the number if you need to re-upload this folder (otherwise it won't be considered as new data by MLDM).

IMPORTANT: While the folder data1 can have any name, do not use the words "dog(s)" or "cat(s)" as it will impact the labeling of the images in the data pre-processing code.

PS: If you're using macOS and browsed through the images, delete all .DS_Store files before uploading, as they can break the pipeline (the code doesn't handle that exception).

find ./dog-cat/ -name '.DS_Store' -type f -delete

pachctl put file dogs-and-cats-data@master:/data1 -f ./dog-cat -r

Once the uploads are complete, MLDM will start the training pipeline. At this stage, check the MLDE UI to see the experiment run. Once it completes, also check the MLDE Model Registry to see a new model registered. The Model Version name will be equal to the MLDM Commit ID.
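You can also follow the run from the command line with the MLDE CLI (a sketch, assuming DET_MASTER is still set from the earlier steps):

det experiment list

det model list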

The new experiment will appear in the project inside your Workspace:

alt text

The experiment might take a minute to start, as it's preparing the environment. If there are no GPUs available, a new node will be provisioned automatically.

 

Once the training is complete, the deployment pipeline will be executed. You can look at the logs of the pipeline execution by clicking on the pipeline, then clicking on Subjob - Running. You should see a message in the logs about the model being deployed to KServe.

alt text

Once the pipeline execution completes, you should have a new InferenceService called dog-cat in the models namespace. You can validate that with this command:

kubectl -n ${KSERVE_MODELS_NAMESPACE} get inferenceservices

This is the expected output of this command:

kubectl -n ${KSERVE_MODELS_NAMESPACE} get inferenceservices
NAME           URL                                      READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION            AGE
dog-cat        http://dog-cat.models.example.com        True           100                              dog-cat-predictor-00001        2m5s
sklearn-iris   http://sklearn-iris.models.example.com   True           100                              sklearn-iris-predictor-00001   120m

It might take a minute for the inference service to go from Unknown to True.
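Instead of re-running the command manually, you can wait for the Ready condition (a sketch, assuming the service is named dog-cat as in the output above):

kubectl -n ${KSERVE_MODELS_NAMESPACE} wait --for=condition=Ready inferenceservice/dog-cat --timeout=600s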

 

Step 4: Making a Prediction

With everything ready to go, it is time to make a prediction with the dog-cat InferenceService.

KServe expects data to be submitted in the JSON format. For this simple test, you can find cat.json and dog.json in the sample-data directory.

If you want to convert your own images to JSON, you can use the img2bytearray.py Python script in the internal Github repo.

Once the JSON files are ready, we can make a call to the inference service.

 

To make a prediction, you can use the curl command below. First, let's submit the cat.json file. The ${INGRESS_HOST} and ${INGRESS_PORT} variables were set earlier from the config map; if they are not set, replace them with your istio-ingressgateway external IP address and port, then execute the command.

curl -v \
-H "Content-Type: application/json" \
-H "Host: dog-cat.models.example.com" \
http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/dogs-and-cats:predict \
-d @./cat.json

Then, make a prediction for dog.json by executing the same command with the other file:

curl -v \
-H "Content-Type: application/json" \
-H "Host: dog-cat.models.example.com" \
http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/dogs-and-cats:predict \
-d @./dog.json

If all goes well, you should get the predictions returned for both the cat.json and the dog.json examples with the HTTP status 200 (OK).

For cat.json the response should be a class 1 prediction and for the dog.json it should be a class 0 prediction.

alt text

If this works, you have successfully deployed the Pachyderm-Determined-KServe (PDK) environment.

Remaining Work / Known Issues

Pending work that needs to be done:

  • None

Known issues:

  • For the dogs-and-cats use case, if the committed folder has dogs or cats in the name, the images will be incorrectly labeled for training.

License

MIT