
PDK - Pachyderm | Determined | KServe

Deployment Guide for Google Cloud

Date/Revision: February 23, 2024

This guide will walk you through the steps of deploying the PDK components to Google Cloud.

Reference Architecture

The installation will be performed on the following hardware:

  • 3x e2-standard-16 CPU-based nodes (16 vCPUs, 64GB RAM, 1000GB SSD)
  • 2x n1-standard-16 GPU-based nodes (4x NVIDIA T4 GPUs, 16 vCPUs, 60GB RAM, 1000GB SSD)

The 3 CPU-based nodes will be used to run the services for all 3 products, and the MLDM pipelines. The GPU-based nodes will be used to run MLDE experiments.

The following software versions will be used for this installation:

  • Python: 3.8 and 3.9
  • Kubernetes (K8s): latest supported (currently 1.27)
  • Postgres: 13
  • MLDE (Determined.AI): latest (currently 0.28.1)
  • MLDM (Pachyderm): latest (currently 2.8.4)
  • KServe: 0.12.0-rc0 (Quickstart Environment)

PS: some of the commands used here are sensitive to the version of the product(s) listed above.

Prerequisites

To follow this documentation you will need:

  • The following applications, installed and configured on your computer:
    • kubectl
    • docker (you'll need docker desktop or similar to create and push images)
    • git (to clone the repository with the examples)
    • gcloud (make sure it's initialized and logged in; basic client configuration is out of scope for this doc)
    • helm
    • jq
    • openssl (to generate a random password for the MLDE admin)
    • pachctl (the MLDM command line client)
    • det (the MLDE command line client)
  • Access to a Google Cloud account
  • A Project in Google Cloud, where your user has the following roles:
    • Cloud SQL Admin
    • Compute Network Admin
    • Kubernetes Engine Admin
    • Policy Tag Admin
    • Project IAM Admin
    • Role Administrator
    • Storage Admin
    • Artifact Registry Administrator
    • A Custom role, with the following assigned permissions:
      • iam.serviceAccounts.actAs
      • iam.serviceAccounts.create
      • iam.serviceAccounts.delete
      • iam.serviceAccounts.disable
      • iam.serviceAccounts.enable
      • iam.serviceAccounts.get
      • iam.serviceAccounts.getIamPolicy
      • iam.serviceAccounts.setIamPolicy

 

The lack of these permissions will cause some commands to fail. Check your permissions if you run into any issues.

 


Installing the Cluster

In this section, we will execute the following steps:

01 - Set Environment Variables

02 - Test the pre-req applications and configure the gcloud client

03 - Create the main service account and custom role

04 - Create the GKE cluster

05 - Create the GPU node pool in the cluster

06 - Create Storage buckets

07 - Create Postgres Database

08 - Create static IP for MLDM

09 - Configure security settings for MLDM - Loki

10 - Configure security settings for the MLDE GPU Node Pool

11 - Deploy KServe

12 - Create static IP for MLDE

13 - Deploy nginx, configured to use the static IP

14 - Prepare MLDE installation assets

15 - Create configuration .yaml file for MLDM and MLDE

16 - Install MLDM and MLDE using Helm

17 - Create new Ingress for MLDE

18 - Retrieve MLDM and MLDE IP addresses and configure command line clients

19 - (Optional) Test Components

20 - Prepare for PDK Setup

21 - [Optional] Configure KServe UI

22 - [Optional] Prepare Docker and the Container Registry

23 - Save data to Config Map

24 - Create Cleanup Script

There is also a list of GCP-specific Useful Commands at the bottom of the page.

IMPORTANT: These steps were created and tested on an M1 macOS computer. Some of the commands might work differently (or not at all) on other operating systems. Check the command documentation for an alternative syntax if you are using a different OS.

NOTE: It's recommended to run these instructions one at a time, so you can diagnose in case of issues. The syntax for some of the commands documented here might become invalid, as new versions of these applications are released.

 

Step 1 - Set Environment Variables

You should only need to change the first block of variables.

All commands listed throughout this document must be executed in the same terminal window.

PS: Keep in mind that custom roles in Google Cloud take 7 days to be fully deleted, and a new role cannot reuse the name of an existing role, even one that has been deleted. Effectively, you cannot reuse a role name for 7 days after deleting it. Because of that, we add a dynamic suffix to the GSA_ROLE_NAME variable, so you can reinstall the cluster immediately without running into errors when creating the role.

# MODIFY THESE VARIABLES
export PROJECT_ID="your-google-cloud-project-id"
export NAME="your-name-pdk"
# Role names cannot have spaces, special characters or dashes.
export GSA_ROLE="your_gsa_role_name"

# Create dynamic suffix for the role name
export ROLE_SUFFIX=$(openssl rand -base64 12 | tr -dc A-Za-z0-9 | head -c5)
export GSA_ROLE_NAME="${GSA_ROLE}_${ROLE_SUFFIX}"


# These can be modified as needed
export GCP_REGION="us-central1"
export GCP_ZONE="us-central1-c"
export K8S_VERSION="1.27.3-gke.100"
export KSERVE_MODELS_NAMESPACE="models"
export CLUSTER_MACHINE_TYPE="e2-standard-16"
export GPU_MACHINE_TYPE="n1-standard-16"
export SQL_CPU="2"
export SQL_MEM="7680MB"

# You should not need to modify any of these variables
export CLUSTER_NAME="${NAME}-cluster"
export MLDM_BUCKET_NAME="${NAME}-repo-mldm"
export MLDE_BUCKET_NAME="${NAME}-repo-mlde"
export LOKI_BUCKET_NAME="${NAME}-logs-gcs"
export MODEL_ASSETS_BUCKET_NAME="${NAME}-repo-models"
export CLOUDSQL_INSTANCE_NAME="${NAME}-sql"
export GSA_NAME="${NAME}-gsa"
export LOKI_GSA_NAME="${NAME}-loki-gsa"
export STATIC_IP_NAME="${NAME}-ip"
export MLDE_STATIC_IP_NAME="${NAME}-mlde-ip"
export KSERVE_STATIC_IP_NAME="${NAME}-kserve-ip"

export ROLE1="roles/cloudsql.client"
export ROLE2="roles/storage.admin"
export ROLE3="roles/storage.objectCreator"
export ROLE4="roles/container.admin"
export ROLE5="roles/containerregistry.ServiceAgent"

export SERVICE_ACCOUNT="${GSA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"
export LOKI_SERVICE_ACCOUNT="${LOKI_GSA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"
export PACH_WI="serviceAccount:${PROJECT_ID}.svc.id.goog[default/pachyderm]"
export SIDECAR_WI="serviceAccount:${PROJECT_ID}.svc.id.goog[default/pachyderm-worker]"
export CLOUDSQLAUTHPROXY_WI="serviceAccount:${PROJECT_ID}.svc.id.goog[default/k8s-cloudsql-auth-proxy]"
export MLDE_WI="serviceAccount:${PROJECT_ID}.svc.id.goog[default/determined-master-determinedai]"
export MLDE_DF_WI="serviceAccount:${PROJECT_ID}.svc.id.goog[default/default]"
export MLDE_GPU_WI="serviceAccount:${PROJECT_ID}.svc.id.goog[gpu-pool/default]"
export MLDE_KS_WI="serviceAccount:${PROJECT_ID}.svc.id.goog[${KSERVE_MODELS_NAMESPACE}/default]"

# Generate admin password for MLDE (or set your own password)
export ADMIN_PASSWORD=$(openssl rand -base64 32 | tr -dc A-Za-z0-9 | head -c16)

# Optionally, set a different password for the database:
export SQL_ADMIN_PASSWORD="${ADMIN_PASSWORD}"
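
Optionally, echo a few of the derived values to confirm the variables are set as expected before moving on (a quick sanity check using names defined above):

echo ${CLUSTER_NAME}

echo ${SERVICE_ACCOUNT}

echo ${GSA_ROLE_NAME}

echo ${CLOUDSQL_INSTANCE_NAME}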

 

Step 2 - Test the pre-req applications and configure the gcloud client

Make sure all these commands return successfully. If one of them fails, fix the issue before continuing.

kubectl version --client=true
gcloud version
helm version
pachctl version
det version
jq --version

gcloud config set project ${PROJECT_ID}
gcloud config set compute/zone ${GCP_ZONE}
gcloud config set container/cluster ${CLUSTER_NAME}
gcloud services enable container.googleapis.com
gcloud services enable sqladmin.googleapis.com
gcloud services enable storage.googleapis.com
gcloud services enable artifactregistry.googleapis.com

 

Step 3 - Create the main service account and custom role

In this step, we create the Service Account and custom role that will be used by the different services.

gcloud iam service-accounts create ${GSA_NAME}

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member="serviceAccount:${SERVICE_ACCOUNT}" \
    --role="${ROLE1}"

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member="serviceAccount:${SERVICE_ACCOUNT}" \
    --role="${ROLE2}"

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member="serviceAccount:${SERVICE_ACCOUNT}" \
    --role="${ROLE3}"

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member="serviceAccount:${SERVICE_ACCOUNT}" \
    --role="${ROLE4}"

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member="serviceAccount:${SERVICE_ACCOUNT}" \
    --role="${ROLE5}"

gcloud iam roles create ${GSA_ROLE_NAME} \
  --project=${PROJECT_ID} \
  --title=${GSA_ROLE_NAME} \
  --description="Additional permissions" \
  --stage GA \
  --permissions=storage.multipartUploads.abort,storage.multipartUploads.create,storage.multipartUploads.list,storage.multipartUploads.listParts,storage.objects.create,storage.objects.delete,storage.objects.get,storage.objects.getIamPolicy,storage.objects.list,storage.objects.update,iam.serviceAccounts.getIamPolicy,iam.serviceAccounts.setIamPolicy,iam.serviceAccounts.getAccessToken

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
 --member="serviceAccount:${SERVICE_ACCOUNT}" \
 --role="projects/${PROJECT_ID}/roles/${GSA_ROLE_NAME}"

PS: the list of permissions in the create role command must be on a single line. Be careful when copying and pasting.

Also, the create role command may return a warning message saying 'API is not enabled for permissions'. This message can be safely ignored.
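
If you want to confirm the role was created and bound to the service account, these optional commands can be used (both are standard gcloud calls):

gcloud iam roles describe ${GSA_ROLE_NAME} --project=${PROJECT_ID}

gcloud projects get-iam-policy ${PROJECT_ID} --flatten="bindings[].members" --filter="bindings.members:${SERVICE_ACCOUNT}" --format="table(bindings.role)"

The second command lists every role currently bound to the service account; you should see the five predefined roles plus the custom role.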

 

Step 4 - Create the GKE cluster

This command will create the cluster with the CPU node pool.

gcloud container clusters create ${CLUSTER_NAME} \
 	--project ${PROJECT_ID} \
 	--zone ${GCP_ZONE} \
 	--cluster-version ${K8S_VERSION} \
 	--release-channel "None" \
 	--machine-type ${CLUSTER_MACHINE_TYPE} \
 	--image-type "COS_CONTAINERD" \
 	--disk-type="pd-ssd" \
  --disk-size "1000" \
 	--metadata disable-legacy-endpoints=true \
 	--service-account ${SERVICE_ACCOUNT} \
 	--num-nodes "3" \
 	--logging=SYSTEM,WORKLOAD \
 	--monitoring=SYSTEM \
 	--enable-ip-alias \
 	--network "projects/${PROJECT_ID}/global/networks/default" \
 	--subnetwork "projects/${PROJECT_ID}/regions/us-central1/subnetworks/default" \
 	--no-enable-intra-node-visibility \
 	--default-max-pods-per-node "220" \
 	--enable-autoscaling \
 	--min-nodes "3" \
 	--max-nodes "6" \
 	--location-policy "BALANCED" \
 	--security-posture=standard \
 	--workload-vulnerability-scanning=disabled \
  --enable-master-authorized-networks \
  --master-authorized-networks 0.0.0.0/0 \
 	--addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver,GcpFilestoreCsiDriver \
 	--no-enable-autoupgrade \
 	--enable-autorepair \
 	--max-surge-upgrade 1 \
 	--max-unavailable-upgrade 0 \
 	--enable-shielded-nodes \
  --enable-dataplane-v2 \
 	--workload-pool=${PROJECT_ID}.svc.id.goog \
 	--workload-metadata="GKE_METADATA" \
 	--node-locations ${GCP_ZONE} \
  --tags pdk

This process will take several minutes. The output message will show the cluster configuration. You can also check the status of the provisioning in the Google Cloud Console.
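
You can also track the provisioning operation from the command line (optional); look for the CREATE_CLUSTER operation targeting this cluster and wait for its status to change to DONE:

gcloud container operations list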

 

Step 5 - Create the GPU node pool in the cluster

The configuration used here will provision 4 GPUs per node. You can change it to count=2 or count=1, as needed.

gcloud container node-pools create "gpu-pool" \
	--project ${PROJECT_ID} \
	--cluster ${CLUSTER_NAME} \
	--zone ${GCP_ZONE} \
	--node-version ${K8S_VERSION} \
	--machine-type ${GPU_MACHINE_TYPE} \
	--accelerator type=nvidia-tesla-t4,count=4 \
	--image-type "COS_CONTAINERD" \
	--disk-type="pd-ssd" \
  --disk-size "1000" \
	--node-labels nodegroup-role=gpu-worker \
	--metadata disable-legacy-endpoints=true \
  --node-taints nvidia.com/gpu=present:NoSchedule \
	--num-nodes "1" \
	--enable-autoscaling \
	--min-nodes "1" \
	--max-nodes "4" \
	--location-policy "BALANCED" \
	--enable-autoupgrade \
	--enable-autorepair \
	--max-surge-upgrade 1 \
	--max-unavailable-upgrade 0 \
  --scopes=storage-full,cloud-platform \
	--node-locations ${GCP_ZONE} \
  --tags pdk

This can take several minutes to complete. If it takes more than 1 hour, the client will time out. If that happens, track the progress of the provisioning process through the Google Cloud web console.

Once the GPU node pool is provisioned, all nodes should show up as ready in the console:


After the cluster is created, configure your kubectl context:

gcloud container clusters get-credentials ${CLUSTER_NAME}

At this point, you should be able to run kubectl get nodes to see the list of nodes in the cluster. Make sure this is working before continuing.

Depending on your environment, you might need to grant additional permissions to your kubernetes user. Run this command to make sure you won't run into permissions errors:

kubectl create clusterrolebinding cluster-admin-binding --clusterrole=cluster-admin --user=$(gcloud config get-value account)

 

Step 6 - Create Storage buckets

We'll create 4 storage buckets: 1 for MLDE, 2 for MLDM and 1 to store models for KServe.

gsutil mb -l ${GCP_REGION} gs://${MLDM_BUCKET_NAME}

gsutil mb -l ${GCP_REGION} gs://${LOKI_BUCKET_NAME}

gsutil mb -l ${GCP_REGION} gs://${MLDE_BUCKET_NAME}

gsutil mb -l ${GCP_REGION} gs://${MODEL_ASSETS_BUCKET_NAME}

 

Step 7 - Create Postgres Database

Use this command to provision a cloud Postgres database:

gcloud sql instances create ${CLOUDSQL_INSTANCE_NAME} \
  --database-version=POSTGRES_13 \
  --cpu=${SQL_CPU} \
  --memory=${SQL_MEM} \
  --zone=${GCP_ZONE} \
  --availability-type=ZONAL \
  --storage-size=50GB \
  --storage-type=SSD \
  --storage-auto-increase \
  --root-password=${SQL_ADMIN_PASSWORD}

PS: If you want to use Postgres 14, additional configuration steps will be needed, because the default password encryption was changed between versions. Make sure to check the documentation for additional steps.

Once the instance is available, create the databases for MLDM and MLDE:

gcloud sql databases create pachyderm -i "${CLOUDSQL_INSTANCE_NAME}"

gcloud sql databases create dex -i "${CLOUDSQL_INSTANCE_NAME}"

gcloud sql databases create determined -i "${CLOUDSQL_INSTANCE_NAME}"

Finally, save the database connection string to an environment variable:

export CLOUDSQL_CONNECTION_NAME=$(gcloud sql instances describe ${CLOUDSQL_INSTANCE_NAME} --format=json | jq ."connectionName")

echo $CLOUDSQL_CONNECTION_NAME
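
Optionally, confirm that the instance is running and that all three databases were created:

gcloud sql instances describe ${CLOUDSQL_INSTANCE_NAME} --format="value(state)"

gcloud sql databases list -i ${CLOUDSQL_INSTANCE_NAME}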

 

Step 8 - Create static IP for MLDM

Create a static IP to be used by MLDM and save it to an environment variable.

gcloud compute addresses create ${STATIC_IP_NAME} --region=${GCP_REGION}

export STATIC_IP_ADDR=$(gcloud compute addresses describe ${STATIC_IP_NAME} --region=${GCP_REGION} --format=json --flatten=address | jq '.[]' )

echo $STATIC_IP_ADDR

 

Step 9 - Configure security settings for MLDM - Loki

In this step, we create a service account for the MLDM - Loki service. Also, we'll bind some MLDE and MLDM services to the main service account (so they can access the DB and the storage bucket).

gcloud iam service-accounts create ${LOKI_GSA_NAME}

gcloud iam service-accounts keys create "${LOKI_GSA_NAME}-key.json" --iam-account="$LOKI_SERVICE_ACCOUNT"

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member="serviceAccount:${LOKI_SERVICE_ACCOUNT}" \
    --role="${ROLE2}"

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member="serviceAccount:${LOKI_SERVICE_ACCOUNT}" \
    --role="${ROLE3}"

kubectl -n default create secret generic loki-service-account --from-file="${LOKI_GSA_NAME}-key.json"

gcloud iam service-accounts add-iam-policy-binding ${SERVICE_ACCOUNT} \
    --role roles/iam.workloadIdentityUser \
    --member "${PACH_WI}"

gcloud iam service-accounts add-iam-policy-binding ${SERVICE_ACCOUNT} \
    --role roles/iam.workloadIdentityUser \
    --member "${SIDECAR_WI}"

gcloud iam service-accounts add-iam-policy-binding ${SERVICE_ACCOUNT} \
    --role roles/iam.workloadIdentityUser \
    --member "${CLOUDSQLAUTHPROXY_WI}"

gcloud iam service-accounts add-iam-policy-binding ${SERVICE_ACCOUNT} \
    --role roles/iam.workloadIdentityUser \
    --member "${MLDE_WI}"

gcloud iam service-accounts add-iam-policy-binding ${SERVICE_ACCOUNT} \
    --role roles/iam.workloadIdentityUser \
    --member "${MLDE_DF_WI}"

 

Step 10 - Configure security settings for the MLDE GPU Node Pool

First, we need to deploy the GPU daemonset. Without it, your nodes will show 0 allocatable GPUs:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

This will take a couple of minutes to take effect. Run kubectl get nodes and then kubectl describe node <node_name> on one of the GPU nodes. Look for the Allocatable section; if you don't see an nvidia.com/gpu: 4 entry in that list, wait a few seconds and check again. Do not continue until the GPUs are listed as allocatable.

You can also use this command to list allocatable GPUs per node:

kubectl describe nodes  |  tr -d '\000' | sed -n -e '/^Name/,/Roles/p' -e '/^Capacity/,/Allocatable/p' -e '/^Allocated resources/,/Events/p'  | grep -e Name  -e  nvidia.com  | perl -pe 's/\n//'  |  perl -pe 's/Name:/\n/g' | sed 's/nvidia.com\/gpu:\?//g'  | sed '1s/^/Node Available(GPUs)  Used(GPUs)/' | sed 's/$/ 0 0 0/'  | awk '{print $1, $2, $3}'  | column -t

For the MLDE setup, we'll configure the GPU nodes to be in a separate Resource Pool. This requires a new namespace for the GPU nodes, as experiments will run as pods in that namespace (that will then be bound to the GPU nodes). We will need to grant permissions for the service accounts in both default and gpu-pool namespaces, so experiments, notebooks and other tasks can save and read checkpoint files from the storage bucket. The service account for MLDE is created by the installer, so we will set those permissions once MLDE is deployed. For now, run these commands to grant bucket access permissions:

kubectl create ns gpu-pool

kubectl annotate serviceaccount default \
  -n default \
  iam.gke.io/gcp-service-account=${SERVICE_ACCOUNT}

kubectl annotate serviceaccount default \
  -n gpu-pool \
  iam.gke.io/gcp-service-account=${SERVICE_ACCOUNT}

gcloud iam service-accounts add-iam-policy-binding ${SERVICE_ACCOUNT} \
    --role roles/iam.workloadIdentityUser \
    --member "${MLDE_GPU_WI}"

 

Step 11 - Deploy KServe

KServe is a standard Model Inference Platform on Kubernetes, built for highly scalable use cases. It provides performant, standardized inference protocol across ML frameworks, including PyTorch, TensorFlow and Keras. Additionally, KServe provides features such as automatic scaling, monitoring, and logging, making it easy to manage deployed models in production. Advanced features, such as canary rollouts, experiments, ensembles and transformers are also available. For more information on KServe, please visit the official KServe documentation.

Installing KServe is straightforward because we are using the Quick Start. Naturally, this is only an option for test or demo environments:

curl -s "https://raw.githubusercontent.com/kserve/kserve/master/hack/quick_install.sh" | bash

After running this command, wait about 10 minutes for all the services to be properly initialized.
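
To track the progress, you can watch the pods in the namespaces created by the Quick Start script (the namespace names below assume the default quick_install.sh layout):

kubectl get pods -n cert-manager

kubectl get pods -n istio-system

kubectl get pods -n knative-serving

kubectl get pods -n kserve

Wait until all pods in these namespaces are Running before continuing.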

 

Step 12 - Create static IP for MLDE

gcloud compute addresses create ${MLDE_STATIC_IP_NAME} --region=${GCP_REGION}

export MLDE_STATIC_IP_ADDR=$(gcloud compute addresses describe ${MLDE_STATIC_IP_NAME} --region=${GCP_REGION} --format=json --flatten=address | jq '.[]' )

echo $MLDE_STATIC_IP_ADDR

 

Step 13 - Deploy nginx, configured to use the static IP

Nginx will be configured to listen on port 80 (instead of the default 8080 used by MLDE).

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx

helm repo update

helm upgrade --install -n ingress-system --create-namespace ingress-nginx ingress-nginx/ingress-nginx \
  --set controller.service.loadBalancerIP=${MLDE_STATIC_IP_ADDR}

PS: This could take a couple of minutes. Run kubectl -n ingress-system get svc and make sure that the EXTERNAL-IP column matches the static IP that was provisioned. If the field is empty (or shows Pending), investigate and fix it before continuing.

Make sure the Public IP matches the static IP you've created in the previous step:
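
For example, the following compares the service's external IP with the reserved address (the controller service name assumes the default naming used by the ingress-nginx chart):

kubectl -n ingress-system get svc ingress-nginx-controller -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

echo ${MLDE_STATIC_IP_ADDR}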


 

Step 14 - Prepare MLDE installation assets

First, we need to provision shared storage for MLDE. This will be used to provide a shared folder that can be used by Notebook users in the MLDE UI. This will allow users to save their own code and notebooks in a persistent volume.

For this exercise, we will create a 200GB disk. You can increase this capacity as needed.

First, create the disk:

gcloud compute disks create --size=200GB --zone=${GCP_ZONE} ${NAME}-pdk-nfs-disk

Next, we'll create an NFS server that uses this disk:

kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nfs-server
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      role: nfs-server
  template:
    metadata:
      labels:
        role: nfs-server
    spec:
      containers:
      - name: nfs-server
        image: gcr.io/google_containers/volume-nfs:0.8
        ports:
          - name: nfs
            containerPort: 2049
          - name: mountd
            containerPort: 20048
          - name: rpcbind
            containerPort: 111
        securityContext:
          privileged: true
        volumeMounts:
          - mountPath: /exports
            name: mypvc
      volumes:
        - name: mypvc
          gcePersistentDisk:
            pdName: ${NAME}-pdk-nfs-disk
            fsType: ext4
EOF

Next, create a Service to expose the disk:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: nfs-server
spec:
  ports:
    - name: nfs
      port: 2049
    - name: mountd
      port: 20048
    - name: rpcbind
      port: 111
  selector:
    role: nfs-server
EOF

Because Persistent Volume Claims are namespace-bound objects, and we'll have 2 namespaces (default, where CPU jobs will run, and gpu-pool, where GPU jobs will run), we'll need two Persistent Volume Claims, tied to 2 Persistent Volumes. We'll create the PVs as ReadWriteMany to ensure concurrent access by the different PVCs.

Run this command to create the PV and PVC for the default namespace:

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs
spec:
  capacity:
    storage: 200Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: nfs-server.default.svc.cluster.local
    path: "/"

---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pdk-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 200Gi
EOF

Now run this command to create the PV and PVC for the gpu-pool namespace:

kubectl -n gpu-pool apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-gpu
spec:
  capacity:
    storage: 200Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: nfs-server.default.svc.cluster.local
    path: "/"

---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pdk-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 200Gi
EOF

 

Step 15 - Create configuration .yaml file for MLDM and MLDE

This command will create a .yaml file that you can review in a text editor.

cat <<EOF > helm_values.yaml
deployTarget: "GOOGLE"

pachd:
  enabled: true
  lokiDeploy: true
  lokiLogging: true
  storage:
    google:
      bucket: "${MLDM_BUCKET_NAME}"
  serviceAccount:
    additionalAnnotations:
      iam.gke.io/gcp-service-account: "${SERVICE_ACCOUNT}"
    create: true
    name: "pachyderm"
  worker:
    serviceAccount:
      additionalAnnotations:
        iam.gke.io/gcp-service-account: "${SERVICE_ACCOUNT}"
      create: true
      name: "pachyderm-worker"

cloudsqlAuthProxy:
  enabled: true
  connectionName: ${CLOUDSQL_CONNECTION_NAME}
  serviceAccount: "${SERVICE_ACCOUNT}"
  resources:
    requests:
      memory: "500Mi"
      cpu:    "250m"

postgresql:
  enabled: false

global:
  postgresql:
    postgresqlHost: "cloudsql-auth-proxy.default.svc.cluster.local."
    postgresqlPort: "5432"
    postgresqlSSL: "disable"
    postgresqlUsername: "postgres"
    postgresqlPassword: "${SQL_ADMIN_PASSWORD}"

loki-stack:
  loki:
    env:
    - name: GOOGLE_APPLICATION_CREDENTIALS
      value: /etc/secrets/${LOKI_GSA_NAME}-key.json
    extraVolumes:
      - name: loki-service-account
        secret:
          secretName: loki-service-account
    extraVolumeMounts:
      - name: loki-service-account
        mountPath: /etc/secrets
    config:
      schema_config:
        configs:
        - from: 1989-11-09
          object_store: gcs
          store: boltdb
          schema: v11
          index:
            prefix: loki_index_
          chunks:
            prefix: loki_chunks_
      storage_config:
        gcs:
          bucket_name: "${LOKI_BUCKET_NAME}"
        # https://github.com/grafana/loki/issues/256
        bigtable:
          project: project
          instance: instance
        boltdb:
          directory: /data/loki/indices
  grafana:
    enabled: false

proxy:
  enabled: true
  service:
    type: LoadBalancer
    loadBalancerIP: ${STATIC_IP_ADDR}
    httpPort: 80
    httpsPort: 443
  tls:
    enabled: false
  
determined:
  enabled: true
  detVersion: "0.28.1"
  imageRegistry: determinedai
  enterpriseEdition: false
  imagePullSecretName:
  masterPort: 8080
  createNonNamespacedObjects: true
  useNodePortForMaster: true
  defaultPassword: ${ADMIN_PASSWORD}
  db:
    hostAddress: "cloudsql-auth-proxy.default.svc.cluster.local."
    name: determined
    user: postgres
    password: ${SQL_ADMIN_PASSWORD}
    port: 5432
  checkpointStorage:
    saveExperimentBest: 0
    saveTrialBest: 1
    saveTrialLatest: 1
    type: gcs
    bucket: ${MLDE_BUCKET_NAME}
  maxSlotsPerPod: 4
  masterCpuRequest: "2"
  masterMemRequest: 8Gi
  taskContainerDefaults:
    cpuImage: determinedai/environments:py-3.8-pytorch-1.12-tf-2.11-cpu-6eceaca
    gpuImage: determinedai/environments:cuda-11.3-pytorch-1.12-tf-2.11-gpu-6eceaca
    cpuPodSpec:
      apiVersion: v1
      kind: Pod
      spec:
        containers:
          - name: determined-container
            volumeMounts:
              - name: pdk-pvc-nfs
                mountPath: /run/determined/workdir/shared_fs
        volumes:
          - name: pdk-pvc-nfs
            persistentVolumeClaim:
              claimName: pdk-pvc
    gpuPodSpec:
      apiVersion: v1
      kind: Pod
      spec:
        containers:
          - name: determined-container
            volumeMounts:
              - name: pdk-pvc-nfs
                mountPath: /run/determined/workdir/shared_fs
        volumes:
          - name: pdk-pvc-nfs
            persistentVolumeClaim:
              claimName: pdk-pvc
      metadata:
        labels:
          nodegroup-role: gpu-worker
  telemetry:
    enabled: true
  defaultAuxResourcePool: default
  defaultComputeResourcePool: gpu-pool    
  resourcePools:
    - pool_name: default
      task_container_defaults:
        cpu_pod_spec:
          apiVersion: v1
          kind: Pod
          spec:
            containers:
              - name: determined-container
                volumeMounts:
                  - name: pdk-pvc-nfs
                    mountPath: /run/determined/workdir/shared_fs
            volumes:
              - name: pdk-pvc-nfs
                persistentVolumeClaim:
                  claimName: pdk-pvc
    - pool_name: gpu-pool
      max_aux_containers_per_agent: 1
      kubernetes_namespace: gpu-pool
      task_container_defaults:
        gpu_pod_spec:
          apiVersion: v1
          kind: Pod
          spec:
            containers:
              - name: determined-container
                volumeMounts:
                  - name: pdk-pvc-nfs
                    mountPath: /run/determined/workdir/shared_fs
            volumes:
              - name: pdk-pvc-nfs
                persistentVolumeClaim:
                  claimName: pdk-pvc
            tolerations:
              - key: "nvidia.com/gpu"
                operator: "Equal"
                value: "present"
                effect: "NoSchedule"
EOF

 

Step 16 - Install MLDM and MLDE using Helm

First, download the charts for MLDM:

helm repo add pachyderm https://helm.pachyderm.com

helm repo update

Then run the installer, referencing the .yaml file you just created:

helm install pachyderm -f ./helm_values.yaml pachyderm/pachyderm --namespace default

Once the installation is complete, annotate the MLDE service accounts so they have access to the storage bucket:

kubectl annotate serviceaccount default \
  -n default \
  iam.gke.io/gcp-service-account=${SERVICE_ACCOUNT}

kubectl annotate serviceaccount determined-master-pachyderm \
  -n default \
  iam.gke.io/gcp-service-account=${SERVICE_ACCOUNT}

Give it a couple of minutes for all the services to be up and running. You can run kubectl get pods to see if any pods failed or are stuck. Wait until all pods are running before continuing.
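
One way to wait for everything in the default namespace to become ready is shown below (optional; adjust the timeout as needed):

kubectl wait --for=condition=Ready pods --all -n default --timeout=600s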

 

Step 17 - Create new Ingress for MLDE

Because we're using a static IP, we'll need to create an ingress for MLDE.

Use this command to create the new ingress:

kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mlde-ingress
  namespace: default
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "160m"  
spec:
  ingressClassName: nginx
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: determined-master-service-pachyderm
            port:
              number: 8080
EOF

 

Step 18 - Retrieve MLDM and MLDE IP addresses and configure command line clients

In this step, we'll configure the pachctl and det clients. This will be important later, as we create the project, repo and pipeline for the PDK environment.

export STATIC_IP_ADDR_NO_QUOTES=$(echo "$STATIC_IP_ADDR" | tr -d '"')

export PACH_URL="http://${STATIC_IP_ADDR_NO_QUOTES}:80"

echo "MLDM Address: http://${STATIC_IP_ADDR_NO_QUOTES}:80"

pachctl connect ${PACH_URL}

pachctl config set active-context ${PACH_URL}

At this time, you should be able to access the MLDM UI using the URL that was printed in the terminal:


A new capability of MLDM 2.8.1 is Cluster Defaults, which allows admins to set configurations that will be automatically applied to all pipelines (unless explicitly overwritten by the pipeline definition). Click the Cluster Defaults button and replace the existing configuration with the following:

{
  "createPipelineRequest": {
    "resourceRequests": {
      "cpu": 1,
      "memory": "256Mi",
      "disk": "1Gi"
    },
    "datumTries" : 1,
    "parallelismSpec": {"constant": 1},
    "autoscaling" : true,
    "sidecarResourceRequests": {
      "cpu": 1,
      "memory": "256Mi",
      "disk": "1Gi"
    }
  }
}

The configuration changes we are applying will:

  • Disable retries in case of failed jobs (datumTries: 1)
  • Run each pipeline in a single pod (parallelismSpec - constant: 1)
  • Automatically delete the pod once the pipeline is completed to release the CPU (autoscaling: true)

Do keep in mind that these settings are not recommended for all environments, especially Production.

Click Continue and Save to apply the changes.

 

Similar to the steps taken for MLDM, save the static IP for MLDE in an environment variable:

export MLDE_STATIC_IP_ADDR_NO_QUOTES=$(echo "$MLDE_STATIC_IP_ADDR" | tr -d '"')

echo "MLDE Address: http://${MLDE_STATIC_IP_ADDR_NO_QUOTES}:80"

export DET_MASTER=${MLDE_STATIC_IP_ADDR_NO_QUOTES}:80

echo ${ADMIN_PASSWORD}

det u login admin

(use the password that was displayed in the previous command)

Once logged in, you can run det e list, which should return an empty list. If you get an error message, check the MLDE pod and service for errors.

You should also be able to access the MLDE UI using the URL printed on the terminal. Log in as user admin, using the password generated earlier. Once logged in, check the Cluster page and make sure the GPU resources are showing up:


 

Step 19 - (Optional) Test Components

In this optional step, we can test MLDM (by creating a pipeline) and MLDE (by creating an experiment).

To test MLDM, run the following commands. They will create a new project, repo and pipeline, which will process a few images we'll download.

mkdir opencv

cd opencv

pachctl create project openCV

pachctl config update context --project openCV

pachctl create repo images

pachctl list repo

wget http://imgur.com/46Q8nDz.png

pachctl put file images@master:liberty.png -f 46Q8nDz.png

pachctl list commit images

pachctl create pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/2.6.x/examples/opencv/edges.json

wget http://imgur.com/8MN9Kg0.png

pachctl put file images@master:AT-AT.png -f 8MN9Kg0.png

wget http://imgur.com/g2QnNqa.png

pachctl put file images@master:kitten.png -f g2QnNqa.png

pachctl list commit images

pachctl create pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/2.6.x/examples/opencv/montage.json

pachctl list job

cd ..

 

At this time, you should see the OpenCV project and pipeline in the MLDM UI:


 

You should also be able to see the chunks in the storage bucket. This confirms that MLDM is able to connect to the bucket.


PS: Do not modify or delete chunks, as doing so will break data integrity.

 

To test MLDE, you'll need to download the examples from the public GitHub repository:

mkdir mlde_exp

cd mlde_exp

git clone https://github.com/determined-ai/determined.git .

 

Once the command completes, run this command to modify the ./examples/computer_vision/iris_tf_keras/const.yaml file that will be used to run the experiment:

cat <<EOF > ./examples/computer_vision/iris_tf_keras/const.yaml
name: iris_tf_keras_const
data:
  train_url: http://download.tensorflow.org/data/iris_training.csv
  test_url: http://download.tensorflow.org/data/iris_test.csv
hyperparameters:
  learning_rate: 1.0e-4
  learning_rate_decay: 1.0e-6
  layer1_dense_size: 16
  global_batch_size: 16
searcher:
  name: single
  metric: val_categorical_accuracy
  smaller_is_better: false
  max_length:
    batches: 500
entrypoint: model_def:IrisTrial
EOF

The changes we're making will reduce the global batch size and max batch length, to speed up training.

Then run this command to create the experiment:

det experiment create -f ./examples/computer_vision/iris_tf_keras/const.yaml ./examples/computer_vision/iris_tf_keras

cd ..

If this command fails, make sure the DET_MASTER environment variable is set. Keep in mind that the client can time out while it's waiting for the experiment image to be pulled. This does not mean the experiment has failed; you can still check the UI or use det e list to see the current status of the experiment.

 

Your experiment will appear under Uncategorized (we will change that for the PDK experiments). You can track the Experiment log to see if there are any issues.


 

You can also check the MLDE bucket in Google Cloud Storage to see the checkpoints that were saved:


This confirms that MLDE is able to access the Storage buckets as well.

 

Finally, go to the MLDE Home Page and click the Launch JupyterLab button. In the configuration pop-up, select the Uncategorized workspace, set the Resource Pool to gpu-pool (this is important, because the default pool has no GPUs available) and set the number of Slots (GPUs) to 1. Or set the number of slots to 0 and select the default Resource Pool to create a CPU-based notebook environment.

Click Launch to start the JupyterLab environment.

The first run should take about one minute to pull and run the image.


In the new tab, make sure the shared_fs folder is listed. In this folder, users will be able to permanently store their model assets, notebooks and other files.


PS: If the JupyterLab environment fails to load, it might be because the shared volume failed to mount. Run kubectl -n gpu-pool describe pod against the new pod to see why the pod failed to run.

 

Step 20 - Prepare for PDK Setup

These next steps will help us verify that KServe is working properly, and they will also set up some prerequisites for the PDK flow (specifically, the step where models are deployed to KServe).

A deeper explanation of the PDK flow is provided in the main deployment page; for now, let's make sure KServe is working as expected.

First, create a new namespace that will be used to serve models (through KServe):

kubectl create namespace ${KSERVE_MODELS_NAMESPACE}

Next, we will test KServe by deploying a sample model. This can be done by running the following command:

kubectl apply -n ${KSERVE_MODELS_NAMESPACE} -f - <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
EOF

Give it a minute and check the status of the InferenceService:

kubectl get inferenceservices sklearn-iris -n ${KSERVE_MODELS_NAMESPACE}

It should go from Unknown to Ready, which means the deployment was successful.
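
Alternatively, you can wait on the Ready condition directly (optional):

kubectl wait --for=condition=Ready inferenceservice/sklearn-iris -n ${KSERVE_MODELS_NAMESPACE} --timeout=300s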


Next, check the IP address for the Ingress:

kubectl get svc istio-ingressgateway -n istio-system

export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')

export SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -n ${KSERVE_MODELS_NAMESPACE} -o jsonpath='{.status.url}' | cut -d "/" -f 3)

echo $INGRESS_HOST

echo $INGRESS_PORT

echo $SERVICE_HOSTNAME

Make sure the command output includes a public IP. Fix any issues before continuing.


Next, we'll create a simple input file that we can use to test this model (by generating a prediction):

cat <<EOF > "./iris-input.json"
{
  "instances": [
    [6.8,  2.8,  4.8,  1.4],
    [6.0,  3.4,  4.5,  1.6]
  ]
}
EOF

Then, use this command to generate the prediction:

curl -v \
-H "Content-Type: application/json" \
-H "Host: ${SERVICE_HOSTNAME}" \
http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict \
-d @./iris-input.json

What we're looking for in the output is a status code of 200 (success) and a JSON payload with a list of values:


Make sure you get a valid response before continuing, as model deployments will fail if KServe is not properly set up.

The last part of this step covers some housekeeping tasks that set the stage for the PDK flow.

First, we create a secret that will store variables that will be used by both MLDM pipelines and MLDE experiments.

cat <<EOF > "./pipeline-secret.yaml"
apiVersion: v1
kind: Secret
metadata:
  name: pipeline-secret
stringData:
  det_master: "${MLDE_STATIC_IP_ADDR_NO_QUOTES}:80"
  det_user: "admin"
  det_password: "${ADMIN_PASSWORD}"
  pac_token: ""
  pachd_lb_service_host: "${STATIC_IP_ADDR_NO_QUOTES}"
  pachd_lb_service_port: "80"
  kserve_namespace: "${KSERVE_MODELS_NAMESPACE}"
EOF

A more detailed explanation of these attributes:

  • det_master: The address of the MLDE instance. Instead of using a URL, you can also point it to the master service running in the default namespace (determined-master-service-pachyderm).
  • det_user: MLDE user that will create experiments and pull models.
  • det_password: Password for the user specified above
  • pac_token: For the Enterprise version of Pachyderm, create an authentication token for a user. Otherwise, if you use the community edition, leave it blank.
  • kserve_namespace: Namespace where MLDM will deploy models to

 

This will be used by the MLDM pipelines (that will then map the variables to the MLDE experiment):

kubectl apply -f pipeline-secret.yaml
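
For reference, here is a hedged sketch of how a pipeline definition might consume these values as environment variables through the transform.secrets field of the MLDM pipeline spec. The pipeline name, input repo, image and command are placeholders; only the secret name and keys reflect the secret created above:

{
  "pipeline": {"name": "example-train"},
  "input": {"pfs": {"repo": "data", "glob": "/"}},
  "transform": {
    "image": "<your-training-image>",
    "cmd": ["python", "train.py"],
    "secrets": [
      {"name": "pipeline-secret", "env_var": "DET_MASTER", "key": "det_master"},
      {"name": "pipeline-secret", "env_var": "DET_USER", "key": "det_user"},
      {"name": "pipeline-secret", "env_var": "DET_PASSWORD", "key": "det_password"}
    ]
  }
}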

 

Next, the MLDM Worker service account (which will be used to run the pods that contain the pipeline code) needs to gain access to the new 'models' namespace, or it won't be able to deploy models there.

First, create the necessary ClusterRole and ClusterRoleBinding:

kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kserve-inf-service-role
  namespace: ${KSERVE_MODELS_NAMESPACE}
  labels:
    app: kserve-inf-app
rules:
- apiGroups: ["serving.kserve.io"]
  resources: ["inferenceservices"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kserve-inf-service-role
subjects:
- kind: ServiceAccount
  name: pachyderm-worker
  namespace: default
EOF

For the next step, the model deployments will need to download assets from the storage buckets. Since these models will run in the new 'models' namespace, the default service account in that namespace needs to be granted permissions to the bucket:

kubectl annotate serviceaccount default \
  -n ${KSERVE_MODELS_NAMESPACE} \
  iam.gke.io/gcp-service-account=${SERVICE_ACCOUNT}

gcloud iam service-accounts add-iam-policy-binding ${SERVICE_ACCOUNT} \
    --role roles/iam.workloadIdentityUser \
    --member "${MLDE_KS_WI}"

 

Finally, create dummy credentials to allow access to the MLDM repo through the S3 protocol.

kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: pach-kserve-creds
  namespace: ${KSERVE_MODELS_NAMESPACE}
  annotations:
    serving.kserve.io/s3-endpoint: pachd.default:30600
    serving.kserve.io/s3-usehttps: "0"
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "blahblahblah"
  AWS_SECRET_ACCESS_KEY: "blahblahblah"
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pach-deploy
  namespace: ${KSERVE_MODELS_NAMESPACE}
  annotations:
    serving.kserve.io/s3-endpoint: pachd.default:30600
    serving.kserve.io/s3-usehttps: "0"
secrets:
- name: pach-kserve-creds
EOF

 

Step 21 - [Optional] Configure KServe UI

The quick installer we used for KServe does not include a UI to see the deployments. We can optionally deploy one, using the instructions described in this step.

We'll deploy the UI to the same namespace that is used to deploy the models (${KSERVE_MODELS_NAMESPACE})

First, we need to create the necessary roles, service accounts, etc. Run this command to setup the necessary permissions:

kubectl apply -f - <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: models-webapp-sa
  namespace: ${KSERVE_MODELS_NAMESPACE}
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: models-webapp-limited
  namespace: ${KSERVE_MODELS_NAMESPACE}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: models-controller
rules:
- apiGroups: ["*"]
  resources: ["namespaces"]
  verbs: ["get", "watch", "list"]
- apiGroups: ["serving.kserve.io"]
  resources: ["*"]
  verbs: ["*"]
- apiGroups: ["serving.knative.dev"]
  resources: ["*"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: namespace-viewer
rules:
- apiGroups: ["*"]
  resources: ["namespaces"]
  verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: models-viewer
  namespace: ${KSERVE_MODELS_NAMESPACE}
rules:
- apiGroups: ["serving.kserve.io", "serving.knative.dev"]
  resources: ["*"]
  verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: control-models
subjects:
- kind: ServiceAccount
  name: models-webapp-sa
  namespace: ${KSERVE_MODELS_NAMESPACE}
roleRef:
  kind: ClusterRole
  name: models-controller
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: view-models
  namespace: ${KSERVE_MODELS_NAMESPACE}
subjects:
- kind: ServiceAccount
  name: models-webapp-limited
  namespace: ${KSERVE_MODELS_NAMESPACE}
roleRef:
  kind: Role
  name: models-viewer
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: view-namespaces
  namespace: ${KSERVE_MODELS_NAMESPACE}
subjects:
- kind: ServiceAccount
  name: models-webapp-limited
  namespace: ${KSERVE_MODELS_NAMESPACE}
roleRef:
  kind: ClusterRole
  name: namespace-viewer
  apiGroup: rbac.authorization.k8s.io
EOF

Then, create a static IP for the KServe UI:

gcloud compute addresses create ${KSERVE_STATIC_IP_NAME} --region=${GCP_REGION}

export KSERVE_STATIC_IP_ADDR=$(gcloud compute addresses describe ${KSERVE_STATIC_IP_NAME} --region=${GCP_REGION} --format=json --flatten=address | jq '.[]' )

echo $KSERVE_STATIC_IP_ADDR

Next, create the deployment and the service using this command:

kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: models-webapp
  namespace: ${KSERVE_MODELS_NAMESPACE}
spec:
  replicas: 1
  selector:
    matchLabels:
      app: models-webapp
  template:
    metadata:
      labels:
        app: models-webapp
    spec:
      serviceAccountName: models-webapp-sa
      containers:
      - name: models-webapp
        image: us-central1-docker.pkg.dev/dai-dev-554/pdk-registry/pdk_kserve_webapp:1.0
        env:
        - name: APP_SECURE_COOKIES
          value: "False"
        - name: APP_DISABLE_AUTH
          value: "True"
        - name: APP_PREFIX
          value: "/"
        command: ["gunicorn"]
        args:
        - -w
        - "3"
        - --bind
        - "0.0.0.0:8080"
        - "--access-logfile"
        - "-"
        - "entrypoint:app"
        resources:
          limits:
            memory: "1Gi"
            cpu: "500m"
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: model-webapp-service
  namespace: ${KSERVE_MODELS_NAMESPACE}
  labels:
    app: kserve-webapp
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  loadBalancerIP: ${KSERVE_STATIC_IP_ADDR}
  selector:
    app: models-webapp
  ports:
  - port: 8080
    targetPort: 8080
EOF

PS: If you would like to build your own image, this GitHub page contains the source:
https://github.com/kserve/models-web-app/tree/master

Next, get the URL for the KServe UI:

export KSERVE_UI_IP=$(kubectl -n ${KSERVE_MODELS_NAMESPACE} get svc model-webapp-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

export KSERVE_UI_URL="http://${KSERVE_UI_IP}:8080/"

echo $KSERVE_UI_URL

You can access the URL to see the deployed model (make sure to select the correct namespace).


 

Step 22 - [Optional] Prepare Docker and the Container Registry

The samples provided here already contain images you can use for training and deployment. This step is only necessary if you want to build your own images. In this case, you will find the Dockerfiles for each example in this repository.

First, make sure Docker Desktop is running.

Since each PDK use case will likely need to use specific images, a registry will be required to host these. In this tutorial, we will use Google Artifact Registry, but you can use any other alternative.

Run this command to configure Docker authentication for the Artifact Registry host in your region:

gcloud auth configure-docker ${GCP_REGION}-docker.pkg.dev

Go to the Google Artifact Registry UI and create a new repository. Once the repository is created, the path to the repository will be:

  <region>-docker.pkg.dev/<project_id>/<repository_name>  
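
If you prefer the command line, a repository can also be created with gcloud; the repository name pdk-registry below is an example that matches the REGISTRY_URL used later in this step:

gcloud artifacts repositories create pdk-registry --repository-format=docker --location=${GCP_REGION} --description="PDK images"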


Since the cluster is also running on GCP with your credentials, the images will be accessible by the PDK components even if they are private. However, they can also be made public by granting the Artifact Registry Reader permission to allUsers. Do keep in mind that this will make the images public to anyone on the internet.
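
For example, public read access could be granted from the command line like this (assuming the pdk-registry repository name; this makes every image in the repository readable by anyone):

gcloud artifacts repositories add-iam-policy-binding pdk-registry --location=${GCP_REGION} --member="allUsers" --role="roles/artifactregistry.reader"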


PS: when pushing images to GCP, it's a good idea to prefix them with your name, to avoid confusion with other users' images.

Next, set the registry path as a variable and use these commands to download the busybox image from Docker Hub and push it to GCP:

export REGISTRY_URL=${GCP_REGION}-docker.pkg.dev/${PROJECT_ID}/pdk-registry

docker pull busybox:latest

docker tag busybox:latest ${REGISTRY_URL}/busybox

docker push ${REGISTRY_URL}/busybox

You can also see the new image in your Docker Desktop dashboard:


 

You can also check the Artifact Registry UI in the Google Cloud Console (you might need to search for it) and make sure the new image is there:


 

 

Step 23 - Save data to Config Map

Now that all components are installed, we need a location to place some of the variables we've been using for the deployment. This config map can be used when configuring the PDK flows.

Create the configuration file:

cat <<EOF > ./pdk-config.yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: pdk-config
  namespace: default
data:
  region: "${GCP_REGION}"
  mldm_bucket_name: "${MLDM_BUCKET_NAME}"
  mldm_host: "${STATIC_IP_ADDR_NO_QUOTES}"
  mldm_port: "80"
  mldm_url: "${PACH_URL}"
  mldm_pipeline_secret: "pipeline-secret"
  mlde_bucket_name: "${MLDE_BUCKET_NAME}"
  mlde_host: "${MLDE_STATIC_IP_ADDR_NO_QUOTES}"
  mlde_port: "80"
  mlde_url: "http://${MLDE_STATIC_IP_ADDR_NO_QUOTES}:80"
  kserve_ui_url: "${KSERVE_UI_URL}"
  model_assets_bucket_name: "${MODEL_ASSETS_BUCKET_NAME}"
  kserve_model_namespace: "${KSERVE_MODELS_NAMESPACE}"
  kserve_ingress_host: "${INGRESS_HOST}"
  kserve_ingress_port: "${INGRESS_PORT}"
  db_connection_string: ${CLOUDSQL_CONNECTION_NAME}
  registry_uri: "${REGISTRY_URL}"
  pdk_name: "${NAME}"
EOF

Next, create the configmap:

kubectl apply -f ./pdk-config.yaml

Once the config map is created, you can run kubectl get cm pdk-config -o yaml to verify the data.
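
Individual values can also be read back with jsonpath; for example, mldm_host is one of the keys written above:

kubectl get cm pdk-config -o jsonpath='{.data.mldm_host}'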

 

Step 24 - Create Cleanup Script

In this step, we create a script that will delete all components created as part of this installation.

cat <<EOF > ./_cleanup.sh
# DELETE CLUSTER
printf 'yes' | gcloud container clusters delete ${CLUSTER_NAME}

# Delete DB
printf 'yes' | gcloud sql instances delete ${CLOUDSQL_INSTANCE_NAME}

# Delete buckets
printf 'yes' | gcloud storage rm --recursive gs://${MLDM_BUCKET_NAME}
printf 'yes' | gcloud storage rm --recursive gs://${MLDE_BUCKET_NAME}
printf 'yes' | gcloud storage rm --recursive gs://${LOKI_BUCKET_NAME}
printf 'yes' | gcloud storage rm --recursive gs://${MODEL_ASSETS_BUCKET_NAME}

# Delete Static IPs
printf 'yes' | gcloud compute addresses delete ${STATIC_IP_NAME}
printf 'yes' | gcloud compute addresses delete ${MLDE_STATIC_IP_NAME}
printf 'yes' | gcloud compute addresses delete ${KSERVE_STATIC_IP_NAME}

# Delete Role
printf 'yes' | gcloud iam roles delete ${GSA_ROLE_NAME} --project ${PROJECT_ID}

# Delete Shared Disk
printf 'yes' | gcloud compute disks delete --zone=${GCP_ZONE} ${NAME}-pdk-nfs-disk

# Delete Service Accounts
printf 'yes' | gcloud iam service-accounts delete ${GSA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com
printf 'yes' | gcloud iam service-accounts delete ${LOKI_GSA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com

EOF

chmod +x _cleanup.sh

When it's time to clean up your environment, just run:

./_cleanup.sh

 

GCP - Useful Commands

Creating folders in the GCP bucket

GCP doesn't allow empty folders in buckets, and the MLDM pipeline will fail if the folder doesn't exist, so we'll create the folders with dummy files to make sure the bucket is ready for the pipelines:

echo "hello world" > helloworld.txt

gsutil cp helloworld.txt gs://${MODEL_ASSETS_BUCKET_NAME}/dogs-and-cats/config/hello.txt

gsutil cp helloworld.txt gs://${MODEL_ASSETS_BUCKET_NAME}/dogs-and-cats/model-store/hello.txt

 

Retrieve MLDE Admin Password

The MLDE admin password is stored in a secret, with base64 encoding. Use this command to retrieve the decoded password value:

kubectl get secret pipeline-secret -o jsonpath="{.data.det_password}" | base64 --decode

 


 

The installation steps are now completed. At this time, you have a working cluster, with MLDM, MLDE and KServe deployed.

Next, return to the main page to go through the steps to prepare and deploy the PDK flow for the dogs-and-cats demo.