This repository hosts example projects used for exploring KServe and Nvidia NIM with the goal of integrating Nvidia NIM into Red Hat OpenShift AI.
- The pocs folder hosts the various POC scenarios, implemented with Kustomize.
- The builds folder hosts manifests pre-built from the above POCs, for easy access.
All POC executions require Red Hat OpenShift AI.
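As a rough illustration of how the two folders relate (all file and directory names below are assumptions, not taken from this repository), each POC is an ordinary Kustomize overlay whose kustomization.yaml lists the KServe resources it deploys; rendering such an overlay with kustomize build is what produces the manifests stored under builds:

```yaml
# hypothetical kustomization.yaml inside a POC folder (names are illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: nim-poc                 # assumption: target namespace for the POC
resources:
  - servingruntime.yaml            # the model server runtime definition
  - inferenceservice.yaml          # the served model
```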
KServe supports three deployment modes; we explored two of them: Serverless and Raw.
Serverless Deployment, the default deployment mode for KServe, leverages Knative.
|   |   |
| --- | --- |
| Model Used | kserve-sklearnserver |
| POC Instructions | Click here |
| Built Manifests | Click here |
Key Takeaways
- The `storageUri` specification from the InferenceService is used for triggering KServe's Storage Initializer container, which downloads the model prior to runtime (see the sketch below).
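A minimal sketch of such an InferenceService, assuming the kserve-sklearnserver runtime and an illustrative storage path (the real manifests live in the linked POC folder):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-example            # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      runtime: kserve-sklearnserver
      # storageUri triggers KServe's storage initializer, which pulls the
      # model into the Pod before the runtime container starts
      storageUri: s3://my-bucket/models/sklearn   # illustrative path
```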
With Raw Deployment, KServe leverages core Kubernetes resources (Deployments, Services, and HPAs) instead of Knative.
|   |   |
| --- | --- |
| Model Used | kserve-sklearnserver |
| POC Instructions | Click here |
| Built Manifests | Click here |
Key Takeaways
- The `storageUri` specification from the InferenceService is used for triggering KServe's Storage Initializer container, which downloads the model prior to runtime.
- Annotating the InferenceService with `serving.kserve.io/deploymentMode: RawDeployment` triggers a Raw Deployment (see the sketch below).
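A minimal sketch of an equivalent InferenceService switched to Raw Deployment via that annotation (names and the storage path are illustrative):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-raw-example        # illustrative name
  annotations:
    # switches KServe from the default Serverless (Knative) mode to
    # plain Kubernetes Deployments and Services
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      runtime: kserve-sklearnserver
      storageUri: s3://my-bucket/models/sklearn   # illustrative path
```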
Prerequisites!
Before proceeding, grab your NGC API Key and create the following two secret data files (git-ignored):
The files are saved in the no-cache POC folder but are used by all scenarios in this context.
```shell
# the following will be used in an opaque secret mounted into the runtime
echo "NGC_API_KEY=ngcapikeygoeshere" > pocs/persistence-and-caching/no-cache/ngc.env

# the following will be used as the image pull secret for the underlying runtime deployment
echo "{
  \"auths\": {
    \"nvcr.io\": {
      \"username\": \"\$oauthtoken\",
      \"password\": \"ngcapikeygoeshere\"
    }
  }
}" > pocs/persistence-and-caching/no-cache/ngcdockerconfig.json
```
In this scenario, Nvidia NIM is in charge of downloading the required models; however, the target volume is not persistent, so the download occurs for every Pod created, which is reflected in scaling time.
|   |   |
| --- | --- |
| Model Used | nvidia-nim-llama3-8b-instruct |
| POC Instructions | Click here |
| Built Manifests | Click here |
Key Takeaways
- The `storageUri` specification from the InferenceService is NOT required.
- We set the `NIM_CACHE_PATH` environment variable to /mnt/models (an emptyDir volume); see the sketch below.
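A rough sketch of the relevant ServingRuntime pieces for this scenario, assuming the secret names from the prerequisites above; the image tag and any omitted fields are illustrative, and the real manifests are in the linked POC folder:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: nvidia-nim-llama3-8b-instruct          # illustrative name
spec:
  containers:
    - name: kserve-container
      image: nvcr.io/nim/meta/llama3-8b-instruct:latest   # illustrative tag
      env:
        - name: NIM_CACHE_PATH
          value: /mnt/models                    # NIM downloads models here on every Pod start
        - name: NGC_API_KEY
          valueFrom:
            secretKeyRef:
              name: nvidia-nim-secrets          # assumption: the Opaque secret from the prerequisites
              key: NGC_API_KEY
      volumeMounts:
        - name: model-cache
          mountPath: /mnt/models
  volumes:
    - name: model-cache
      emptyDir: {}                              # not persistent, so the download repeats per Pod
  imagePullSecrets:
    - name: ngc-image-pull-secret               # assumption: the dockerconfigjson secret
```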
In this scenario, Nvidia NIM is in charge of downloading the required models; the download target is a PVC. Since this scenario uses the Serverless deployment mode, the following Knative feature flags must be enabled so that Pods can mount and write to PVCs:

```yaml
kubernetes.podspec-persistent-volume-claim: "enabled"
kubernetes.podspec-persistent-volume-write: "enabled"
```
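On OpenShift, one possible place to set these flags is the KnativeServing custom resource managed by the OpenShift Serverless operator; the snippet below is only a sketch under that assumption (resource name and namespace may differ on your cluster):

```yaml
apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving          # assumption: default OpenShift Serverless namespace
spec:
  config:
    features:
      # allow Pods created by Knative to mount and write to PVCs
      kubernetes.podspec-persistent-volume-claim: "enabled"
      kubernetes.podspec-persistent-volume-write: "enabled"
```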
|   |   |
| --- | --- |
| Model Used | nvidia-nim-llama3-8b-instruct |
| POC Instructions | Click here |
| Built Manifests | Click here |
Key Takeaways
- The `storageUri` specification from the InferenceService is NOT required.
- We added a PVC using OpenShift's default gp3-csi storage class.
- We added a Volume to the ServingRuntime referencing the above-mentioned PVC.
- We added a VolumeMount to the ServingRuntime mounting the above-mentioned Volume to /mnt/nim/models.
- We set the `NIM_CACHE_PATH` environment variable to the above-mentioned /mnt/nim/models (see the sketch below).
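A rough sketch of the PVC and the ServingRuntime fragments described above (names, sizes, the image tag, and any omitted fields are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nim-model-cache               # illustrative name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3-csi           # OpenShift's default storage class, as noted above
  resources:
    requests:
      storage: 50Gi                   # assumption: sized for the model cache
---
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: nvidia-nim-llama3-8b-instruct # illustrative name
spec:
  containers:
    - name: kserve-container
      image: nvcr.io/nim/meta/llama3-8b-instruct:latest   # illustrative tag
      env:
        - name: NIM_CACHE_PATH
          value: /mnt/nim/models      # NIM writes its model cache to the PVC
      volumeMounts:
        - name: nim-cache
          mountPath: /mnt/nim/models
  volumes:
    - name: nim-cache
      persistentVolumeClaim:
        claimName: nim-model-cache    # the PVC defined above
```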
In this scenario, Nvidia NIM is in charge of downloading the required models; the download target is a PVC. Writable PVCs work with KServe's Raw Deployment without requiring the Knative feature flags mentioned above.
|   |   |
| --- | --- |
| Model Used | nvidia-nim-llama3-8b-instruct |
| POC Instructions | Click here |
| Built Manifests | Click here |
Key Takeaways
- The `storageUri` specification from the InferenceService is NOT required.
- We added a PVC using OpenShift's default gp3-csi storage class.
- We added a Volume to the ServingRuntime referencing the above-mentioned PVC.
- We added a VolumeMount to the ServingRuntime mounting the above-mentioned Volume to /mnt/nim/models.
- We set the `NIM_CACHE_PATH` environment variable to the above-mentioned /mnt/nim/models.
- Annotating the InferenceService with `serving.kserve.io/deploymentMode: RawDeployment` triggers a Raw Deployment.
- We added maxReplicas for the Predictor, which is required for using HPA (see the sketch below).
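A minimal sketch of the InferenceService for this Raw Deployment scenario, with illustrative names (the real manifests are in the linked POC folder):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: nim-llama3-8b-instruct        # illustrative name
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 2                    # required so KServe can create an HPA for the Predictor
    model:
      modelFormat:
        name: nvidia-nim-llama3-8b-instruct   # assumption: matches the ServingRuntime's supported format
      runtime: nvidia-nim-llama3-8b-instruct  # the ServingRuntime sketched above
      # no storageUri: NIM downloads the model itself into the PVC-backed cache
```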