The purpose of this repository is to train an agent with reinforcement learning so that it can outperform a standard Horizontal Pod Autoscaler (HPA) by allocating resources in a smarter way. This is done by looking at multiple metrics instead of solely relying on CPU utilisation.
The microservice architecture used here is built upon GoogleCloudPlatform/microservices-demo v0.5.2. On top of that repository, an RL agent has been developed to take scaling decisions and actually scale the services of the Online Boutique out or in when running in Kubernetes.
What follows is a guide to deploying the agent in a Google Kubernetes Engine (GKE) cluster and locally using minikube. The local development environment is used mainly to test the interaction between the agent, the Prometheus API and the Kubernetes API before deploying to the "production" environment on GKE. The GKE cluster is then used to check the actual behaviour of the algorithm in a more realistic environment and to compare the standard HPA with the intelligent autoscaler.
So take your cup of coffee or tea and let's begin! ☕️
In the cloud platform, I deployed the Kubernetes cluster, which consists of a master node and multiple worker nodes that host all the pods needed by the Online Boutique microservice application. Each pod contains one or more Docker containers, represented by a box with rounded corners in the figure. For example, the RL agent container contains the code that uses the model on the cluster once it is trained, while Envoy represents the sidecar proxy injected by Istio into every pod, next to the Server container. Each Server represents a microservice packaged into a container, for example productcatalog, frontend, etc. Each pod is assigned to a node by the kube-scheduler based on resource availability, and the whole Online Boutique application runs in the same namespace.
The communication between microservices occurs through the service mesh enabled by Istio. Specifically, Envoy proxies monitor each server and publish this data on a metrics endpoint. Both Prometheus and Grafana are deployed as pods inside the cluster. Every 5 s, Prometheus scrapes the metrics from Istio, which collects them from each Envoy proxy in the cluster. These metrics are stored in the Prometheus time-series database and are queried by Grafana and the RL agent. With Grafana, we can visualise the rate of change of each metric over time in a dashboard, and download this data to our local development environment to process it and make our measurements.
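For those measurements, the Prometheus HTTP API can also be queried directly instead of exporting data from Grafana. The snippet below is a minimal sketch (not part of the repository's code): it assumes Prometheus has been port-forwarded to `localhost:9090`, e.g. with `kubectl port-forward -n istio-system svc/prometheus 9090:9090`, and uses an illustrative Istio request-rate query.

```python
"""Sketch: pull a metric range from Prometheus for offline analysis."""
import time

import requests

PROM_URL = "http://localhost:9090"  # assumption: port-forwarded Prometheus
# Illustrative PromQL query: request rate towards the frontend workload.
QUERY = 'sum(rate(istio_requests_total{destination_workload="frontend"}[1m]))'

end = time.time()
start = end - 3600  # last hour

resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    params={"query": QUERY, "start": start, "end": end, "step": "5s"},
    timeout=10,
)
resp.raise_for_status()

# Each series is a list of (timestamp, value) pairs ready for further processing.
for series in resp.json()["data"]["result"]:
    for ts, value in series["values"]:
        print(ts, value)
```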
The scaling action is performed by the RL agent container based on the observed state of the cluster. In fact, the agent queries Prometheus to receive aggregated pod metrics, which represent the state of the environment. Here, the agent is deployed in a cloud environment that is similar to a real-world scenario, where scalability, availability and security are the main concerns.
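As a rough illustration of that observe/decide/act loop (this is not the repository's agent code), the sketch below reads a couple of aggregated metrics from Prometheus and then patches the Deployment's scale subresource through the official Kubernetes Python client. The Prometheus address, the PromQL queries and the `choose_replicas` placeholder are assumptions made for the example; the real agent substitutes its trained RL policy for the placeholder and relies on the RBAC role applied later in this guide.

```python
"""Minimal observe/decide/act sketch for a single autoscaling step."""
import requests
from kubernetes import client, config

# Assumption: Istio's Prometheus service, reachable from inside the cluster.
PROM_URL = "http://prometheus.istio-system:9090"


def get_metric(query: str) -> float:
    """Run an instant PromQL query and return the first value (0.0 if empty)."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


def choose_replicas(state: dict) -> int:
    """Placeholder for the trained RL policy."""
    return 2


def step(deployment: str = "frontend", namespace: str = "rl-agent") -> None:
    # Observe: aggregated metrics describing the current state (example queries).
    state = {
        "request_rate": get_metric(
            f'sum(rate(istio_requests_total{{destination_workload="{deployment}"}}[1m]))'
        ),
        "replicas": get_metric(
            f'kube_deployment_status_replicas{{deployment="{deployment}"}}'
        ),
    }
    # Act: patch the Deployment's scale subresource with the chosen replica count.
    config.load_incluster_config()  # uses the pod's service account credentials
    client.AppsV1Api().patch_namespaced_deployment_scale(
        deployment, namespace, {"spec": {"replicas": choose_replicas(state)}}
    )


if __name__ == "__main__":
    step()
```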
⚠️ Cloud computing comes with a cost 💸 Using GKE, you will be charged for the resources you use. The main costs will come from the following resources:
- Compute Engine, for a VM instance
- Filestore, for an NFS server instance
- Kubernetes Engine, for node utilisation
- Container Registry, for container images storage
Picture of the infrastructure.
First things first, you have to create a project with your Google account on Google Cloud Platform (GCP) as indicated in the official documentation.
Next, the following components are needed to have everything working on GKE. Specifically, we will connect to the Kubernetes cluster with a virtual machine and manage our pods and the agent from there. This VM will be used to run commands like `kubectl` to interact with the cluster and `git pull` to get the latest changes from the repository when developing.
- Create a VM instance with at least 2 vCPUs, 4GB of memory, 20GB of disk and Ubuntu as OS.
- SSH into the machine from your local computer or from the browser. You can use `gcloud compute ssh --zone "<your_zone>" "<vm_instance_name>" --project "<your_project_name>"` or connect directly from the user interface in GCP.
- gcloud SDK: follow these instructions to install the SDK, then:
  - Install the package to authenticate with GKE: `sudo apt-get install google-cloud-cli-gke-gcloud-auth-plugin`
  - Install the Kubernetes command line tool, if not already present, with `sudo apt-get install kubectl`
- GitHub CLI: follow these instructions to install `gh`, which is necessary to authenticate to GitHub and to push and pull changes to the repository. The complete command is reported below.
type -p curl >/dev/null || (sudo apt update && sudo apt install curl -y)
curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg | sudo dd of=/usr/share/keyrings/githubcli-archive-keyring.gpg \
&& sudo chmod go+r /usr/share/keyrings/githubcli-archive-keyring.gpg \
&& echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main" | sudo tee /etc/apt/sources.list.d/github-cli.list > /dev/null \
&& sudo apt update \
&& sudo apt install gh -y
- Docker: follow these instructions to install Docker, then run `sudo chmod 666 /var/run/docker.sock` after the installation to grant the right permissions to access the Docker socket.
A persistent disk that can be shared across all components is necessary because virtual machines and containers are by definition ephemeral and there is no guarantee that data will be available across sessions. An NFS server will be used to share storage between the VM and pods.
You can follow these guidelines to set up an NFS server with Google Filestore. On the VM:
- Change directory into the mounted volume: `cd /mnt/<dir_name>`
- Create a folder `keys/` to store the secret keys created in the following steps.
- Clone this repository with `git clone`.
⚠️ If the VM is restarted, you need to remount the volume (the address of the NFS server may have changed):
sudo mount <nfs_server_ip_address>:/vol1 /mnt/nfs-client
sudo chmod go+rw /mnt/nfs-client
This volume will be claimed by certain pods using a Persistent Volume Claim.
You can follow the official documentation to set up your standard cluster. Nodes of type `e2-standard-4` have sufficient capacity to run the application and scale it.
Now it's time to authenticate into the cluster from the VM. Run the following commands:
- Authenticate to the platform with your Google account with `gcloud auth login`
- Enter the project you created with `gcloud config set project PROJECT_ID`
- Log into the cluster with `gcloud container clusters get-credentials <cluster-name> --zone <zone_of_your_cluster>`
- You can try to create a namespace with `kubectl create namespace rl-agent`
To push Docker images to the Container Registry at `gcr.io/<your_project_name>` you will need to authenticate Docker to the Registry. You can follow the official documentation on how to authenticate, or simply run `gcloud auth configure-docker`.
To enable Kubernetes to pull the container images from your private Container Registry you have to create a JSON key file and embed it as a Kubernetes secret.
Create a Service Account with Viewer permissions and create the key file to be stored in `/keys/` with `gcloud iam service-accounts keys create KEY_FILE --iam-account=SA_NAME@PROJECT_ID.iam.gserviceaccount.com`.
Point Kubernetes to the key file to create a secret named `gcr-json-key`. You will need to run the following command for both the `default` and `rl-agent` namespaces so that both will be able to pull the needed images.
kubectl create secret docker-registry gcr-json-key \
--docker-server=gcr.io \
--docker-username=_json_key \
--docker-password="$(cat ~/keys/img_pull_key.json)" \
--docker-email=<service-account-email> \
--namespace rl-agent
Check that these secrets have been created correctly with `kubectl get secrets`.
Another key file is needed, this time for the agent to authenticate to the GKE cluster from within its pod. You can use the default service account to create a second JSON key and save it to `/keys/`.
⚠️ Security: this secret will be copied directly into the image, which is a security concern. A more secure implementation would be to create a Kubernetes secret and import it into the pod.
Monitoring will be used for two main purposes. One is to provide the agent with the information it needs about the current state of the cluster to choose an action. The other is to monitor the behaviour of the intelligent autoscaler and compare it with the standard HPA.
The components involved are Istio, an open-source service mesh framework providing monitoring and load-balancing capabilities; Prometheus, an open-source time-series database used to store the monitoring data scraped from Istio; and Grafana, an open-source project providing a monitoring dashboard connected to the database. Moreover, the Prometheus API will be queried by the agent to get the state of the environment. How the chosen action affects the environment will be visible on a Grafana dashboard.
By default, Kubernetes exposes only certain basic metrics, but the agent needs to know the current number of replicas being deployed. To enable these metrics you have to install the kube-state-metrics server. The installation is very simple:
- Download the official repository with `git clone https://github.com/kubernetes/kube-state-metrics`
- Apply its manifests for the standard installation: `kubectl apply -f kube-state-metrics/examples/standard/`

You should now see a pod with a name similar to `kube-state-metrics-65bf754b96-cvp75` when you list all pods with `kubectl get pods -A`.
The file `prometheus.yaml` instructs Prometheus to scrape from this server, among other jobs.
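Once Prometheus is running (see the next step), you can quickly verify that the replica counts exposed by kube-state-metrics reach Prometheus. The check below is a sketch that assumes Prometheus has been port-forwarded to `localhost:9090`; it queries the kube-state-metrics metric `kube_deployment_status_replicas`.

```python
"""Quick check that kube-state-metrics replica counts are visible in Prometheus."""
import requests

# Assumption: Prometheus port-forwarded locally, e.g.
#   kubectl port-forward -n istio-system svc/prometheus 9090:9090
QUERY = 'kube_deployment_status_replicas{namespace="rl-agent"}'

resp = requests.get(
    "http://localhost:9090/api/v1/query", params={"query": QUERY}, timeout=10
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    # Print each deployment in the namespace with its current replica count.
    print(series["metric"]["deployment"], "->", series["value"][1])
```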
Next install Istio 1.17.1 by following the official documentation.
curl -L https://istio.io/downloadIstio | ISTIO_VERSION=1.17.1 TARGET_ARCH=x86_64 sh -
Use the default profile with `istioctl install --set profile=default -y` and label both namespaces for the injection of the Envoy sidecar with `kubectl label namespace <namespace_name> istio-injection=enabled`. The sidecar is the component that emits the metrics scraped by Prometheus.
Prometheus, Kiali and Grafana are enabled with `kubectl apply -f microservice-demo/kubernetes-manifests/istio-system`.
Everything is ready now to deploy the Online Boutique application with all its services!
A replica of the application will be deployed in the `default` namespace and another in the `rl-agent` namespace. The agent will only be present in the latter.
- Change directory to the repository folder: `cd microservice-demo/`
- Build the Docker image for the agent: `cd src/agent` and `docker build -t agent:x.x .`, where x.x is the version of the image
- Tag the image with the path on your Container Registry, e.g. `docker tag agent:x.x gcr.io/master-thesis-hpa/agent:x.x`
- Now push the image to the Registry with `docker push gcr.io/master-thesis-hpa/agent:x.x`
- Do the same with the loadgenerator: `cd src/loadgenerator`, then `docker build`, `docker tag` and `docker push`
- Deploy the replica of the application in the `default` namespace with `skaffold run --default-repo gcr.io/master-thesis-hpa`
- Now deploy the application in the `rl-agent` namespace. Change directory with `cd kubernetes-manifests/rl-agent`
- Apply the roles that allow the agent to change the number of replicas with `kubectl apply -f agent-role.yaml` and `kubectl apply -f agent-roleBind.yaml`
- Deploy all services with `kubectl apply -f ./app`
- Apply the standard HPA configuration for the `frontend` service in the default namespace with `kubectl apply -f /autoscaling`
⚠️ Update the image version in the deployment files after every new version: both in `/kubernetes-manifests/default/loadgenerator.yaml` and `kubernetes-manifests/rl-agent/app/loadgenerator.yaml` for the load generator, and in `kubernetes-manifests/rl-agent/app/agent.yaml` for the agent.
Update the image version in `/kubernetes-manifests/loadgenerator.yaml` with the x.x version of your build and deploy the service with `kubectl apply -f default/loadgenerator.yaml` for the `default` namespace and `kubectl apply -f rl-agent/app/loadgenerator.yaml` for the other namespace.
Once deployed, both generators will simulate a load of requests to `frontend` coming from a varying number of users. These users will look for products, add them to the cart and buy them. User behaviour is randomised across all possible interactions with the website.
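For context, the upstream microservices-demo load generator is built on Locust, and the sketch below only illustrates the kind of randomised behaviour described above. The endpoints mirror the Online Boutique frontend, but the task weights, product IDs and form fields are illustrative assumptions, and this is not the repository's `workload_gen.py`.

```python
"""Locust-style sketch of randomised Online Boutique users (illustrative only)."""
import random

from locust import HttpUser, between, task

PRODUCT_IDS = ["OLJCESPC7Z", "66VCHSJNUP", "1YMWWN1N4O"]  # sample catalogue IDs


class BoutiqueUser(HttpUser):
    wait_time = between(1, 5)  # think time between user actions

    @task(4)
    def browse_product(self):
        self.client.get(f"/product/{random.choice(PRODUCT_IDS)}")

    @task(2)
    def add_to_cart(self):
        product_id = random.choice(PRODUCT_IDS)
        self.client.post("/cart", data={"product_id": product_id, "quantity": 1})

    @task(1)
    def checkout(self):
        # Hypothetical checkout form fields; a real user flow would fill these in.
        self.client.post("/cart/checkout", data={
            "email": "user@example.com",
            "street_address": "1600 Amphitheatre Pkwy",
            "zip_code": "94043",
            "city": "Mountain View",
            "state": "CA",
            "country": "USA",
            "credit_card_number": "4432-8015-6152-0454",
            "credit_card_expiration_month": "1",
            "credit_card_expiration_year": "2030",
            "credit_card_cvv": "672",
        })
```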
The agent can be deployed with `kubectl apply -f rl-agent/app/agent.yaml`. Deploying this pod will automatically start the learning phase of the algorithms, which will log and checkpoint models at `microservice-demo/src/agent/`.
You can execute commands inside the pod with `kubectl exec -it -n rl-agent agent-<id> -- /bin/sh`.
To set up the local environment with minikube:

- `minikube start`
- `minikube mount ${HOME}/Documents/microservice-demo:/app/microservice-demo` to mount the volume that will be used by the agent pod to run the code developed locally
- Open a new terminal
- Run `minikube docker-env` and `eval $(minikube -p minikube docker-env)` so that images are built inside the minikube Docker daemon
- `cd src/agent-local` and `docker build -t agent:<version> .` Remember to change the version of the image in the deployment file to deploy that image in the cluster.
- `cd src/loadgenerator` and `docker build -t loadgenerator:<version> .`
- Install Istio following the official documentation.
- `istioctl install --set profile=demo -y`; this will allow you to use Prometheus and Grafana
- `kubectl label namespace rl-agent istio-injection=enabled`
- Verify the labelling with `kubectl get namespaces --show-labels`
- `cd kubernetes-manifests/` and `kubectl apply -f ./istio-system` to deploy the Istio components
- Go to `kubernetes-manifests/rl-agent/local` and apply the manifests:
  - `kubectl apply -f kube-manifests-local.yaml`
  - `kubectl apply -f loadgenerator-local.yaml`
  - `kubectl apply -f agent-role.yaml`
  - `kubectl apply -f agent-roleBind.yaml`
  - `kubectl apply -f agent-local.yaml`
- In a new terminal, to run the load generator, execute the file `workload_gen.py` from within the pod: `kubectl exec -it -n rl-agent loadgenerator-<id> -- /bin/sh`
- In a new terminal, to run the agent code, execute `kubectl exec -it -n rl-agent agent-<id> -- /bin/sh`
Everything should be running locally by now. The monitoring part is the same as the one developed in GKE.
> [!NOTE]
> **Note about volumes.** When creating a volume, everything you modify in the pod (including deleted files) will be reflected on the host machine where the volume is created!
> [!TIP]
> **Deleting all resources.** To delete all resources in a namespace, run `kubectl delete all --all`.
Tutorials I used and other references:

- How to pull a Docker image from GCR in any non-GCP Kubernetes cluster
- Google Kubernetes Engine (GKE): Persistent Volume with Persistent Disks (NFS) on Multiple Nodes (ReadWriteMany)
- Kubernetes Monitoring: Kube-State-Metrics