One of the most challenging problems in the popular orchestration framework Kubernetes is assigning sufficient resources to containers so that they operate at the required level, while avoiding the excessive resource allocation that can delay other jobs in the cluster. A variety of heuristic approaches have been proposed to tackle this problem, but they require considerable manual adjustment, which is laborious. Reinforcement learning approaches have also been proposed, but they do not consider the energy consumption of the cluster, an important component of the problem given the commitments of large cloud operators to carbon neutrality. We propose Smart-Kube, a system that achieves a target utilization on nodes while keeping energy consumption at a reasonable level. An experimental framework is designed on top of real-world Kubernetes clusters, and real-world traces of container jobs are used to evaluate the framework. Experimental results show that, for a variety of cluster sizes, Smart-Kube can approach the target utilization and reduce energy consumption in a variety of ways depending on the preferences of the cluster operator.
- Download the source code from GitHub:
  git clone https://github.com/saeid93/smart-scheduler
- Download and install miniconda.
- Create a conda virtual environment:
  conda create --name smartscheduler python=3
- Activate the conda environment:
  conda activate smartscheduler
- If you want to use GPUs, make sure you have the correct versions of CUDA and cuDNN installed from here. Alternatively, you can check cudnn-compatibility to find compatible versions and install CUDA and cuDNN with conda from cudatoolkit and cudnn, respectively. Make sure the versions of Python, CUDA, cuDNN, and TensorFlow in your conda environment are compatible.
- Use the PyTorch or TensorFlow installation manual to install one of them, based on your preference.
- Install the following system packages:
  sudo apt install cmake libz-dev
- Install the Python requirements:
  pip install -r requirements.txt
- Set up TensorBoard monitoring (see the example below).
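For example, assuming the training logs are written somewhere under the data folder (the exact log directory below is an assumption, not something this README specifies), TensorBoard can be launched with:
tensorboard --logdir data/results --port 6006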
If you want to run the real-world Kubernetes experiments from the paper, you should also complete the following steps.
There are several options for setting up a Kubernetes cluster. The code in this repo can connect to the cluster through the Python client API as long as the kubeconfig path (e.g. ~/.kube/config) is specified in your config files.
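As a minimal sketch of such a connection, assuming the official Kubernetes Python client (the kubernetes package) is installed; this is plain client usage, not the repo's own wrapper:

from kubernetes import client, config

# Load credentials from the default kubeconfig (~/.kube/config).
config.load_kube_config()

# List the nodes of the connected cluster as a quick sanity check.
core = client.CoreV1Api()
for node in core.list_node().items:
    print(node.metadata.name)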
We used Google Cloud Platform for our experiments. You can find the tutorials for creating the cluster on Google Cloud and locally in the following:
If you want to train, check out the tips for training.
The code is separated into three modules:
- data: the folder containing all the configs and results of the project. It can be placed anywhere on your disk.
- smart-scheduler: the core simulation library with an OpenAI Gym interface.
- experiments: the experiments of the paper and the reinforcement learning side of the code.
3.1. smart-scheduler
- src: the folder containing the smart-scheduler simulators. This must be installed before use.
Go to the smart-scheduler folder and install the library in editable mode with:
pip install -e .
3.2. data
Link the data folder (it can be placed anywhere on your hard disk) to the project. A sample of the data folder is available at data.
Go to experiments/utils/constants.py and set the paths to your data and project folders in that file. For example:
DATA_PATH = "/Users/saeid/Codes/smart-scheduler/data"
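A minimal sketch of what that file might contain; only DATA_PATH appears in this README, and the other names below are illustrative assumptions:

import os

# Only DATA_PATH is confirmed by this README; the rest is hypothetical.
DATA_PATH = "/Users/saeid/Codes/smart-scheduler/data"
PROJECT_PATH = "/Users/saeid/Codes/smart-scheduler"
CONFIGS_PATH = os.path.join(DATA_PATH, "configs")
CLUSTERS_PATH = os.path.join(DATA_PATH, "clusters")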
3.3. experiments
3.3.1. Dataset Preprocessing
3.3.1.1. Arabesque
3.3.2. Data Generation
The cluster and workloads are generated in the following order:
- Clusters: nodes, services, their capacities, requested resources, and their initial placements.
- Workloads: the workload for each cluster, which determines the resource usage at each time step. Workloads are built on top of the clusters created in step 1, and each cluster can have several workloads.
To generate the clusters, workloads, networks, and traces, first go to your data folder (remember that the data folder can be anywhere on your disk; just point to it in experiments/utils/constants.py).
3.3.2.1. Generating the clusters
Go to your cluster generation config folder data/configs/generation-configs/cluster-generation/, make a folder named after your config, and put a config.json in that folder; e.g. see my-cluster in the sample data folder at data/configs/generation-configs/cluster-generation/my-cluster/config.json. Then run experiments/cluster/generate_cluster.py with the following script:
python generate_cluster.py [OPTIONS]
Options:
--cluster-config-folder TEXT config-folder
[default: my-cluster]
For a full list of config.json parameter options, see cluster-configs-options. The results will be saved in data/clusters/<cluster_id>.
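For illustration only, a cluster-generation config might look like the sketch below (shown as a Python dict); every field name here is an assumption rather than the actual schema, which is documented in cluster-configs-options:

# Hypothetical cluster config; field names are illustrative assumptions.
example_cluster_config = {
    "num_nodes": 4,                                  # nodes in the cluster
    "num_services": 10,                              # services to place
    "node_capacity": {"cpu": 8, "memory": 16384},    # per-node capacity
    "service_request": {"cpu": 1, "memory": 2048},   # per-service request
    "seed": 42,                                      # reproducible generation
}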
3.3.2.2. Generating the workloads
Go to your workload generation config folder data/configs/generation-configs/workload-generation, make a folder named after your config, and put a config.json in that folder; e.g. see my-workload in the sample data folder at data/configs/generation-configs/workload-generation/my-workload/config.json. Then run generate_workload.py with the following script:
python generate_workload.py [OPTIONS]
Options:
--workload-config-folder TEXT config-folder
[default: my-workload]
For a full list of config.json parameter options, see workload-configs-options. The results will be saved in data/clusters/<cluster_id>/<workload_id>.
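To make the notion of a workload concrete, the sketch below shows one plausible shape for the generated data: the usage of each resource by each service at every time step. The shape and value ranges are illustrative assumptions, not the repo's actual file format:

import numpy as np

# Hypothetical workload array: usage per time step, per service, per resource,
# expressed as fractions of the requested resources.
timesteps, num_services, num_resources = 100, 10, 2  # resources: cpu, memory
rng = np.random.default_rng(42)
workload = rng.uniform(0.1, 0.9, size=(timesteps, num_services, num_resources))
print(workload[0])  # usage of every service at the first time step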
3.3.3. Training the agent
- Change the training parameters in <configs-path>/real/<experiment-folder>/config_run.json. For more information about the hyperparameters in this JSON file, see the hyperparameter guide.
- To train the environments, go to the parent folder and run the following command:
python experiments/learning/learners.py --mode real --local-mode false --config-folder PPO --type-env 0 --cluster-id 0 --workload-id 0 --use-callback true
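For illustration, the contents of config_run.json might resemble the sketch below (rendered as a Python dict); every key here is an assumption — the --local-mode flag hints at a Ray RLlib-style PPO setup, but the actual schema is described in the hyperparameter guide:

# Hypothetical config_run.json contents; keys are illustrative assumptions.
example_config_run = {
    "run_or_experiment": "PPO",   # RL algorithm to train
    "learn_config": {
        "gamma": 0.99,            # discount factor
        "lr": 3e-4,               # learning rate
        "train_batch_size": 4000, # samples per training iteration
    },
    "stop": {"timesteps_total": 1000000},  # training stop condition
}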
3.3.4. Analysis
3.3.4.1. check_env
3.3.4.2. check_learned
3.3.4.3. test_baselines
3.3.5. Kubernetes interface
The Kubernetes interface is designed based on the Kubernetes API version 1.
The main operations that are currently implemented are:
- creating
  - cluster
  - utilisation server
  - pods
- actions
  - scheduling pods to nodes
  - moving pods (not used in this work)
  - deleting pods
  - cleaning namespace
- monitoring
  - get nodes resource usage
  - get pods resource usage
A sample of using the interface can be found here.
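As a rough sketch of the kind of calls such an interface wraps (written against the official Kubernetes Python client directly; this is not the repo's own API), scheduling a pod onto a chosen node and then deleting it looks like this:

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Pin a pod to a specific node by setting node_name, which bypasses the
# default scheduler; this emulates the "scheduling pods to nodes" action.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="s0", namespace="smart-scheduler"),
    spec=client.V1PodSpec(
        node_name="node-0",
        containers=[client.V1Container(name="s0", image="nginx")],
    ),
)
core.create_namespaced_pod(namespace="smart-scheduler", body=pod)

# "Moving" a pod amounts to deleting it and recreating it on another node.
core.delete_namespaced_pod(name="s0", namespace="smart-scheduler")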
Log of a running emulation - moving service 0 from node 1 to node 0 (s0n1 -> s0n0)
Google Cloud console of a running emulation - moving service 0 from node 1 to node 0 (s0n1 -> s0n0)