The simulator evaluates different scheduling policies in GPU-sharing clusters. It includes the Fragmentation Gradient Descent (FGD) policy proposed in the USENIX ATC 2023 paper "Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent", along with other baseline policies (e.g., Best-fit, Dot-product, GPU Packing, GPU Clustering, Random-fit).
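To give a feel for what a fragmentation-aware policy does, here is a toy sketch. It is not the simulator's implementation and not the paper's exact definition: it assumes each task needs a fraction of a single GPU, ignores CPU and memory, and uses hypothetical names (`fragmentation`, `pick_placement`) and made-up numbers. It treats free GPU capacity that is too small for a given task type as "fragmented" and picks the placement that increases the popularity-weighted fragmentation the least.

```python
# Toy model: each node is a list of per-GPU free capacity in [0.0, 1.0].
# All names and numbers here are illustrative assumptions.


def fragmentation(gpu_free, demand):
    """Free GPU capacity that a task needing `demand` of one GPU cannot use."""
    return sum(free for free in gpu_free if free < demand)


def expected_fragmentation(gpu_free, workload):
    """Popularity-weighted fragmentation over (demand, popularity) task types."""
    return sum(pop * fragmentation(gpu_free, demand) for demand, pop in workload)


def pick_placement(nodes, demand, workload):
    """Return (node, gpu_index) with the smallest fragmentation increase, or None."""
    best_delta, best_choice = None, None
    for name, gpu_free in nodes.items():
        before = expected_fragmentation(gpu_free, workload)
        for i, free in enumerate(gpu_free):
            if free < demand:
                continue  # this GPU cannot host the task
            # Hypothetically place the task on GPU i and measure the change.
            after = gpu_free[:i] + [free - demand] + gpu_free[i + 1:]
            delta = expected_fragmentation(after, workload) - before
            if best_delta is None or delta < best_delta:
                best_delta, best_choice = delta, (name, i)
    return best_choice


# Made-up cluster: node-a has one idle GPU and one half-used GPU.
nodes = {"node-a": [1.0, 0.5], "node-b": [1.0, 1.0]}
# Made-up workload: half the tasks need 0.5 GPU, half need a full GPU.
workload = [(0.5, 0.5), (1.0, 0.5)]

choice = pick_placement(nodes, 0.5, workload)
print(choice)
```

In this toy setup the half-GPU task lands on the already half-used GPU of `node-a`, which keeps the fully idle GPUs available for full-GPU tasks.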
Step 1: Init and Run Experiments | Step 2: Result Analysis |
---|---|
Please ensure that Go is installed.
`go mod vendor` installs the dependencies required for the simulator.
$ go mod vendor
`make` generates the compiled binary files in the `bin` directory.
$ make
To save time on environment setup, we provide a Docker image with Golang 1.20.4, Python 3.10.11, and the required libraries installed.
The image also contains a copy of the GitHub repo under the home directory with the executable binary (`bin/simon`) already compiled, so the `go mod vendor` and `make` commands are not needed inside the container.
If you are not familiar with Docker, please refer to the official installation guide for Linux, Mac, or Windows. Otherwise, the following commands are for your reference.
# step 1: pull image
sudo docker pull qzweng/kubernetes-scheduler-simulator:atc23
# step 2: launch the docker container
sudo docker run -d --name=kss qzweng/kubernetes-scheduler-simulator:atc23 bash -c "sleep infinity"
# step 3: execute commands inside the container
sudo docker exec -it kss bash
# step 4: go to the project folder and conduct experiments
cd ~/kubernetes-scheduler-simulator
The following example will schedule 6 pods to a cluster with 2 nodes, and the expected output will show the allocation ratio of each resource dimension (CPU, memory, GPU). The default scheduling policy is fragmentation gradient descent (FGD).
$ bin/simon apply --extended-resources "gpu" \
-f example/test-cluster-config.yaml \
-s example/test-scheduler-config.yaml
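For readers unsure what "allocation ratio" means here, the following is a minimal sketch that assumes it is the total amount requested by the scheduled pods divided by the total cluster capacity, computed per resource dimension. The `Resources` type, its units, and all numbers are made up for illustration; they are not taken from `example/test-cluster-config.yaml`, and the simulator's actual output format may differ.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Resources:
    """Resource amounts; field names and units are illustrative assumptions."""
    cpu_milli: int    # CPU in millicores
    memory_mib: int   # memory in MiB
    gpu_milli: int    # GPU in thousandths of a device (1000 = one full GPU)


def allocation_ratio(nodes, pods):
    """Return allocated / capacity for each resource dimension."""
    ratios = {}
    for f in fields(Resources):
        capacity = sum(getattr(n, f.name) for n in nodes)
        allocated = sum(getattr(p, f.name) for p in pods)
        ratios[f.name] = allocated / capacity if capacity else 0.0
    return ratios


# Made-up cluster: 2 identical nodes, each with 32 cores, 128 GiB, 4 GPUs.
nodes = [Resources(32000, 131072, 4000)] * 2
# Made-up workload: 6 pods, each requesting 4 cores, 8 GiB, half a GPU,
# all assumed to be scheduled successfully.
pods = [Resources(4000, 8192, 500)] * 6

ratios = allocation_ratio(nodes, pods)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1%}")
```

With these made-up numbers, CPU and GPU are both 37.5% allocated, while memory is much less utilized.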
Install the required Python dependencies.
$ pip install -r requirements.txt
- Please refer to the README under the `data` directory to prepare production traces.
- Then refer to the README under the `experiments` directory to reproduce the results reported in the paper.
Please cite our paper if it is helpful to your research.
@inproceedings{FGD2023,
title = {Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent},
author = {Qizhen Weng and Lingyun Yang and Yinghao Yu and Wei Wang and Xiaochuan Tang and Guodong Yang and Liping Zhang},
booktitle = {2023 {USENIX} Annual Technical Conference},
year = {2023},
series = {{USENIX} {ATC} '23},
url = {https://www.usenix.org/conference/atc23/presentation/weng},
publisher = {{USENIX} Association},
}
Our simulator is built on open-simulator by Alibaba, a simulator used for cluster capacity planning. This repository primarily evaluates the performance of different scheduling policies on production traces. The GPU-related plugin has been merged into the main branch of open-simulator.
- Add a minikube running example to demonstrate how the simulator schedules pods in a real Kubernetes cluster.