The goal of these experiments is to verify the value of PaddlePaddle's auto-scaling mechanism, using the following metrics:
- Cluster computational resource utilization.
- Expecting higher resource utilization.
- Higher utilization means fewer resources are idle. Autoscaling is intended to maximize the overall cluster resource (CPU, GPU, memory) usage by first guaranteeing resources for production-level jobs/services, then fairly scaling the jobs that are scalable to use the resources left in the cluster.
- In this experiment, we are focusing on CPU utilization.
- Average pending time for training jobs.
- Expecting shorter pending time.
- Long pending time is a common pain point for researchers with the internal cluster.
- The shorter the pending time, the earlier developers and researchers can start seeing the training cost graphs, and the sooner they can verify the effectiveness of the training algorithm.
- In this experiment, we will verify if the auto-scaler will kill existing PaddlePaddle training processes to make room for newly submitted jobs.
- Quality of service of online services.
- When PaddlePaddle jobs are deployed alongside online services, we check whether PaddlePaddle training jobs yield resources to the more important online services when the serving load becomes intensive.
- A Kubernetes v1.6.2 cluster with 133 physical nodes.
- PaddleCloud installed from the latest develop branch.
- A medium-sized neural network model is used in the experiment.
All the jobs in the cluster are training jobs (hence the name: special-purpose cluster). This case is a typical scenario for research labs.
- Autoscaling ON/OFF.
- The number of jobs.
- The configurations for each job are the same, except:
- each autoscaling job asks for 2 - 60 trainers, and
- each non-autoscaling job asks for 60 trainers.
- The submission time of each job.
- With autoscaling turned on, submit the training jobs with a 10-second delay between submissions (as sketched below), 20 jobs in total. Repeat the experiment for 10 passes.
- With autoscaling turned off, submit the training jobs with a 10-second delay between submissions, 20 jobs in total. Repeat the experiment for 10 passes.
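The `./run.sh` script described at the end of this report automates the submission schedule. Conceptually, one pass looks like the following sketch, where `submit_training_job` is a hypothetical placeholder for the actual submission command, not part of PaddleCloud's CLI:

```bash
# One experiment pass: 20 jobs, submitted 10 seconds apart.
# submit_training_job is a hypothetical helper standing in for the real
# submission command used by run.sh.
for i in $(seq 1 20); do
    submit_training_job "job-${i}"
    sleep 10   # 10-second delay between job submissions
done
```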
In the above graph, the solid lines are for the non-autoscaling passes and the dashed lines are for the autoscaling passes.
We can see that the pending job counts for the autoscaling jobs are significantly lower than for the non-autoscaling jobs, while high CPU utilization is still maintained.
The pending job count for non-autoscaling increases after 100 seconds of the experiment and plateaus at 14 until the end. This means that the cluster's resources are exhausted once 6 jobs are running, so newly submitted jobs have to wait. Meanwhile, in the autoscaling experiment, a newly submitted job can still start even when resources are exhausted, because the auto-scaler scales existing jobs down to make room for it.
The reason we see ridges in the graph is that jobs are not deployed all at once; there is a 10-second delay between job submissions, so it takes some time for each job to actually be created. In the non-autoscaling experiment this time is simply Kubernetes' job launching time, while in the autoscaling experiment it also includes the time the auto-scaler spends killing other PaddlePaddle pods.
Also in the above graph, we can see that after the utilization stabilizes, the cluster utilization of the autoscaling jobs is slightly lower than that of the non-autoscaling jobs. We think there are two reasons:

- Computation resources are wasted while the auto-scaler is freeing up resources for newly arriving jobs.
- Autoscaling and non-autoscaling jobs have different distributions of trainer, pserver, and master pods, so the computation resources on each node are fragmented differently, leading to a stabilized utilization of around 88% for non-autoscaling jobs vs. around 86% for autoscaling jobs.
- Autoscaling ON

  | PASS | AVG PENDING TIME (seconds) | CLUSTER CPU UTILS (%) |
  |------|----------------------------|-----------------------|
  | 0    | 24                         | 75.5646               |
  | 1    | 59                         | 75.9876               |
  | 2    | 31                         | 75.0465               |
  | 3    | 63                         | 76.0976               |
  | 4    | 32                         | 76.6245               |
  | 5    | 85                         | 76.1902               |
  | 6    | 67                         | 76.3599               |
  | 7    | 45                         | 77.8456               |
  | 8    | 38                         | 76.8869               |
  | 9    | 28                         | 76.7175               |
  | AVG  | 42.9091                    | 76.3321               |

- Autoscaling OFF

  | PASS | AVG PENDING TIME (seconds) | CLUSTER CPU UTILS (%) |
  |------|----------------------------|-----------------------|
  | 0    | 319                        | 76.2028               |
  | 1    | 305                        | 75.8829               |
  | 2    | 295                        | 79.8287               |
  | 3    | 309                        | 75.0948               |
  | 4    | 315                        | 75.5644               |
  | 5    | 319                        | 75.2832               |
  | 6    | 298                        | 75.3558               |
  | 7    | 311                        | 75.578                |
  | 8    | 316                        | 76.9651               |
  | 9    | 298                        | 75.8706               |
  | AVG  | 280.455                    | 76.1626               |
Hybrid deployment with online serving and offline training jobs (hence the name: general-purpose cluster). We deploy PaddlePaddle training jobs together with Nginx web serving. This case is a typical scenario for large enterprises and internet companies.
- The number of Nginx instances, which changes over time to simulate the real-world traffic load distribution.
- Autoscaling ON/OFF.
- The number of training jobs.
- The configurations for each training job are the same, except:
- each autoscaling job asks for 2 - 60 trainers, and
- each non-autoscaling job asks for 60 trainers.
- The submission time for each training job.
- The configuration of each Nginx job.
- With autoscaling turned on, start 400 Nginx instances to simulate the number of Nginx instances required for the peak-time load.
- Start the training jobs.
- Decrease the Nginx instance count from 400 to 100 over time. This simulates the Nginx load decreasing and requiring fewer Nginx instances.
- Increase the Nginx instance count from 100 to 400 over time. This completes a full Nginx load cycle (a sketch of this load simulation follows after this list).
- Repeat the experiment for 10 passes.
- Repeat the above steps with autoscaling turned off.
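A rough sketch of how such a load cycle could be driven with kubectl is shown below. It assumes Nginx runs as a Deployment named `nginx`; the step sizes and hold times are illustrative assumptions, not the values used by `run.sh`:

```bash
# Simulate one Nginx load cycle: peak -> off-peak -> peak.
# The deployment name "nginx", the step sizes, and the sleep duration are
# assumptions for illustration; run.sh drives the real experiment.
for replicas in 400 300 200 100 200 300 400; do
    kubectl scale deployment nginx --replicas=${replicas}
    sleep 60   # hold each load level before moving to the next one
done
```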
The solid line is for non-autoscaling experiment passes and the dashed line is for autoscaling experiment passes.
The above graph shows the number of Nginx instances changing over time, simulating typical online cluster usage. Meanwhile, when auto-scaling is enabled, the PaddlePaddle trainer pod count changes in the opposite direction.
Also, when autoscaling is turned on, the cluster utilization stays high even when the online Nginx service is scaled down.
We notice that the Nginx pod count curve with autoscaling on is not as sharp as the one with autoscaling off, especially when Nginx is scaling its pods back up. This is where we still need to improve the auto-scaler's performance.
- Autoscaling ON

  | PASS | AVG PENDING TIME (seconds) | CLUSTER CPU UTILS (%) |
  |------|----------------------------|-----------------------|
  | 0    | 33                         | 83.7926               |
  | 1    | 38                         | 83.0557               |
  | 2    | 29                         | 82.8201               |
  | 3    | 22                         | 84.3083               |
  | 4    | 62                         | 82.8449               |
  | 5    | 21                         | 83.2045               |
  | 6    | 70                         | 83.0649               |
  | 7    | 69                         | 83.8079               |
  | 8    | 101                        | 83.5989               |
  | 9    | 70                         | 83.7494               |
  | AVG  | 53.55                      | 83.4247               |

  Off-peak (300s - 370s) average cluster utilization: 79.3505

- Autoscaling OFF

  | PASS | AVG PENDING TIME (seconds) | CLUSTER CPU UTILS (%) |
  |------|----------------------------|-----------------------|
  | 0    | 1                          | 62.3651               |
  | 1    | 0                          | 61.7813               |
  | 2    | 1                          | 61.6985               |
  | 3    | 0                          | 61.4403               |
  | 4    | 2                          | 61.8323               |
  | 5    | 3                          | 61.7459               |
  | 6    | 2                          | 61.5679               |
  | 7    | 2                          | 62.1981               |
  | 8    | 3                          | 61.9676               |
  | 9    | 1                          | 62.0316               |
  | AVG  | 1.5                        | 61.8629               |

  Off-peak (300s - 370s) average cluster utilization: 44.9134
We also notice an increase in average pending time when autoscaling is turned on. The reason is the gradual deployment of tasks, a mechanism that minimizes the impact on online services.
Every number and plot in the report is calculated from raw data. You can find the calculation programs and raw data here.
As shown in Case 2, in a general purpose cluster the CPU utilization increased by 34.8% ((83.4247 - 61.8629) / 61.8629) on average; during the off-peak time, the CPU utilization even surged by 76.7% ((79.3505 - 44.9134) / 44.9134).
Clearly, PaddlePaddle machine learning jobs can now run in the computing resource reservoir kept for a rainy day. When the situation gets tough, machine learning tasks scale themselves down gracefully and give resources back automatically.
As shown in Case 1, in a special purpose cluster the average pending time was reduced by 84.7% ((280.455 - 42.9091) / 280.455).
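These percentages follow directly from the per-pass averages in the result tables above; a quick way to double-check them, for example with `bc`:

```bash
# Reproduce the improvement percentages from the reported averages.
echo "scale=4; (83.4247 - 61.8629) / 61.8629 * 100" | bc   # ~34.85%: Case 2 average CPU utilization gain
echo "scale=4; (79.3505 - 44.9134) / 44.9134 * 100" | bc   # ~76.67%: Case 2 off-peak utilization gain
echo "scale=4; (280.455 - 42.9091) / 280.455 * 100" | bc   # ~84.70%: Case 1 pending time reduction
```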
As shown in Case 2, PaddlePaddle yields resources to the more important online services when their load becomes intensive.
- Configure kubectl and paddlectl on your host.
- Submit the TrainingJob controller with the YAML files:
> git clone https://github.com/PaddlePaddle/cloud.git && cd cloud
> kubectl create -f k8s/controller/trainingjob_resource.yaml
> kubectl create -f k8s/controller/controller.yaml
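Optionally, confirm that the controller pod is running before submitting test jobs (the pod name fragment "trainingjob" is an assumption here; adjust the pattern to match your deployment):
> kubectl get pods --all-namespaces | grep -i trainingjob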
Run TestCase1 or TestCase2 for several passes with the bash script `./run.sh`:
For example, run TestCase1 with autoscaling turned off, for 1 pass with 20 jobs:
> cd cloud/doc/autoscale/experiment
> TAG=round_1 AUTO_SCALING=OFF PASSES=1 JOB_COUNT=20 ./run.sh start case1
Or run the same test with auto-scaling turned on:
> cd cloud/doc/autoscale/experiment
> TAG=round_1 AUTO_SCALING=ON PASSES=1 JOB_COUNT=20 ./run.sh start case1
Or run TestCase2 with 6 jobs:
> TAG=round_1 AUTO_SCALING=ON JOB_COUNT=6 ./run.sh start case2
Note: the test output will be written to different folders (the folder name is generated based on the test configuration), so it's OK to run the tests in a loop to get multiple rounds of data:
> for i in `seq 1 2`; do echo pass $i; TAG=round_$i JOB_COUNT=6 ./run.sh start case2; done
pass 1
outputing output to folder: ./out/mnist-OFF-6-1-ON-400-case_case2-round_1
Please refer to the README.
The time series data will be appended to the file `./out/*/mnist-case[1|2]-pass[0-9].log` (the actual content of `*` depends on the test case configuration and is printed at the beginning of the run), in the following format:
0,2.11,0,3,0,0,0,0,0|0|0,0.00|0.00|0.00
2,2.11,0,3,0,0,0,0,0|0|0,0.00|0.00|0.00
4,2.11,0,3,0,0,0,0,0|0|0,0.00|0.00|0.00
5,2.11,0,2,1,0,0,0,0|0|0,0.00|0.00|0.00
7,5.30,7,2,0,1,0,0,7|0|0,3.19|0.00|0.00
9,7.90,19,2,0,1,0,0,19|0|0,5.79|0.00|0.00
10,8.11,20,2,0,1,0,0,20|0|0,6.01|0.00|0.00
The meaning of each column is:

- timestamp
- total CPU utilization
- number of running trainers
- number of not-yet-created jobs
- number of pending jobs
- number of running jobs
- number of completed jobs
- number of Nginx pods
- running trainer count for each job (values separated by `|`)
- CPU utilization for each job (values separated by `|`)
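The average cluster CPU utilization reported in the tables above, as well as the average number of pending jobs shown in the graphs, can be approximated from these logs. The following is a minimal sketch; the log path is just an example, and the column positions assume the layout described above:

```bash
# Average cluster CPU utilization and average pending job count for one pass.
# Column 2 is the total CPU utilization, column 5 is the number of pending jobs.
awk -F, '
{
    util_sum    += $2
    pending_sum += $5
    n++
}
END {
    if (n > 0)
        printf "avg CPU util: %.4f  avg pending jobs: %.2f  (%d samples)\n", util_sum / n, pending_sum / n, n
}' ./out/mnist-OFF-6-1-ON-400-case_case2-round_1/mnist-case2-pass0.log
```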