ASOC 2022: Optimizes the pods recovery efficiency when edge nodes restart #858

Closed
rambohe-ch opened this issue May 27, 2022 · 10 comments
Labels: asoc2022 (Alibaba Summer of Code, 2022)

Comments

@rambohe-ch
Member

rambohe-ch commented May 27, 2022

Describe your problem

When the cloud-edge network is disconnected in an OpenYurt cluster, workload Pods currently take about 1 minute to recover after an edge node restarts. The goal is to improve Pod restart efficiency and reduce the recovery time to 30s.

Additional context

This issue is part of our Alibaba Summer of Code (2022) Program

Difficulty: Hard
Mentor: helinbo (@rambohe-ch)

@rambohe-ch added the asoc2022 (Alibaba Summer of Code, 2022) label on May 27, 2022
@Congrool
Member

Should the difficulty be "hard"?

@rambohe-ch
Member Author

@Congrool Thank you for pointing out the mistake. I've fixed it.

@Sodawyx
Contributor

Sodawyx commented Jun 10, 2022

Some ideas for solving this problem:

  • When a node restarts, the process can be described roughly as follows: 1) the OS restarts, 2) the Kubernetes components restart, 3) the OpenYurt components (YurtHub / yurt-tunnel-agent) restart.
  • The optimization should focus on step 3 (the OpenYurt components).
  • The tests can be arranged as follows: 1) measure the restart time of the OS and of the Kubernetes components; 2) add log points and measure the restart time of the OpenYurt components; 3) trace the detailed process, locate the responsible code blocks, and optimize them.

There is also an open question:

  • The restart time depends on the hardware of the edge node. Computing performance and network status may both affect it, so the initial tests should be run in a reasonable, representative environment.
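
A minimal sketch of the measurement plan above, assuming the start and end of each phase can be read from log timestamps (in seconds). The phase names and the helper functions are hypothetical, and the numbers in the usage note are placeholders, not measurements from this issue.

```python
from typing import Dict, Tuple

# Hypothetical phase names, following the three restart steps listed above.
PHASES = ("os", "kubernetes", "openyurt")


def phase_durations(marks: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Return the seconds spent in each phase from its (start, end) timestamps."""
    durations = {}
    for name in PHASES:
        start, end = marks[name]
        if end < start:
            raise ValueError(f"phase {name!r} ends before it starts")
        durations[name] = end - start
    return durations


def slowest_phase(durations: Dict[str, float]) -> str:
    """Name the phase that took longest, i.e. the first candidate to optimize."""
    return max(durations, key=durations.get)
```

For example, calling `phase_durations` with placeholder marks such as `{"os": (0, 20), "kubernetes": (20, 35), "openyurt": (35, 60)}` and passing the result to `slowest_phase` would point at the phase worth profiling first. Real phases may overlap, so this is only a first approximation.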

@rambohe-ch
Member Author

@Sodawyx Thank you very much for your ideas. I agree that the optimization should focus on the restart time of the OpenYurt components. By the way, Yurthub runs as a static pod and should start before the other components, because kubelet needs to list pod metadata from local disk through Yurthub.

@rambohe-ch
Member Author

rambohe-ch commented Jun 13, 2022

@Sodawyx Thank you very much for your ideas. I agree that the optimization should focus on the restart time of the OpenYurt components. By the way, Yurthub runs as a static pod and should start before the other components, because kubelet needs to list pod metadata from local disk through Yurthub, so I think optimizing only this part is not that hard.
How about adding another optimization item to this task, such as improving Yurthub cache efficiency?
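
To illustrate why kubelet depends on Yurthub at startup, here is a conceptual sketch of a "cloud first, local disk as fallback" lookup. It is not Yurthub's actual code (Yurthub is written in Go); `list_pods`, `fetch_remote`, and `cache_file` are hypothetical names used only for this example.

```python
import json
from pathlib import Path
from typing import Callable, List


def list_pods(fetch_remote: Callable[[], List[dict]], cache_file: Path) -> List[dict]:
    """Return pod metadata from the cloud if reachable, else from the local cache."""
    try:
        pods = fetch_remote()
    except OSError:
        # Cloud-edge network is down: fall back to the copy saved on local disk.
        if not cache_file.exists():
            return []
        return json.loads(cache_file.read_text())
    # Cloud answered: refresh the on-disk copy for the next offline restart.
    cache_file.write_text(json.dumps(pods))
    return pods
```

In this sketch, the time spent reading and decoding the on-disk copy is on the critical path of an offline restart, which is why cache efficiency is a natural follow-up topic.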

@Congrool
Member

@rambohe-ch
We are working on the pool-coordinator cache for the upcoming v0.8.0, so the cache structure of Yurthub will change significantly. If we optimize cache efficiency on top of v0.7.0 during ASOC, before v0.8.0 is released, there will be many conflicts when merging the two features. So I suggest we optimize cache efficiency after v0.8.0, or at least after the nodepool-governance capability has been merged.

@rambohe-ch
Member Author

@Congrool Thanks for your suggestion. Agreed, we can optimize cache efficiency after the pool-coordinator work.

@Sodawyx
Contributor

Sodawyx commented Jun 15, 2022

I think that is reasonable, and I will investigate how to optimize Yurthub. Specifically, I will focus on how kubelet lists pod metadata from local disk through Yurthub.

@rambohe-ch
Member Author

  1. Compared with native Kubernetes, measure how much time OpenYurt takes between kubelet startup and all Pods starting up when an edge node restarts.
  2. Analyze where the startup time is spent, then work out how to optimize it and implement the solution.
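
The comparison in step 1 could be scripted roughly as below. This is a sketch under assumptions: the kubelet start time and each Pod's Ready time are supplied as UTC timestamp strings in the style Kubernetes uses in object status, and `recovery_seconds` and `meets_target` are hypothetical helper names.

```python
from datetime import datetime
from typing import Iterable

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # timestamp style used in Kubernetes object status


def recovery_seconds(kubelet_start: str, pod_ready_times: Iterable[str]) -> float:
    """Seconds from kubelet startup until the last Pod became Ready."""
    start = datetime.strptime(kubelet_start, TIME_FORMAT)
    last_ready = max(datetime.strptime(t, TIME_FORMAT) for t in pod_ready_times)
    return (last_ready - start).total_seconds()


def meets_target(seconds: float, target: float = 30.0) -> bool:
    """Check a measured recovery time against the 30s goal stated in this issue."""
    return seconds <= target
```

Running the same measurement once on a native Kubernetes node and once on an OpenYurt edge node gives two numbers whose difference approximates the overhead added by the OpenYurt components.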

@rambohe-ch
Member Author

Fixed by #930
