Frequent termination of scale set pods #3544

sjm-ho · 2024-05-26T06:26:17Z

Hey ARC community, I am running a GitHub Actions build that uses ephemeral runner scale sets on my Kubernetes clusters. The clusters are integrated with CAST AI, so node provisioning is handled by CAST AI itself.

We were originally using spot instances for the runner pods and have since switched to on-demand nodes, but we are still facing issues like:

[publish_image (test, false)](https://github.com/headout/magellan/actions/runs/9241155685/job/25422290263) The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.

AND

The operation was canceled.

These happen without any apparent reason. I would like a workaround or fix to stabilise the ARC runners, since this is causing frequent build failures and making our builds unreliable.

nikola-jokic · 2024-05-27T11:16:26Z

Hey @sjm-ho,

Shutdown signals usually happen when a node does not have enough resources, so the pod gets evicted or its container gets killed. To avoid this scenario, please set resource requests and limits on the runner container. I cannot speak to the actual reason for termination in your case, but in my experience that is the usual cause.
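
As a rough illustration of the requests and limits suggestion, here is a minimal Python sketch that builds a runner pod template fragment with requests equal to limits, which puts the pod in the Guaranteed QoS class (the last class the kubelet evicts under node pressure). The `template.spec.containers` layout follows the gha-runner-scale-set chart's values file as far as I know; the function name `runner_template` and the CPU and memory figures are made up for illustration and need to be sized for the actual workload.

```python
import json


def runner_template(cpu: str, memory: str) -> dict:
    """Build a values fragment whose runner container has requests == limits.

    Hypothetical helper: the name and the sizes passed in are illustrative only.
    """
    resources = {
        "requests": {"cpu": cpu, "memory": memory},
        "limits": {"cpu": cpu, "memory": memory},
    }
    return {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "runner",
                        "image": "ghcr.io/actions/actions-runner:latest",
                        "command": ["/home/runner/run.sh"],
                        "resources": resources,
                    }
                ]
            }
        }
    }


if __name__ == "__main__":
    # Placeholder sizes (assumption): 2 CPUs and 4Gi of memory per runner.
    print(json.dumps(runner_template("2", "4Gi"), indent=2))
```

Since JSON is valid YAML, the printed fragment should be mergeable into a values file for the scale set, though that is worth checking against the chart version in use.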

It is unlikely that the shutdown signal is happening without any legitimate reason. I would suggest digging into the kubelet logs on the affected nodes and figuring out what is happening.
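
Alongside the kubelet logs, cluster events can be a quicker first look, because ephemeral runner pods are usually gone by the time anyone investigates. The sketch below filters the JSON produced by `kubectl get events -A -o json` for a few event reasons that often accompany terminations. The reason list (`Evicted`, `Killing`, `SystemOOM`) is an assumed starting point, not an exhaustive one, and the function name and the sample event are hypothetical.

```python
import json

# Assumed starting set of event reasons worth a closer look; extend as needed.
SUSPECT_REASONS = frozenset({"Evicted", "Killing", "SystemOOM"})


def suspicious_events(events_json: str, reasons=SUSPECT_REASONS) -> list:
    """Return (timestamp, kind, name, reason, message) for matching events."""
    items = json.loads(events_json).get("items", [])
    hits = []
    for ev in items:
        if ev.get("reason") in reasons:
            obj = ev.get("involvedObject", {})
            hits.append((
                ev.get("lastTimestamp"),
                obj.get("kind"),
                obj.get("name"),
                ev.get("reason"),
                ev.get("message"),
            ))
    return hits


if __name__ == "__main__":
    # Invented sample input for illustration only, not real cluster output.
    sample = json.dumps({"items": [{
        "reason": "Evicted",
        "message": "example message",
        "lastTimestamp": "2024-05-26T06:00:00Z",
        "involvedObject": {"kind": "Pod", "name": "example-runner-pod"},
    }]})
    for hit in suspicious_events(sample):
        print(*hit, sep=" | ")
```

In practice the input would come from saving the `kubectl` output to a file and passing its contents to `suspicious_events`. Events expire after a while, so this only helps shortly after a failed job.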
