
AWX jobs can't tolerate the K8s master nodes restart or termination #13350

Closed
elibogomolnyi opened this issue Dec 19, 2022 · 21 comments

@elibogomolnyi

elibogomolnyi commented Dec 19, 2022

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.

Bug Summary

When one of the master nodes (a control plane node in the case of EKS) gets terminated or restarted, all the AWX jobs related to that node (we don't know how they are linked to the master node, but we can definitely see that they are) also get terminated. We see the "Error" status for these jobs in the UI. We checked this behavior with the following configurations:

  • kOps v1.25.5, AWX 21.1.0, 3 master nodes, 3 worker nodes, k8s-pod-network
  • kOps v1.25.5, AWX 21.10.1, 3 master nodes, 3 worker nodes, k8s-pod-network
  • kOps v1.24.8, AWX 21.10.1, 3 master nodes, 3 worker nodes, k8s-pod-network
  • EKS v1.24.7, AWX 21.1.0, EKS control plane, 3 worker nodes, Calico CNI (we had to ask AWS support to restart one of the Control Plane nodes to reproduce this issue)
  • EKS v1.24.7, AWX 21.1.0, EKS control plane, 3 worker nodes, AWS CNI (we had to ask AWS support to restart one of the Control Plane nodes to reproduce this issue)

The problem is more severe for EKS clusters, since AWS sometimes brings down the master nodes to perform package upgrades, and we can't control when that happens. As a result, whenever it does, the jobs that are somehow connected to the restarted or terminated master node are killed with "Error" and no discoverable reason.

AWX version

21.10.1

Select the relevant components

  • UI
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

Kubernetes

Modifications

no

Steps to reproduce

  1. Create a K8s cluster by kOps
  2. Install AWX on it using the operator
  3. Run some sample AWX jobs
  4. Terminate or stop one of the master nodes that is part of the auto-scaling group. The terminated master node should be running only the K8s system pods (kube-apiserver, kube-scheduler, and so on). If the AWX jobs are not terminated with the error, wait for the terminated node to be fully replaced, then repeat with another node until you reproduce the error.

Expected results

The AWX jobs keep running.

Actual results

The AWX jobs are terminated; in the UI, we can see "Error" without any logs.

Additional information

Screenshot 2022-12-19 at 12 06 31
failed_jobs_api_description.txt

@shanemcd
Member

Presumably this is due to how we use the k8s logging API to obtain the job results. We recently shipped a patch that attempts to reconnect when the log stream is terminated unexpectedly, but we haven't tested it under these conditions.
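The reconnect idea can be sketched generically as a wrapper that re-drives a log stream after a retryable failure, with bounded exponential backoff. This is an illustrative sketch only; `with_reconnect`, its parameters, and the backoff policy are hypothetical, not AWX's or Receptor's actual implementation:

```python
import time


def with_reconnect(stream_fn, max_retries=5, is_retryable=lambda exc: True,
                   sleep=time.sleep):
    """Yield items from stream_fn(), restarting it after retryable errors.

    Hypothetical helper for illustration: stream_fn is a zero-argument
    callable returning an iterator (e.g. a pod log stream).
    """
    retries = 0
    while True:
        try:
            yield from stream_fn()
            return  # stream ended normally, e.g. the job pod finished
        except Exception as exc:
            if not is_retryable(exc) or retries >= max_retries:
                raise
            retries += 1
            sleep(2 ** retries)  # back off before reconnecting
```

In the real system the stream would come from the Kubernetes pod-log API, and `is_retryable` would match connection-level failures such as the GOAWAY error discussed later in this thread. Note that a naive restart can also duplicate or drop log lines, which a real implementation has to account for.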

@AlanCoding @fosterseth From failed_jobs_api_description.txt - I can see that this is another example where we are shoving the entirety of the stdout into result_traceback, which has to be this code:

```python
lines = resultsock.readlines()
receptor_output = b"".join(lines).decode()
if receptor_output:
    self.task.runner_callback.delay_update(result_traceback=receptor_output)
```

This also makes me wonder if we might be overwriting another error that happened here:

```python
elif status_data['status'] == 'error':
    result_traceback = status_data.get('result_traceback', None)
    if result_traceback:
        self.delay_update(result_traceback=result_traceback)
```
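One defensive pattern, purely a sketch and not the actual AWX fix, would be to merge an incoming traceback into an existing one instead of overwriting it (`merge_traceback` is a hypothetical helper):

```python
def merge_traceback(existing, incoming):
    """Combine two traceback strings without losing the earlier one.

    Hypothetical helper for illustration only.
    """
    if not incoming:
        return existing
    if existing and existing != incoming:
        # keep the earlier, possibly more specific, error text
        return existing + "\n--- additional output ---\n" + incoming
    return incoming
```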

Does anyone watching this issue feel comfortable patching and building a custom AWX image? We might be able to provide some guidance on what to try. I probably won't have time to look into this myself before sometime early next year.

@shanemcd
Member

I'd like to get some clarification on something here.

Are nodes where AWX itself is running getting killed? Or just nodes where the Kubernetes API server is running?

@elibogomolnyi
Author

I'd like to get some clarification on something here.

Are nodes where AWX itself is running getting killed? Or just nodes where the Kubernetes API server is running?

Only master nodes, where the K8s API server is running.

@elibogomolnyi
Author

elibogomolnyi commented Dec 20, 2022

Does anyone watching this issue feel comfortable patching and building a custom AWX image? We might be able to provide some guidance on what to try. I probably won't have time to look into this myself before sometime early next year.

@shanemcd, I haven't built custom AWX images before. Still, since I have already checked all kinds of scenarios with different AWX and K8s versions, I think I could join you in troubleshooting, patching images, and checking the results. We are very interested in resolving this issue as soon as possible.
Could you define the scope of work and provide as much detailed guidance as possible? We will look into it with our team and think about how we can contribute to resolving this case.

@shanemcd
Member

@elibogomolnyi How long does the k8s api stay unavailable for?

In Receptor 1.3.0 we shipped this patch that attempts to recover when the log stream is unexpectedly terminated. Can you please verify that your control plane ee has this version of Receptor?

Could you define the scope of work and provide as much detailed guidance as possible?

Apologies, but you are asking for too much here. As I said before - I do not have time to look into this too deeply right now. I'm only working 2 more days before stepping away from work until sometime in early January. If you are unable to troubleshoot and resolve this problem yourself, perhaps a short-term solution would be to deploy into a distro of Kubernetes that does not have the auto-update behavior.

@elibogomolnyi
Author

elibogomolnyi commented Dec 20, 2022

@shanemcd,

How long does the k8s api stay unavailable for

After the master node is terminated, it stays unavailable for 5 minutes. It is also worth mentioning that when I terminate the master node, all AWX jobs get terminated almost immediately (about 20 seconds after the termination signal I send to the node), so it doesn't seem like any retry mechanism is working in this case.

In Receptor 1.3.0 we shipped this patch that attempts to recover when the log stream is unexpectedly terminated. Can you please verify that your control plane ee has this version of Receptor?

The Receptor version is 1.3.0+g8f8481c.
I can also see this message in the log:

Kubernetes version v1.24.8 is at least v1.24.8, using reconnect support

which means the new Receptor functionality is in use.

I'm only working 2 more days before stepping away from work until sometime in early January

I fully understand, and maybe I expressed myself poorly. You said that you might be able to provide some guidance on what to try. If you think that guidance could help someone who is not a contributor to this project and doesn't have much experience with it to resolve this issue, please let me know what we can try, and we will work on it with our team. And by the way, happy holidays!

If you are unable to troubleshoot and resolve this problem yourself, perhaps a short-term solution would be to deploy into a distro of Kubernetes that does not have the auto-update behavior.

Since the aim of our project is a migration to EKS, we can't deploy into a distro of Kubernetes that doesn't have the auto-update behavior. But we might postpone the production migration until this issue is resolved. If we could make it happen faster, we would be glad to contribute.

@fosterseth
Member

fosterseth commented Dec 21, 2022

PR to address the result_traceback bug here #12961

@elibogomolnyi
Author

elibogomolnyi commented Dec 21, 2022

@fosterseth

related PR here #12961

Could you explain how it is related? Do you think it might fix the issue caused by the master node restart?

@fosterseth
Member

@elibogomolnyi sorry, the PR I linked is for the result_traceback bug that shane pointed out

@elibogomolnyi
Author

elibogomolnyi commented Feb 2, 2023

@elibogomolnyi How long does the k8s api stay unavailable for?

In Receptor 1.3.0 we shipped this patch that attempts to recover when the log stream is unexpectedly terminated. Can you please verify that your control plane ee has this version of Receptor?

Could you define the scope of work and provide as much detailed guidance as possible?

Apologies, but you are asking for too much here. As I said before - I do not have time to look into this too deeply right now. I'm only working 2 more days before stepping away from work until sometime in early January. If you are unable to troubleshoot and resolve this problem yourself, perhaps a short-term solution would be to deploy into a distro of Kubernetes that does not have the auto-update behavior.

Hi @shanemcd, please tell me if there is anything else we can do to help to fix this bug.

@eranyo

eranyo commented Mar 12, 2023

Hi AWX community and team,
(@fosterseth @shanemcd)
I'm working with @elibogomolnyi on the same team, and we are facing this issue while running a high-traffic load of ~35K jobs.
If needed, we have a performance environment where we can easily run a high number of active jobs.
This issue is blocking us from running this high load in our production environment.

I hope to hear from you soon.

Thanks,
Eran.

@elibogomolnyi
Author

elibogomolnyi commented Apr 4, 2023

Following the conversation with @TheRealHaoLiu about this issue, we made some additional tests:
We created a kOps cluster version 1.24.12, installed the latest version of AWX on it, ran some AWX jobs, and terminated one of the kOps master nodes.
Please see the logs with our comments:
AWX logs.txt

We also checked that during the master node termination, we could still access the K8s API. We continuously ran "kubectl get nodes", and it was never interrupted, so the Kubernetes API kept working.

I am attaching the instructions for deploying the kOps cluster and AWX for full error reproduction:
Deploying the kOps.txt
AWX deployment.txt

We get the following error when one of the master nodes gets terminated:

ERROR 2023/04/04 09:26:02 [7EN3OaHV] Error reading from pod awx/automation-job-3-5wfl8: http2: server sent GOAWAY and closed the connection; LastStreamID=11, ErrCode=NO_ERROR

@TheRealHaoLiu
Member

I put up a very rough PR to test whether catching the GOAWAY error and retrying will work around this problem.

@TheRealHaoLiu
Member

TheRealHaoLiu commented Apr 4, 2023

@elibogomolnyi thanks for helping us identify the specific error we're encountering.

Here's a test image with my code change: quay.io/haoliu/awx-ee:goaway

Can you replace the control_plane_ee_image with this image and run the same test scenario again?

@elibogomolnyi
Author

elibogomolnyi commented Apr 5, 2023

@TheRealHaoLiu, it now works like a charm with the kOps cluster! The job keeps running.

DEBUG 2023/04/05 07:55:42 [nrzQBHF9] Detected http2.GoAwayError for pod awx/automation-job-7-xn7kk. Will retry 5 more times. Error: http2: server sent GOAWAY and closed the connection; LastStreamID=11, ErrCode=NO_ERROR, debug=""

It will take some time for us to check this issue with the EKS cluster since it requires cooperation from the AWS support side. But as far as I understand, it should also fix the EKS issue.

I appreciate your help; it is a very important fix. When can it be merged?

@elibogomolnyi
Author

elibogomolnyi commented Apr 13, 2023

Hi @TheRealHaoLiu, together with the AWS support team I've checked how AWX works with EKS, and everything works like a charm with your fix.
I used the image quay.io/haoliu/awx-ee:goaway.

Thanks to the community for moving this PR along so fast. If this PR is already merged into Receptor's devel branch, does that mean the next AWX release will contain this change?

@elibogomolnyi
Author

It is also worth mentioning that with the customized image, AWX can tolerate the EKS master node termination but cannot tolerate an EKS control plane upgrade. This is not a problem for us, since an EKS upgrade requires maintenance and downtime anyway, but it is good to know about it.

@TheRealHaoLiu
Member

Interesting; I tested this for an OCP upgrade and it held up pretty well. Have you tried using the graceful termination feature for AWX and a PodDisruptionBudget in kube?

I'm working on something to show how to make AWX tolerate a kube upgrade with no downtime.
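On the graceful-termination point, a PodDisruptionBudget can limit voluntary evictions of AWX pods during node drains. The manifest below is an illustrative sketch only; the name, namespace, minAvailable value, and label selector are assumptions that must be adapted to the actual deployment:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: awx-pdb          # illustrative name
  namespace: awx         # adjust to your AWX namespace
spec:
  minAvailable: 1        # keep at least one AWX pod up during drains
  selector:
    matchLabels:
      app.kubernetes.io/name: awx   # must match your AWX pods' labels
```

Note that a PDB only constrains voluntary disruptions such as drains and evictions; it does not help with an outright node crash.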

@TheRealHaoLiu
Member

What do you observe during the EKS control plane upgrade? Is the API server still reachable?

@elibogomolnyi
Author

elibogomolnyi commented Jun 11, 2023

Interesting; I tested this for an OCP upgrade and it held up pretty well. Have you tried using the graceful termination feature for AWX and a PodDisruptionBudget in kube?

I'm working on something to show how to make AWX tolerate a kube upgrade with no downtime.

Hi @TheRealHaoLiu,

I didn't try to use the graceful termination and PodDisruptionBudget, but I can do it when we continue our performance tests.

What do you observe during the EKS control plane upgrade? Is the API server still reachable?

I didn't check it myself, but the EKS API might be unreachable during this process, according to the AWS EKS documentation. If the API is not accessible for some time during the upgrade, does that mean AWX can't reconnect to it?

https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html


@CFSNM
Contributor

CFSNM commented Jun 21, 2023

Won't add coverage for this issue
