
AWX jobs can't tolerate the K8s master nodes restart or termination #13350

Closed
elibogomolnyi opened this issue Dec 19, 2022 · 21 comments

@elibogomolnyi

elibogomolnyi commented Dec 19, 2022

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.

Bug Summary

When one of the master nodes (a control plane node in the case of EKS) gets terminated or restarted, all the AWX jobs related to that node (we don't know how they are linked to the master node, but we can definitely see that they are) also get terminated. We see the "Error" status for these jobs in the UI. We checked this behavior with the following configurations:

  • kOps v1.25.5, AWX 21.1.0, 3 master nodes, 3 worker nodes, k8s-pod-network
  • kOps v1.25.5, AWX 21.10.1, 3 master nodes, 3 worker nodes, k8s-pod-network
  • kOps v1.24.8, AWX 21.10.1, 3 master nodes, 3 worker nodes, k8s-pod-network
  • EKS v1.24.7, AWX 21.1.0, EKS control plane, 3 worker nodes, Calico CNI (we had to ask AWS support to restart one of the Control Plane nodes to reproduce this issue)
  • EKS v1.24.7, AWX 21.1.0, EKS control plane, 3 worker nodes, AWS CNI (we had to ask AWS support to restart one of the Control Plane nodes to reproduce this issue)

The problem is more severe for EKS clusters, since AWS sometimes brings down the master nodes to perform package upgrades, and we can't control when that happens. As a result, whenever it does, the jobs that are somehow connected to the restarted or terminated master node are killed with "Error" and no discoverable reason.

AWX version

21.10.1

Select the relevant components

  • UI
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

Kubernetes

Modifications

no

Steps to reproduce

  1. Create a K8s cluster by kOps
  2. Install AWX on it using the operator
  3. Run some sample AWX jobs
  4. Terminate or stop one of the master nodes that is part of the auto-scaling group. The terminated master node should be running only the K8s system pods (kube-apiserver, kube-scheduler, and so on). If the AWX jobs are not terminated with the error, wait for the terminated node to be fully replaced, then repeat with another node until you reproduce the error.

Expected results

The AWX jobs keep running.

Actual results

The AWX jobs are terminated; in the UI, we can see "Error" without any logs.

Additional information

Screenshot 2022-12-19 at 12 06 31
failed_jobs_api_description.txt

@shanemcd
Member

Presumably this is due to how we use the k8s logging API to obtain the job results. We recently shipped a patch that attempts to reconnect when the log stream is terminated unexpectedly, but we haven't tested it under these conditions.
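The reconnect idea can be sketched generically as a wrapper that re-drives a log stream after a retryable failure, with bounded exponential backoff. This is an illustrative sketch only; `with_reconnect`, its parameters, and the backoff policy are hypothetical, not AWX's or Receptor's actual implementation:

```python
import time


def with_reconnect(stream_fn, max_retries=5, is_retryable=lambda exc: True,
                   sleep=time.sleep):
    """Yield items from stream_fn(), restarting it after retryable errors.

    Hypothetical helper for illustration: stream_fn is a zero-argument
    callable returning an iterator (e.g. a pod log stream).
    """
    retries = 0
    while True:
        try:
            yield from stream_fn()
            return  # stream ended normally, e.g. the job pod finished
        except Exception as exc:
            if not is_retryable(exc) or retries >= max_retries:
                raise
            retries += 1
            sleep(2 ** retries)  # back off before reconnecting
```

In the real system the stream would come from the Kubernetes pod-log API, and `is_retryable` would match connection-level failures such as the GOAWAY error discussed later in this thread. Note that a naive restart can also duplicate or drop log lines, which a real implementation has to account for.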

@AlanCoding @fosterseth From failed_jobs_api_description.txt - I can see that this is another example where we are shoving the entirety of the stdout into result_traceback, which has to be this code:

```python
lines = resultsock.readlines()
receptor_output = b"".join(lines).decode()
if receptor_output:
    self.task.runner_callback.delay_update(result_traceback=receptor_output)
```

This also makes me wonder if we might be overwriting another error that happened here:

```python
elif status_data['status'] == 'error':
    result_traceback = status_data.get('result_traceback', None)
    if result_traceback:
        self.delay_update(result_traceback=result_traceback)
```
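One defensive pattern, purely a sketch and not the actual AWX fix, would be to merge an incoming traceback into an existing one instead of overwriting it (`merge_traceback` is a hypothetical helper):

```python
def merge_traceback(existing, incoming):
    """Combine two traceback strings without losing the earlier one.

    Hypothetical helper for illustration only.
    """
    if not incoming:
        return existing
    if existing and existing != incoming:
        # keep the earlier, possibly more specific, error text
        return existing + "\n--- additional output ---\n" + incoming
    return incoming
```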

Does anyone watching this issue feel comfortable patching and building a custom AWX image? We might be able to provide some guidance on what to try. I probably won't have time to look into this myself before sometime early next year.

@shanemcd
Member

I'd like to get some clarification on something here.

Are nodes where AWX itself is running getting killed? Or just nodes where the Kubernetes API server is running?

@elibogomolnyi
Author

I'd like to get some clarification on something here.

Are nodes where AWX itself is running getting killed? Or just nodes where the Kubernetes API server is running?

Only master nodes, where the K8s API server is running.

@elibogomolnyi
Author

elibogomolnyi commented Dec 20, 2022

Does anyone watching this issue feel comfortable patching and building a custom AWX image? We might be able to provide some guidance on what to try. I probably won't have time to look into this myself before sometime early next year.

@shanemcd, I haven't built custom AWX images before. Still, since I have already checked all kinds of scenarios with different AWX and K8s versions, I think I could join you in troubleshooting, patching images, and checking the results. We are very interested in resolving this issue as soon as possible.
Could you define the scope of work and provide as much detailed guidance as possible? We will look into it with our team and think about how we can contribute to resolving this case.

@shanemcd
Member

@elibogomolnyi How long does the k8s api stay unavailable for?

In Receptor 1.3.0 we shipped this patch that attempts to recover when the log stream is unexpectedly terminated. Can you please verify that your control plane ee has this version of Receptor?

Could you define the scope of work and provide as much detailed guidance as possible?

Apologies, but you are asking for too much here. As I said before - I do not have time to look into this too deeply right now. I'm only working 2 more days before stepping away from work until sometime in early January. If you are unable to troubleshoot and resolve this problem yourself, perhaps a short-term solution would be to deploy into a distro of Kubernetes that does not have the auto-update behavior.

@elibogomolnyi
Author

elibogomolnyi commented Dec 20, 2022

@shanemcd,

How long does the k8s api stay unavailable for

After the master node is terminated, it stays unavailable for 5 minutes. It is also worth mentioning that when I terminate the master node, all AWX jobs get terminated almost immediately (about 20 seconds after the termination signal I send to the node), so it doesn't seem like any retry mechanism is working in this case.

In Receptor 1.3.0 we shipped this patch that attempts to recover when the log stream is unexpectedly terminated. Can you please verify that your control plane ee has this version of Receptor?

The Receptor version is 1.3.0+g8f8481c.
I can also see this message in the log:

Kubernetes version v1.24.8 is at least v1.24.8, using reconnect support

which means the new Receptor functionality is in use.

I'm only working 2 more days before stepping away from work until sometime in early January

I fully understand, and maybe I expressed myself poorly. You said that you might be able to provide some guidance on what to try. If you think that guidance could help someone who is not a contributor to this project and doesn't have much experience with it to resolve this issue, please let me know what we can try, and we will work on it with our team. And by the way, happy holidays!

If you are unable to troubleshoot and resolve this problem yourself, perhaps a short-term solution would be to deploy into a distro of Kubernetes that does not have the auto-update behavior.

Since the aim of our project is a migration to EKS, we can't deploy into a distro of Kubernetes that doesn't have the auto-update behavior. But we might postpone the production migration until this issue is resolved. If we could make it happen faster, we would be glad to contribute.

@fosterseth
Member

fosterseth commented Dec 21, 2022

PR to address the result_traceback bug here #12961

@elibogomolnyi
Author

elibogomolnyi commented Dec 21, 2022

@fosterseth

related PR here #12961

Could you explain how it is related? Do you think it might fix the issue caused by the master node restart?

@fosterseth
Member

@elibogomolnyi sorry, the PR I linked is for the result_traceback bug that shane pointed out

@elibogomolnyi
Author

elibogomolnyi commented Feb 2, 2023

@elibogomolnyi How long does the k8s api stay unavailable for?

In Receptor 1.3.0 we shipped this patch that attempts to recover when the log stream is unexpectedly terminated. Can you please verify that your control plane ee has this version of Receptor?

Could you define the scope of work and provide as much detailed guidance as possible?

Apologies, but you are asking for too much here. As I said before - I do not have time to look into this too deeply right now. I'm only working 2 more days before stepping away from work until sometime in early January. If you are unable to troubleshoot and resolve this problem yourself, perhaps a short-term solution would be to deploy into a distro of Kubernetes that does not have the auto-update behavior.

Hi @shanemcd, please tell me if there is anything else we can do to help to fix this bug.

@eranyo

eranyo commented Mar 12, 2023

Hi AWX community and team,
(@fosterseth @shanemcd)
I'm working with @elibogomolnyi on the same team, and we are facing this issue while running a high-traffic load of ~35K jobs.
If needed, we have a performance environment where we can easily run a high number of active jobs.
This issue is blocking us from running this high load in our production environment.

I hope to hear from you soon.

Thanks,
Eran.

@elibogomolnyi
Author

elibogomolnyi commented Apr 4, 2023

Following the conversation with @TheRealHaoLiu about this issue, we made some additional tests:
We created a kOps cluster version 1.24.12, installed the latest version of AWX on it, ran some AWX jobs, and terminated one of the kOps master nodes.
Please see the logs with our comments:
AWX logs.txt

We also checked that during the master node termination, we could still access the K8s API. We continuously ran "kubectl get nodes", and it was never interrupted, so the Kubernetes API kept working.

I am attaching the instructions for deploying the kOps cluster and AWX for full error reproduction:
Deploying the kOps.txt
AWX deployment.txt

We get the following error when one of the master nodes gets terminated:

ERROR 2023/04/04 09:26:02 [7EN3OaHV] Error reading from pod awx/automation-job-3-5wfl8: http2: server sent GOAWAY and closed the connection; LastStreamID=11, ErrCode=NO_ERROR

@TheRealHaoLiu
Member

I put up a very rough PR to test whether catching the GOAWAY error and retrying will work around this problem.

@TheRealHaoLiu
Member

TheRealHaoLiu commented Apr 4, 2023

@elibogomolnyi thanks for helping us identify the specific error we're encountering.

Here's a test image with my code change: quay.io/haoliu/awx-ee:goaway

Can you replace the control_plane_ee_image with this image and run the same test scenario again?

@elibogomolnyi
Author

elibogomolnyi commented Apr 5, 2023

@TheRealHaoLiu, it now works like a charm with the kOps cluster! The job keeps running.

DEBUG 2023/04/05 07:55:42 [nrzQBHF9] Detected http2.GoAwayError for pod awx/automation-job-7-xn7kk. Will retry 5 more times. Error: http2: server sent GOAWAY and closed the connection; LastStreamID=11, ErrCode=NO_ERROR, debug=""

It will take some time for us to check this issue with the EKS cluster since it requires cooperation from the AWS support side. But as far as I understand, it should also fix the EKS issue.

I appreciate your help; it is a very important fix. When can it be merged?

@elibogomolnyi
Author

elibogomolnyi commented Apr 13, 2023

Hi @TheRealHaoLiu, together with the AWS support team I've checked how AWX works with EKS, and everything works like a charm with your fix.
I used the image quay.io/haoliu/awx-ee:goaway.

Thanks to the community for moving this PR along so fast. If this PR is already merged into Receptor's devel branch, does that mean the next AWX release will contain this change?

@elibogomolnyi
Author

It is also worth mentioning that with the customized image, AWX can tolerate the EKS master node termination but cannot tolerate an EKS control plane upgrade. This is not a problem for us, since an EKS upgrade requires maintenance and downtime anyway, but it is good to know about it.

@TheRealHaoLiu
Member

Interesting; I tested this for an OCP upgrade and it held up pretty well. Have you tried using the graceful termination feature for AWX and a PodDisruptionBudget in kube?

I'm working on something to show how to make AWX tolerate a kube upgrade with no downtime.
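On the graceful-termination point, a PodDisruptionBudget can limit voluntary evictions of AWX pods during node drains. The manifest below is an illustrative sketch only; the name, namespace, minAvailable value, and label selector are assumptions that must be adapted to the actual deployment:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: awx-pdb          # illustrative name
  namespace: awx         # adjust to your AWX namespace
spec:
  minAvailable: 1        # keep at least one AWX pod up during drains
  selector:
    matchLabels:
      app.kubernetes.io/name: awx   # must match your AWX pods' labels
```

Note that a PDB only constrains voluntary disruptions such as drains and evictions; it does not help with an outright node crash.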

@TheRealHaoLiu
Member

What do you observe during the EKS control plane upgrade? Is the API server still reachable?

@elibogomolnyi
Author

elibogomolnyi commented Jun 11, 2023

Interesting; I tested this for an OCP upgrade and it held up pretty well. Have you tried using the graceful termination feature for AWX and a PodDisruptionBudget in kube?

I'm working on something to show how to make AWX tolerate a kube upgrade with no downtime.

Hi @TheRealHaoLiu,

I didn't try to use the graceful termination and PodDisruptionBudget, but I can do it when we continue our performance tests.

What do you observe during the EKS control plane upgrade? Is the API server still reachable?

I didn't check it myself, but the EKS API might be unreachable during this process, according to the AWS EKS documentation. If the API is not accessible for some time during the upgrade, does that mean AWX can't reconnect to it?

https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html


@CFSNM
Contributor

CFSNM commented Jun 21, 2023

Won't add coverage for this issue
