Extend e2e queue timings / Disable testing on Autopilot 1.26 #3059
Build Failed 😱 Build Id: 99b6ccdf-3e84-4532-9760-0e2cba467a32 To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Build Succeeded 👏 Build Id: a2550e3b-5312-4b8f-a749-ae3618ee84ed The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
81c99ad to 1a0d261
Build Failed 😱 Build Id: f40cfb94-2f8e-4473-9d47-5e99e6a357f6 To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Build Failed 😱 Build Id: 02966393-ed9a-4faa-a624-b32afc20c9b1 To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Build Failed 😱 Build Id: ca82581f-ca96-45b0-8796-a6da0744aa61 To get permission to view the Cloud Build view, join the agones-discuss Google Group.
This PR reworks the e2e timeouts to give a build more time to wait for its turn to run e2es, while tightening the e2e deadline slightly:

* Tighten the per-e2e-configuration test case timeout to 1.5h. e2es are coming in close to an hour in some cases, but now that we're not running consul, we don't need it as high as 2h. I don't think it's worth tightening this all the way to an hour, though it would probably work.
* Also drops the queueTtl for the CI sub-builds. These should not be queued for long since we serialize e2es now.
* Extends the e2e-wait-to-become-leader timeout to 3h. In higher traffic times, we're hitting this limit often now, which only results in a vicious cycle of retrying PRs. Instead, wait longer to become leader.
* Bumps the global timeout to 5h after aggregating: 3h (e2e-wait-to-become-leader) + 1.5h (e2e timeout) + 0.5h (everything else)
* Remove vestiges of consul - it's no longer running anywhere.
1a0d261 to d713c7a
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: gongmax, zmerlynn The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Build Succeeded 👏 Build Id: a8173545-828e-44bd-a45c-c10faf4e06b3 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
* Rework health check handling of InitialDelaySeconds

See #2966 (comment):

* We remove any knowledge in the SDK of InitialDelaySeconds
* We remove the runHealth goroutine from main and shift this responsibility to the /gshealthz handler

Along the way:

* I noted that the FailureThreshold doesn't need to be enforced on both the kubelet and SDK side, so in the injected liveness probe, I dropped that to 1. Previously we were waiting more probes than we needed to. In practice this is not terribly relevant since the SDK pushes it into Unhealthy.
* Close race if enqueueState is called rapidly before update can succeed
* Re-add Autopilot 1.26 to test matrix (removed in #3059)
This is a redrive of googleforgames#3046, which was reverted in googleforgames#3068

Rework health check handling of InitialDelaySeconds. See googleforgames#2966 (comment):

* We remove any knowledge in the SDK of InitialDelaySeconds
* We remove the runHealth goroutine from main and shift this responsibility to the /gshealthz handler

Along the way:

* I noted that the FailureThreshold doesn't need to be enforced on both the kubelet and SDK side, so in the injected liveness probe, I dropped that to 1. Previously we were waiting more probes than we needed to. In practice this is not terribly relevant since the SDK pushes it into Unhealthy.
* Close race if enqueueState is called rapidly before update can succeed
* Re-add Autopilot 1.26 to test matrix (removed in googleforgames#3059)
* Rework game server health initial delay handling

This is a redrive of #3046, which was reverted in #3068

Rework health check handling of InitialDelaySeconds. See #2966 (comment):

* We remove any knowledge in the SDK of InitialDelaySeconds
* We remove the runHealth goroutine from main and shift this responsibility to the /gshealthz handler

Along the way:

* I noted that the FailureThreshold doesn't need to be enforced on both the kubelet and SDK side, so in the injected liveness probe, I dropped that to 1. Previously we were waiting more probes than we needed to. In practice this is not terribly relevant since the SDK pushes it into Unhealthy.
* Close race if enqueueState is called rapidly before update can succeed
* Re-add Autopilot 1.26 to test matrix (removed in #3059)

* Close consistency race in syncGameServerRequestReadyState: If the SDK and controller win the race to update the Pod with the GameServerReadyContainerIDAnnotation before kubelet even gets a chance to add the running containers to the Pod, the controller may update the pod with an empty annotation, which then confuses further runs.

* Fixes TestPlayerConnectWithCapacityZero flakes

May fully fix #2445 as well
…orgames#3059)

* Extend e2e queue timings

This PR reworks the e2e timeouts to give a build more time to wait for its turn to run e2es, while tightening the e2e deadline slightly:

* Tighten the per-e2e-configuration test case timeout to 1.5h. e2es are coming in close to an hour in some cases, but now that we're not running consul, we don't need it as high as 2h. I don't think it's worth tightening this all the way to an hour, though it would probably work.
* Also drops the queueTtl for the CI sub-builds. These should not be queued for long since we serialize e2es now.
* Extends the e2e-wait-to-become-leader timeout to 3h. In higher traffic times, we're hitting this limit often now, which only results in a vicious cycle of retrying PRs. Instead, wait longer to become leader.
* Bumps the global timeout to 5h after aggregating: 3h (e2e-wait-to-become-leader) + 1.5h (e2e timeout) + 0.5h (everything else)
* Remove vestiges of consul - it's no longer running anywhere.

* Stop testing on Autopilot 1.26 until after googleforgames#3046
This PR reworks the e2e timeouts to give a build more time to wait for its turn to run e2es, while tightening the e2e deadline slightly:

* Tighten the per-e2e-configuration test case timeout to 1.5h. e2es are coming in close to an hour in some cases, but now that we're not running consul, we don't need it as high as 2h. I don't think it's worth tightening this all the way to an hour, though it would probably work.
* Extends the e2e-wait-to-become-leader timeout to 3h. In higher traffic times, we're hitting this limit often now, which only results in a vicious cycle of retrying PRs. Instead, wait longer to become leader.
* Bumps the global timeout to 5h after aggregating: 3h (e2e-wait-to-become-leader) + 1.5h (e2e timeout) + 0.5h (everything else)
* Remove vestiges of consul - it's no longer running anywhere.
* Disables testing on Autopilot 1.26 to work around #3058 (CI flaky on Autopilot 1.26). Testing will be re-enabled in #3046 (Rework game server health initial delay handling).