Extend e2e queue timings / Disable testing on Autopilot 1.26 #3059

zmerlynn · 2023-03-31T21:04:26Z

This PR reworks the e2e timeouts so that a build can wait longer for its turn to run e2es, while tightening the e2e deadline itself slightly:

- Tightens the per-e2e-configuration testcase timeout to 1.5h. e2es are coming in close to an hour in some cases, but now that we're not running consul, we don't need it as high as 2h. I don't think it's worth tightening this all the way to an hour, though it would probably work.
  - Also drops the queueTtl for the CI sub-builds; these should not be queued for long since we serialize e2es now.
- Extends the e2e-wait-to-become-leader timeout to 3h. In higher-traffic times we're hitting this limit often now, which only results in a vicious cycle of retrying PRs. Instead, wait longer to become leader.
- Bumps the global timeout to 5h, the sum of 3h (e2e-wait-to-become-leader) + 1.5h (e2e timeout) + 0.5h (everything else).
- Removes vestiges of consul; it's no longer running anywhere.
- Disables testing on Autopilot 1.26 to work around #3058 (CI flaky on Autopilot 1.26). Will re-enable in #3046 (Rework game server health initial delay handling).
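
For anyone checking the arithmetic behind the 5h global timeout, here is a minimal sketch that sums the three stage budgets quoted above. The constant and function names are made up for illustration and do not come from the Agones build config; the seconds-with-suffix rendering is only an assumption about how such a value might be written in a build file.

```python
from datetime import timedelta

# Stage budgets as quoted in the PR description (names are illustrative).
WAIT_TO_BECOME_LEADER = timedelta(hours=3)
E2E_TIMEOUT = timedelta(hours=1, minutes=30)
EVERYTHING_ELSE = timedelta(minutes=30)


def global_timeout(*stages: timedelta) -> timedelta:
    """Sum the per-stage budgets into one overall build timeout."""
    return sum(stages, timedelta())


def as_seconds_string(duration: timedelta) -> str:
    """Render a duration as whole seconds with an 's' suffix."""
    return f"{int(duration.total_seconds())}s"


if __name__ == "__main__":
    total = global_timeout(WAIT_TO_BECOME_LEADER, E2E_TIMEOUT, EVERYTHING_ELSE)
    print(total, as_seconds_string(total))
```

If any stage budget changes later, the global timeout should be recomputed the same way so that a build cannot be killed while it is still legitimately waiting for leadership or running e2es.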

agones-bot · 2023-03-31T22:40:21Z

Build Failed 😱

Build Id: 99b6ccdf-3e84-4532-9760-0e2cba467a32

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

agones-bot · 2023-04-01T02:20:45Z

Build Succeeded 👏

Build Id: a2550e3b-5312-4b8f-a749-ae3618ee84ed

The following development artifacts have been built, and will exist for the next 30 days:

image: us-docker.pkg.dev/agones-images/ci/agones-controller:1.31.0-81c99ad-amd64
image: us-docker.pkg.dev/agones-images/ci/agones-sdk:1.31.0-81c99ad-linux-amd64
image: us-docker.pkg.dev/agones-images/ci/agones-ping:1.31.0-81c99ad-amd64
image: us-docker.pkg.dev/agones-images/ci/agones-allocator:1.31.0-81c99ad-amd64
Linux C++ SDK (build): agonessdk-1.31.0-81c99ad-amd64-linux-arch_64.tar.gz
SDK Server: agonessdk-server-1.31.0-81c99ad-amd64.zip

A preview of the website (the last 30 builds are retained):

https://81c99ad-dot-preview-dot-agones-images.appspot.com/

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/3059/head:pr_3059 && git checkout pr_3059
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.31.0-81c99ad-amd64

agones-bot · 2023-04-03T18:37:32Z

Build Failed 😱

Build Id: f40cfb94-2f8e-4473-9d47-5e99e6a357f6

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

agones-bot · 2023-04-03T20:22:30Z

Build Failed 😱

Build Id: 02966393-ed9a-4faa-a624-b32afc20c9b1

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

agones-bot · 2023-04-03T22:09:08Z

Build Failed 😱

Build Id: ca82581f-ca96-45b0-8796-a6da0744aa61

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

google-oss-prow · 2023-04-03T22:40:22Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gongmax, zmerlynn
Once this PR has been reviewed and has the lgtm label, please assign ericfortin for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

agones-bot · 2023-04-03T23:34:35Z

Build Succeeded 👏

Build Id: a8173545-828e-44bd-a45c-c10faf4e06b3

The following development artifacts have been built, and will exist for the next 30 days:

image: us-docker.pkg.dev/agones-images/ci/agones-controller:1.31.0-d713c7a-amd64
image: us-docker.pkg.dev/agones-images/ci/agones-sdk:1.31.0-d713c7a-linux-amd64
image: us-docker.pkg.dev/agones-images/ci/agones-ping:1.31.0-d713c7a-amd64
image: us-docker.pkg.dev/agones-images/ci/agones-allocator:1.31.0-d713c7a-amd64
Linux C++ SDK (build): agonessdk-1.31.0-d713c7a-amd64-linux-arch_64.tar.gz
SDK Server: agonessdk-server-1.31.0-d713c7a-amd64.zip

A preview of the website (the last 30 builds are retained):

https://d713c7a-dot-preview-dot-agones-images.appspot.com/

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/3059/head:pr_3059 && git checkout pr_3059
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.31.0-d713c7a-amd64

Rework game server health initial delay handling

This is a redrive of #3046, which was reverted in #3068.

Rework health check handling of InitialDelaySeconds. See #2966 (comment):

- We remove any knowledge in the SDK of InitialDelaySeconds
- We remove the runHealth goroutine from main and shift this responsibility to the /gshealthz handler

Along the way:

- I noted that the FailureThreshold doesn't need to be enforced on both the kubelet and SDK side, so in the injected liveness probe, I dropped that to 1. Previously we were waiting more probes than we needed to. In practice this is not terribly relevant since the SDK pushes it into Unhealthy.
- Close race if enqueueState is called rapidly before update can succeed
- Re-add Autopilot 1.26 to test matrix (removed in #3059)

Further commits:

- Close consistency race in syncGameServerRequestReadyState: If the SDK and controller win the race to update the Pod with the GameServerReadyContainerIDAnnotation before kubelet even gets a chance to add the running containers to the Pod, the controller may update the pod with an empty annotation, which then confuses further runs.
- Fixes TestPlayerConnectWithCapacityZero flakes

May fully fix #2445 as well.

google-oss-prow bot requested review from aLekSer and markmandel March 31, 2023 21:04

zmerlynn requested a review from gongmax March 31, 2023 21:04

google-oss-prow bot added the size/S label Mar 31, 2023

zmerlynn assigned gongmax Mar 31, 2023

zmerlynn removed request for markmandel and aLekSer March 31, 2023 21:04

gongmax approved these changes Mar 31, 2023

google-oss-prow bot added the lgtm label Mar 31, 2023

zmerlynn enabled auto-merge (squash) March 31, 2023 21:25

zmerlynn force-pushed the extend-timeouts branch from 81c99ad to 1a0d261 Compare April 3, 2023 16:01

google-oss-prow bot removed the lgtm label Apr 3, 2023

zmerlynn added 2 commits April 3, 2023 22:31

Stop testing on Autopilot 1.26 until after googleforgames#3046

d713c7a

zmerlynn force-pushed the extend-timeouts branch from 1a0d261 to d713c7a Compare April 3, 2023 22:36

google-oss-prow bot added size/M and removed size/S labels Apr 3, 2023

zmerlynn changed the title ~~Extend e2e queue timings~~ Extend e2e queue timings / Disable testing on Autopilot 1.26 Apr 3, 2023

gongmax approved these changes Apr 3, 2023

google-oss-prow bot added the lgtm label Apr 3, 2023

zmerlynn disabled auto-merge April 3, 2023 22:40

zmerlynn enabled auto-merge (squash) April 3, 2023 22:40

zmerlynn merged commit 23e14d5 into googleforgames:main Apr 3, 2023

zmerlynn deleted the extend-timeouts branch April 3, 2023 23:43

zmerlynn added a commit to zmerlynn/agones that referenced this pull request Apr 3, 2023

Re-add Autopilot 1.26 to test matrix (removed in googleforgames#3059)

95d89ba

zmerlynn mentioned this pull request Apr 5, 2023

Rework game server health initial delay handling #3072

Merged

Kalaiselvi84 added the area/tests label Apr 10, 2023

Kalaiselvi84 added this to the 1.31.0 milestone Apr 10, 2023
