Rework game server health initial delay handling #3072

zmerlynn · 2023-04-05T16:57:14Z

This is a redrive of #3046, which was reverted in #3068

Rework health check handling of InitialDelaySeconds. See #2966 (comment):

* We remove any knowledge in the SDK of InitialDelaySeconds
* We remove the runHealth goroutine from main and shift this responsibility to the /gshealthz handler

Along the way:

* I noted that the FailureThreshold doesn't need to be enforced on both the kubelet and SDK side, so in the injected liveness probe, I dropped that to 1. Previously we were waiting for more probes than we needed to. In practice this is not terribly relevant since the SDK pushes it into Unhealthy.
* Close race if enqueueState is called rapidly before update can succeed
* Re-add Autopilot 1.26 to test matrix (removed in Extend e2e queue timings / Disable testing on Autopilot 1.26 #3059)

Fixes #2966
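
A minimal sketch of the idea described above, not the actual Agones code: since the kubelet does not send a liveness probe until InitialDelaySeconds has elapsed, the handler can evaluate health when it is probed and needs neither a copy of the delay nor a background goroutine. The names `healthTracker`, `newFakeTracker` and `probe` are hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// healthTracker is a hypothetical stand-in for the SDK server's health state.
// It has no notion of InitialDelaySeconds: the first probe is the start signal.
type healthTracker struct {
	mu       sync.Mutex
	timeout  time.Duration    // how long a health ping stays fresh
	lastPing time.Time        // time of the most recent ping (zero if none)
	now      func() time.Time // injectable clock
}

// Ping records a health ping from the game server.
func (h *healthTracker) Ping() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.lastPing = h.now()
}

// ServeHTTP plays the role of the /gshealthz handler: health is evaluated
// at probe time instead of in a separate goroutine.
func (h *healthTracker) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	h.mu.Lock()
	stale := h.now().Sub(h.lastPing) > h.timeout
	h.mu.Unlock()
	if stale {
		http.Error(w, "unhealthy", http.StatusInternalServerError)
		return
	}
	fmt.Fprint(w, "ok")
}

// newFakeTracker returns a tracker driven by a manual clock, plus a function
// that advances that clock by whole seconds.
func newFakeTracker(timeoutSeconds int) (*healthTracker, func(seconds int)) {
	current := time.Unix(0, 0)
	h := &healthTracker{
		timeout: time.Duration(timeoutSeconds) * time.Second,
		now:     func() time.Time { return current },
	}
	advance := func(seconds int) {
		current = current.Add(time.Duration(seconds) * time.Second)
	}
	return h, advance
}

// probe sends one simulated kubelet probe and returns the HTTP status code.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/gshealthz", nil))
	return rec.Code
}

func main() {
	h, advance := newFakeTracker(5)
	fmt.Println("no ping yet:", probe(h))
	h.Ping()
	fmt.Println("after ping:", probe(h))
	advance(6)
	fmt.Println("ping gone stale:", probe(h))
}
```

With a FailureThreshold of 1 on the kubelet side, a single non-200 response from a handler like this would be enough to restart the container.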

agones-bot · 2023-04-05T17:48:39Z

Build Failed 😱

Build Id: 71a15fb2-204a-47ad-abc0-9989b0daca30

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

agones-bot · 2023-04-05T18:39:05Z

Build Failed 😱

Build Id: 3ef32307-8f10-43c5-8093-cc241d9edf8e

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

agones-bot · 2023-04-05T20:19:42Z

Build Succeeded 👏

Build Id: 52ce88fb-78d7-40ed-bdab-7d88f75526d4

The following development artifacts have been built, and will exist for the next 30 days:

image: us-docker.pkg.dev/agones-images/ci/agones-controller:1.31.0-5102199-amd64
image: us-docker.pkg.dev/agones-images/ci/agones-sdk:1.31.0-5102199-linux-amd64
image: us-docker.pkg.dev/agones-images/ci/agones-ping:1.31.0-5102199-amd64
image: us-docker.pkg.dev/agones-images/ci/agones-allocator:1.31.0-5102199-amd64
Linux C++ SDK (build): agonessdk-1.31.0-5102199-amd64-linux-arch_64.tar.gz
SDK Server: agonessdk-server-1.31.0-5102199-amd64.zip

A preview of the website (the last 30 builds are retained):

https://5102199-dot-preview-dot-agones-images.appspot.com/

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/3072/head:pr_3072 && git checkout pr_3072
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.31.0-5102199-amd64

markmandel · 2023-04-05T23:28:40Z

pkg/gameservers/controller.go

@@ -860,6 +867,10 @@ func (c *Controller) syncGameServerRequestReadyState(ctx context.Context, gs *ag
 			break
 		}
 	}
+	// Verify that we found the game server container - we may have a stale cache where pod is missing ContainerStatuses.
+	if _, ok := gsCopy.ObjectMeta.Annotations[agonesv1.GameServerReadyContainerIDAnnotation]; !ok {


markmandel · 2023-04-05T23:30:23Z

I don't see any issues - code looks good. I'll wait on CI before approving, see what we managed to fix!

agones-bot · 2023-04-06T00:25:50Z

Build Succeeded 👏

Build Id: 7dd1b65f-c087-4159-b161-24ff2ba0a2e4

The following development artifacts have been built, and will exist for the next 30 days:

image: us-docker.pkg.dev/agones-images/ci/agones-controller:1.31.0-5ac3dc6-amd64
image: us-docker.pkg.dev/agones-images/ci/agones-sdk:1.31.0-5ac3dc6-linux-amd64
image: us-docker.pkg.dev/agones-images/ci/agones-ping:1.31.0-5ac3dc6-amd64
image: us-docker.pkg.dev/agones-images/ci/agones-allocator:1.31.0-5ac3dc6-amd64
Linux C++ SDK (build): agonessdk-1.31.0-5ac3dc6-amd64-linux-arch_64.tar.gz
SDK Server: agonessdk-server-1.31.0-5ac3dc6-amd64.zip

A preview of the website (the last 30 builds are retained):

https://5ac3dc6-dot-preview-dot-agones-images.appspot.com/

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/3072/head:pr_3072 && git checkout pr_3072
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.31.0-5ac3dc6-amd64

agones-bot · 2023-04-06T16:10:37Z

Build Failed 😱

Build Id: 6a589602-3a78-4026-938f-9ce3b5476641

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

agones-bot · 2023-04-06T16:26:14Z

Build Failed 😱

Build Id: 368c1903-6365-410e-96d1-9ce13166322c

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

If the SDK and controller win the race to update the Pod with the GameServerReadyContainerIDAnnotation before kubelet even gets a chance to add the running containers to the Pod, the controller may update the pod with an empty annotation, which then confuses further runs.

* Fixes TestPlayerConnectWithCapacityZero flakes
* May fully fix googleforgames#2445 as well
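
The race described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the controller's code: the `pod` and `containerStatus` types are simplified local stand-ins, the annotation key is a made-up value, and returning an error so the caller can retry is an assumption, since the diff in this thread is cut off before the body of the check.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified, hypothetical stand-ins for the Kubernetes Pod types.
type containerStatus struct {
	Name        string
	ContainerID string
}

type pod struct {
	Annotations       map[string]string
	ContainerStatuses []containerStatus
}

// readyContainerIDAnnotation is an illustrative key, not the real constant value.
const readyContainerIDAnnotation = "example.dev/ready-container-id"

var errContainerNotFound = errors.New("game server container not found in pod status")

// annotateReadyContainer records the container ID of the named container.
// If a stale cache returns a pod without ContainerStatuses, nothing is
// written and an error is returned, rather than storing an empty annotation.
func annotateReadyContainer(p *pod, containerName string) error {
	for _, cs := range p.ContainerStatuses {
		if cs.Name == containerName && cs.ContainerID != "" {
			if p.Annotations == nil {
				p.Annotations = map[string]string{}
			}
			p.Annotations[readyContainerIDAnnotation] = cs.ContainerID
			break
		}
	}
	// Verify that the game server container was actually found.
	if _, ok := p.Annotations[readyContainerIDAnnotation]; !ok {
		return errContainerNotFound
	}
	return nil
}

func main() {
	p := &pod{} // kubelet has not reported any containers yet
	fmt.Println("first attempt:", annotateReadyContainer(p, "game"))

	p.ContainerStatuses = []containerStatus{{Name: "game", ContainerID: "containerd://abc"}}
	fmt.Println("second attempt:", annotateReadyContainer(p, "game"))
	fmt.Println("annotation:", p.Annotations[readyContainerIDAnnotation])
}
```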

google-oss-prow · 2023-04-06T17:58:41Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: markmandel, zmerlynn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [markmandel]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

agones-bot · 2023-04-06T18:20:34Z

Build Succeeded 👏

Build Id: a8b0f538-3134-4514-b09d-dafd743e5be7

The following development artifacts have been built, and will exist for the next 30 days:

image: us-docker.pkg.dev/agones-images/ci/agones-controller:1.31.0-f01a862-amd64
image: us-docker.pkg.dev/agones-images/ci/agones-sdk:1.31.0-f01a862-linux-amd64
image: us-docker.pkg.dev/agones-images/ci/agones-ping:1.31.0-f01a862-amd64
image: us-docker.pkg.dev/agones-images/ci/agones-allocator:1.31.0-f01a862-amd64
Linux C++ SDK (build): agonessdk-1.31.0-f01a862-amd64-linux-arch_64.tar.gz
SDK Server: agonessdk-server-1.31.0-f01a862-amd64.zip

A preview of the website (the last 30 builds are retained):

https://f01a862-dot-preview-dot-agones-images.appspot.com/

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/3072/head:pr_3072 && git checkout pr_3072
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.31.0-f01a862-amd64

google-oss-prow · 2023-04-06T18:34:36Z

New changes are detected. LGTM label has been removed.

agones-bot · 2023-04-06T20:38:01Z

Build Succeeded 👏

Build Id: 065c83dc-f4ac-4cd4-9839-e953879e0d37

The following development artifacts have been built, and will exist for the next 30 days:

image: us-docker.pkg.dev/agones-images/ci/agones-controller:1.31.0-4ddd361-amd64
image: us-docker.pkg.dev/agones-images/ci/agones-sdk:1.31.0-4ddd361-linux-amd64
image: us-docker.pkg.dev/agones-images/ci/agones-ping:1.31.0-4ddd361-amd64
image: us-docker.pkg.dev/agones-images/ci/agones-allocator:1.31.0-4ddd361-amd64
Linux C++ SDK (build): agonessdk-1.31.0-4ddd361-amd64-linux-arch_64.tar.gz
SDK Server: agonessdk-server-1.31.0-4ddd361-amd64.zip

A preview of the website (the last 30 builds are retained):

https://4ddd361-dot-preview-dot-agones-images.appspot.com/

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/3072/head:pr_3072 && git checkout pr_3072
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.31.0-4ddd361-amd64

* Rework game server health initial delay handling

This is a redrive of googleforgames#3046, which was reverted in googleforgames#3068

Rework health check handling of InitialDelaySeconds. See googleforgames#2966 (comment):

* We remove any knowledge in the SDK of InitialDelaySeconds
* We remove the runHealth goroutine from main and shift this responsibility to the /gshealthz handler

Along the way:

* I noted that the FailureThreshold doesn't need to be enforced on both the kubelet and SDK side, so in the injected liveness probe, I dropped that to 1. Previously we were waiting more probes than we needed to. In practice this is not terribly relevant since the SDK pushes it into Unhealthy.
* Close race if enqueueState is called rapidly before update can succeed
* Re-add Autopilot 1.26 to test matrix (removed in googleforgames#3059)
* Close consistency race in syncGameServerRequestReadyState: If the SDK and controller win the race to update the Pod with the GameServerReadyContainerIDAnnotation before kubelet even gets a chance to add the running containers to the Pod, the controller may update the pod with an empty annotation, which then confuses further runs.
* Fixes TestPlayerConnectWithCapacityZero flakes

May fully fix googleforgames#2445 as well

…ckly (#3157)

* sdkserver: When waitForConnection fails, container should restart quickly
* waitForConnection is using the context provided by NewSDKServerContext, but that context isn't closed until we see a Shutdown(). That's not right - we want to restart the SDK server quickly, which might save the game server in this odd case. So, move waitForConnection up into main so that we can use the "normal" context.
* Additionally, I noticed that when we went into graceful termination, the liveness probes stopped! Agghhh. So, this reverts part of #3072: instead of using /gshealthz as our heartbeat, fall back to using a goroutine. But we still honor the intent of #3072: health checks should start when we see the first /gshealthz.
* Along the way, add some logging, and make it consistently "SDK" in logging.
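
One way to read the follow-up above: the heartbeat runs in its own goroutine again, but that goroutine is only started by the first /gshealthz request, so it keeps ticking even if probes later stop. A rough sketch of that pattern using `sync.Once`, with hypothetical names (`heartbeat`, `newHeartbeat`, `waitForTicks`, `probe`) and no claim to match the real sdkserver code:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"sync/atomic"
	"time"
)

// heartbeat is a hypothetical health loop that starts on the first probe.
type heartbeat struct {
	ticks    int64 // number of health evaluations so far (accessed atomically)
	ctx      context.Context
	interval time.Duration
	once     sync.Once
}

// newHeartbeat builds a heartbeat and returns a function that stops it.
func newHeartbeat(intervalMillis int) (*heartbeat, context.CancelFunc) {
	ctx, cancel := context.WithCancel(context.Background())
	hb := &heartbeat{ctx: ctx, interval: time.Duration(intervalMillis) * time.Millisecond}
	return hb, cancel
}

// ServeHTTP stands in for /gshealthz: the first request starts the loop,
// later requests leave it alone.
func (hb *heartbeat) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	hb.once.Do(func() { go hb.run() })
	fmt.Fprint(w, "ok")
}

// run ticks until the context is cancelled, independent of further probes.
func (hb *heartbeat) run() {
	t := time.NewTicker(hb.interval)
	defer t.Stop()
	for {
		select {
		case <-hb.ctx.Done():
			return
		case <-t.C:
			atomic.AddInt64(&hb.ticks, 1) // a real loop would evaluate health here
		}
	}
}

// Ticks reports how many times the loop has run.
func (hb *heartbeat) Ticks() int64 { return atomic.LoadInt64(&hb.ticks) }

// waitForTicks polls until at least n ticks were seen or the wait expires.
func waitForTicks(hb *heartbeat, n int64, maxWaitMillis int) bool {
	deadline := time.Now().Add(time.Duration(maxWaitMillis) * time.Millisecond)
	for time.Now().Before(deadline) {
		if hb.Ticks() >= n {
			return true
		}
		time.Sleep(time.Millisecond)
	}
	return hb.Ticks() >= n
}

// probe sends one simulated kubelet probe and returns the HTTP status code.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/gshealthz", nil))
	return rec.Code
}

func main() {
	hb, stop := newHeartbeat(1)
	defer stop()
	fmt.Println("ticking before first probe:", waitForTicks(hb, 1, 20))
	fmt.Println("probe status:", probe(hb))
	fmt.Println("ticking after first probe:", waitForTicks(hb, 3, 2000))
}
```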

This again reverts part of #3072. In looking at recent failures, in cases where the sidecar is slow to come up due to networking issues, /gshealthz fails, often resulting in an unnecessary restart of the game server. This returns to the more generous failure threshold from prior to #3072. An alternative would be to just make this ~infinite, i.e. neuter kubelet liveness, since health is really owned by the controller and sidecar.
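
To make the trade-off concrete, here is a small model of how a liveness failure threshold behaves: the kubelet restarts a container only after that many consecutive probe failures. `restartIndex` is a hypothetical helper, and the threshold of 3 is just an illustrative value, not necessarily the one Agones uses.

```go
package main

import "fmt"

// restartIndex returns the index of the probe that would trigger a restart,
// given the results of successive probes (true = success), or -1 if the
// threshold of consecutive failures is never reached.
func restartIndex(results []bool, failureThreshold int) int {
	consecutive := 0
	for i, ok := range results {
		if ok {
			consecutive = 0 // any success resets the failure count
			continue
		}
		consecutive++
		if consecutive >= failureThreshold {
			return i
		}
	}
	return -1
}

func main() {
	// A sidecar that is slow to come up: two failed probes, then healthy.
	slowStart := []bool{false, false, true, true}

	fmt.Println(restartIndex(slowStart, 1)) // 0: restarted on the first failure
	fmt.Println(restartIndex(slowStart, 3)) // -1: the slow start is tolerated
}
```

With a threshold of 1, the transient failure alone restarts the game server, which is the unnecessary restart described above.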

zmerlynn assigned markmandel Apr 5, 2023

google-oss-prow bot requested review from aLekSer and roberthbailey April 5, 2023 16:57

google-oss-prow bot added the size/L label Apr 5, 2023

zmerlynn force-pushed the reapply-3046 branch from 59d767a to 9be5126 Compare April 5, 2023 16:57

zmerlynn removed the request for review from aLekSer April 5, 2023 16:57

zmerlynn force-pushed the reapply-3046 branch from 5102199 to 5ac3dc6 Compare April 5, 2023 23:25

markmandel reviewed Apr 5, 2023

zmerlynn added 2 commits April 6, 2023 16:33

zmerlynn force-pushed the reapply-3046 branch from 0280747 to f01a862 Compare April 6, 2023 16:33

markmandel approved these changes Apr 6, 2023

google-oss-prow bot added the lgtm label Apr 6, 2023

google-oss-prow bot added the approved label Apr 6, 2023

Merge branch 'main' into reapply-3046

4ddd361

google-oss-prow bot removed the lgtm label Apr 6, 2023

zmerlynn merged commit 26647a0 into googleforgames:main Apr 6, 2023

zmerlynn deleted the reapply-3046 branch April 6, 2023 20:43

Kalaiselvi84 added the kind/bug label Apr 10, 2023

Kalaiselvi84 added this to the 1.31.0 milestone Apr 10, 2023

zmerlynn mentioned this pull request May 16, 2023

sdkserver: When waitForConnection fails, container should restart quickly #3157

Merged

zmerlynn mentioned this pull request May 17, 2023

Move back to FailureThreshold failures of /gshealthz #3160

Merged
