Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework game server health initial delay handling #3072

Merged
merged 3 commits into from
Apr 6, 2023

Conversation

zmerlynn
Copy link
Collaborator

@zmerlynn zmerlynn commented Apr 5, 2023

This is a redrive of #3046, which was reverted in #3068

Rework health check handling of InitialDelaySeconds. See #2966 (comment):

  • We remove any knowledge in the SDK of InitialDelaySeconds

  • We remove the runHealth goroutine from main and shift this responsibility to the /gshealthz handler

Along the way:

  • I noted that the FailureThreshold doesn't need to be enforced on both the kubelet and SDK side, so in the injected liveness probe, I dropped that to 1. Previously we were waiting more probes than we needed to. In practice this is not terribly relevant since the SDK pushes it into Unhealthy.

  • Close race if enqueueState is called rapidly before update can succeed

  • Re-add Autopilot 1.26 to test matrix (removed in Extend e2e queue timings / Disable testing on Autopilot 1.26 #3059)

Fixes #2966

@zmerlynn zmerlynn removed the request for review from aLekSer April 5, 2023 16:57
@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 71a15fb2-204a-47ad-abc0-9989b0daca30

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 3ef32307-8f10-43c5-8093-cc241d9edf8e

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 52ce88fb-78d7-40ed-bdab-7d88f75526d4

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/3072/head:pr_3072 && git checkout pr_3072
  • helm install agones ./install/helm/agones --namespace agones-system --agones.image.release=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.31.0-5102199-amd64

@@ -860,6 +867,10 @@ func (c *Controller) syncGameServerRequestReadyState(ctx context.Context, gs *ag
break
}
}
// Verify that we found the game server container - we may have a stale cache where pod is missing ContainerStatuses.
if _, ok := gsCopy.ObjectMeta.Annotations[agonesv1.GameServerReadyContainerIDAnnotation]; !ok {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

@markmandel
Copy link
Member

I don't see any issues - code looks good. I'll wait on CI before approving, see what we managed to fix!

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 7dd1b65f-c087-4159-b161-24ff2ba0a2e4

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/3072/head:pr_3072 && git checkout pr_3072
  • helm install agones ./install/helm/agones --namespace agones-system --agones.image.release=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.31.0-5ac3dc6-amd64

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 6a589602-3a78-4026-938f-9ce3b5476641

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 368c1903-6365-410e-96d1-9ce13166322c

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

zmerlynn added 2 commits April 6, 2023 16:33
This is a redrive of googleforgames#3046, which was reverted in googleforgames#3068

Rework health check handling of InitialDelaySeconds. See
googleforgames#2966 (comment):

* We remove any knowledge in the SDK of InitialDelaySeconds

* We remove the runHealth goroutine from main and shift this
responsibility to the /gshealthz handler

Along the way:

*  I noted that the FailureThreshold doesn't need to be enforced on
both the kubelet and SDK side, so in the injected liveness probe, I
dropped that to 1. Previously we were waiting more probes than we
needed to. In practice this is not terribly relevant since the SDK
pushes it into Unhealthy.

* Close race if enqueueState is called rapidly before update can succeed

* Re-add Autopilot 1.26 to test matrix (removed in googleforgames#3059)
If the SDK and controller win the race to update the Pod with the
GameServerReadyContainerIDAnnotation before kubelet even gets a chance
to add the running containers to the Pod, the controller may update
the pod with an empty annotation, which then confuses further runs.

Fixes TestPlayerConnectWithCapacityZero flakes

May fully fix googleforgames#2445 as well
Copy link
Member

@markmandel markmandel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do this

@google-oss-prow google-oss-prow bot added the lgtm label Apr 6, 2023
@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: markmandel, zmerlynn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: a8b0f538-3134-4514-b09d-dafd743e5be7

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/3072/head:pr_3072 && git checkout pr_3072
  • helm install agones ./install/helm/agones --namespace agones-system --agones.image.release=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.31.0-f01a862-amd64

@google-oss-prow google-oss-prow bot removed the lgtm label Apr 6, 2023
@google-oss-prow
Copy link

New changes are detected. LGTM label has been removed.

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 065c83dc-f4ac-4cd4-9839-e953879e0d37

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/3072/head:pr_3072 && git checkout pr_3072
  • helm install agones ./install/helm/agones --namespace agones-system --agones.image.release=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.31.0-4ddd361-amd64

@zmerlynn zmerlynn merged commit 26647a0 into googleforgames:main Apr 6, 2023
@zmerlynn zmerlynn deleted the reapply-3046 branch April 6, 2023 20:43
@Kalaiselvi84 Kalaiselvi84 added the kind/bug These are bugs. label Apr 10, 2023
@Kalaiselvi84 Kalaiselvi84 added this to the 1.31.0 milestone Apr 10, 2023
Kalaiselvi84 pushed a commit to Kalaiselvi84/agones that referenced this pull request Apr 11, 2023
* Rework game server health initial delay handling

This is a redrive of googleforgames#3046, which was reverted in googleforgames#3068

Rework health check handling of InitialDelaySeconds. See
googleforgames#2966 (comment):

* We remove any knowledge in the SDK of InitialDelaySeconds

* We remove the runHealth goroutine from main and shift this
responsibility to the /gshealthz handler

Along the way:

*  I noted that the FailureThreshold doesn't need to be enforced on
both the kubelet and SDK side, so in the injected liveness probe, I
dropped that to 1. Previously we were waiting more probes than we
needed to. In practice this is not terribly relevant since the SDK
pushes it into Unhealthy.

* Close race if enqueueState is called rapidly before update can succeed

* Re-add Autopilot 1.26 to test matrix (removed in googleforgames#3059)

* Close consistency race in syncGameServerRequestReadyState:
If the SDK and controller win the race to update the Pod with the
GameServerReadyContainerIDAnnotation before kubelet even gets a chance
to add the running containers to the Pod, the controller may update
the pod with an empty annotation, which then confuses further runs.

* Fixes TestPlayerConnectWithCapacityZero flakes

May fully fix googleforgames#2445 as well
zmerlynn added a commit to zmerlynn/agones that referenced this pull request May 16, 2023
…ckly

* waitForConnection is using the context provided by
NewSDKServerContext, but that context isn't closed until we see a
Shutdown(). That's not right - we want to restart the SDK server
quickly, which might save the game server in this odd case. So, move
waitForConnection up into main so that we can use the "normal"
context.

* Additionally, I noticed that when we went into graceful termination,
the liveness probes stopped! Agghhh. So, this reverts part of googleforgames#3072:
instead of using /gshealthz as our heartbeat, fall back to using a go
routine. But we still honor the intent of googleforgames#3072: health checks should
start when we see the first /gshealthz.

* Along the way, add some logging, and make it consistently "SDK" in
logging.
gongmax pushed a commit that referenced this pull request May 17, 2023
…ckly (#3157)

* sdkserver: When waitForConnection fails, container should restart quickly

* waitForConnection is using the context provided by
NewSDKServerContext, but that context isn't closed until we see a
Shutdown(). That's not right - we want to restart the SDK server
quickly, which might save the game server in this odd case. So, move
waitForConnection up into main so that we can use the "normal"
context.

* Additionally, I noticed that when we went into graceful termination,
the liveness probes stopped! Agghhh. So, this reverts part of #3072:
instead of using /gshealthz as our heartbeat, fall back to using a go
routine. But we still honor the intent of #3072: health checks should
start when we see the first /gshealthz.

* Along the way, add some logging, and make it consistently "SDK" in
logging.

---------
zmerlynn added a commit to zmerlynn/agones that referenced this pull request May 17, 2023
This again reverts part of googleforgames#3072. In looking at recent failures, in
cases where the sidecar is slow to come up due to networking issues,
/gshealthz fails, resulting often in the unnecessary restart of the
game server.

This returns to the more generous failure threshold from prior to
to just make this ~infinite, i.e. neuter kubelet liveness, since
health is really owned by the controller and sidecar.
zmerlynn added a commit to zmerlynn/agones that referenced this pull request May 17, 2023
This again reverts part of googleforgames#3072. In looking at recent failures, in
cases where the sidecar is slow to come up due to networking issues,
/gshealthz fails, resulting often in the unnecessary restart of the
game server.

This returns to the more generous failure threshold from prior to
to just make this ~infinite, i.e. neuter kubelet liveness, since
health is really owned by the controller and sidecar.
zmerlynn added a commit to zmerlynn/agones that referenced this pull request May 17, 2023
This again reverts part of googleforgames#3072. In looking at recent failures, in
cases where the sidecar is slow to come up due to networking issues,
/gshealthz fails, resulting often in the unnecessary restart of the
game server.

This returns to the more generous failure threshold from prior to
to just make this ~infinite, i.e. neuter kubelet liveness, since
health is really owned by the controller and sidecar.
zmerlynn added a commit that referenced this pull request May 17, 2023
This again reverts part of #3072. In looking at recent failures, in
cases where the sidecar is slow to come up due to networking issues,
/gshealthz fails, resulting often in the unnecessary restart of the
game server.

This returns to the more generous failure threshold from prior to
to just make this ~infinite, i.e. neuter kubelet liveness, since
health is really owned by the controller and sidecar.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Agones health check shouldn't fail during game server container image pull
4 participants