-
Notifications
You must be signed in to change notification settings - Fork 825
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rework game server health initial delay handling #3072
Conversation
Build Failed 😱 Build Id: 71a15fb2-204a-47ad-abc0-9989b0daca30 To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
Build Failed 😱 Build Id: 3ef32307-8f10-43c5-8093-cc241d9edf8e To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
Build Succeeded 👏 Build Id: 52ce88fb-78d7-40ed-bdab-7d88f75526d4 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
@@ -860,6 +867,10 @@ func (c *Controller) syncGameServerRequestReadyState(ctx context.Context, gs *ag | |||
break | |||
} | |||
} | |||
// Verify that we found the game server container - we may have a stale cache where pod is missing ContainerStatuses. | |||
if _, ok := gsCopy.ObjectMeta.Annotations[agonesv1.GameServerReadyContainerIDAnnotation]; !ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏻
I don't see any issues - code looks good. I'll wait on CI before approving, see what we managed to fix! |
Build Succeeded 👏 Build Id: 7dd1b65f-c087-4159-b161-24ff2ba0a2e4 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
Build Failed 😱 Build Id: 6a589602-3a78-4026-938f-9ce3b5476641 To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
Build Failed 😱 Build Id: 368c1903-6365-410e-96d1-9ce13166322c To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
This is a redrive of googleforgames#3046, which was reverted in googleforgames#3068 Rework health check handling of InitialDelaySeconds. See googleforgames#2966 (comment): * We remove any knowledge in the SDK of InitialDelaySeconds * We remove the runHealth goroutine from main and shift this responsibility to the /gshealthz handler Along the way: * I noted that the FailureThreshold doesn't need to be enforced on both the kubelet and SDK side, so in the injected liveness probe, I dropped that to 1. Previously we were waiting more probes than we needed to. In practice this is not terribly relevant since the SDK pushes it into Unhealthy. * Close race if enqueueState is called rapidly before update can succeed * Re-add Autopilot 1.26 to test matrix (removed in googleforgames#3059)
If the SDK and controller win the race to update the Pod with the GameServerReadyContainerIDAnnotation before kubelet even gets a chance to add the running containers to the Pod, the controller may update the pod with an empty annotation, which then confuses further runs. Fixes TestPlayerConnectWithCapacityZero flakes May fully fix googleforgames#2445 as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: markmandel, zmerlynn The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Build Succeeded 👏 Build Id: a8b0f538-3134-4514-b09d-dafd743e5be7 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
New changes are detected. LGTM label has been removed. |
Build Succeeded 👏 Build Id: 065c83dc-f4ac-4cd4-9839-e953879e0d37 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
* Rework game server health initial delay handling This is a redrive of googleforgames#3046, which was reverted in googleforgames#3068 Rework health check handling of InitialDelaySeconds. See googleforgames#2966 (comment): * We remove any knowledge in the SDK of InitialDelaySeconds * We remove the runHealth goroutine from main and shift this responsibility to the /gshealthz handler Along the way: * I noted that the FailureThreshold doesn't need to be enforced on both the kubelet and SDK side, so in the injected liveness probe, I dropped that to 1. Previously we were waiting more probes than we needed to. In practice this is not terribly relevant since the SDK pushes it into Unhealthy. * Close race if enqueueState is called rapidly before update can succeed * Re-add Autopilot 1.26 to test matrix (removed in googleforgames#3059) * Close consistency race in syncGameServerRequestReadyState: If the SDK and controller win the race to update the Pod with the GameServerReadyContainerIDAnnotation before kubelet even gets a chance to add the running containers to the Pod, the controller may update the pod with an empty annotation, which then confuses further runs. * Fixes TestPlayerConnectWithCapacityZero flakes May fully fix googleforgames#2445 as well
…ckly * waitForConnection is using the context provided by NewSDKServerContext, but that context isn't closed until we see a Shutdown(). That's not right - we want to restart the SDK server quickly, which might save the game server in this odd case. So, move waitForConnection up into main so that we can use the "normal" context. * Additionally, I noticed that when we went into graceful termination, the liveness probes stopped! Agghhh. So, this reverts part of googleforgames#3072: instead of using /gshealthz as our heartbeat, fall back to using a go routine. But we still honor the intent of googleforgames#3072: health checks should start when we see the first /gshealthz. * Along the way, add some logging, and make it consistently "SDK" in logging.
…ckly (#3157) * sdkserver: When waitForConnection fails, container should restart quickly * waitForConnection is using the context provided by NewSDKServerContext, but that context isn't closed until we see a Shutdown(). That's not right - we want to restart the SDK server quickly, which might save the game server in this odd case. So, move waitForConnection up into main so that we can use the "normal" context. * Additionally, I noticed that when we went into graceful termination, the liveness probes stopped! Agghhh. So, this reverts part of #3072: instead of using /gshealthz as our heartbeat, fall back to using a go routine. But we still honor the intent of #3072: health checks should start when we see the first /gshealthz. * Along the way, add some logging, and make it consistently "SDK" in logging. ---------
This again reverts part of googleforgames#3072. In looking at recent failures, in cases where the sidecar is slow to come up due to networking issues, /gshealthz fails, resulting often in the unnecessary restart of the game server. This returns to the more generous failure threshold from prior to to just make this ~infinite, i.e. neuter kubelet liveness, since health is really owned by the controller and sidecar.
This again reverts part of googleforgames#3072. In looking at recent failures, in cases where the sidecar is slow to come up due to networking issues, /gshealthz fails, resulting often in the unnecessary restart of the game server. This returns to the more generous failure threshold from prior to to just make this ~infinite, i.e. neuter kubelet liveness, since health is really owned by the controller and sidecar.
This again reverts part of googleforgames#3072. In looking at recent failures, in cases where the sidecar is slow to come up due to networking issues, /gshealthz fails, resulting often in the unnecessary restart of the game server. This returns to the more generous failure threshold from prior to to just make this ~infinite, i.e. neuter kubelet liveness, since health is really owned by the controller and sidecar.
This again reverts part of #3072. In looking at recent failures, in cases where the sidecar is slow to come up due to networking issues, /gshealthz fails, resulting often in the unnecessary restart of the game server. This returns to the more generous failure threshold from prior to to just make this ~infinite, i.e. neuter kubelet liveness, since health is really owned by the controller and sidecar.
This is a redrive of #3046, which was reverted in #3068
Rework health check handling of InitialDelaySeconds. See #2966 (comment):
We remove any knowledge in the SDK of InitialDelaySeconds
We remove the runHealth goroutine from main and shift this responsibility to the /gshealthz handler
Along the way:
I noted that the FailureThreshold doesn't need to be enforced on both the kubelet and SDK side, so in the injected liveness probe, I dropped that to 1. Previously we were waiting more probes than we needed to. In practice this is not terribly relevant since the SDK pushes it into Unhealthy.
Close race if enqueueState is called rapidly before update can succeed
Re-add Autopilot 1.26 to test matrix (removed in Extend e2e queue timings / Disable testing on Autopilot 1.26 #3059)
Fixes #2966