[Regression] Fleet scale down didn't adhere to Packed Scheduling #638
Conversation
Build Succeeded 👏 Build Id: 88eec1b5-2e46-40bc-a74d-ac6d97fc618a The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
pkg/gameserversets/controller.go
Outdated
```diff
@@ -353,33 +356,38 @@ func (c *Controller) syncGameServerSet(key string) error {
 // computeReconciliationAction computes the action to take to reconcile a game server set given
 // the list of game servers that were found and target replica count.
-func computeReconciliationAction(list []*v1alpha1.GameServer, targetReplicaCount int, maxCreations int, maxDeletions int, maxPending int) (int, []*v1alpha1.GameServer, bool) {
-	var upCount int // up == Ready or will become ready
+func (c *Controller) computeReconciliationAction(gsSet *v1alpha1.GameServerSet, list []*v1alpha1.GameServer,
```
instead of binding to "c", please pass counts as an argument; it's easier to test this as a pure function than as a function that can freely reach into "c".
one idea: pass a "deleteOrdering func(a, b *v1alpha1.GameServer) bool" comparator, and have the caller pass either the one that sorts based on time or the one that sorts on node fullness
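As a rough illustration of the comparator-injection idea above, here is a minimal sketch. The `GameServer` struct, the `counts` map, and `sortForDeletion` are simplified stand-ins for the real Agones types, not the actual API:

```go
package main

import (
	"fmt"
	"sort"
)

// GameServer is a simplified stand-in for v1alpha1.GameServer,
// used only to illustrate the comparator-injection idea.
type GameServer struct {
	Name     string
	NodeName string
}

// deleteOrdering is a "less" function: it reports whether a should be
// deleted before b.
type deleteOrdering func(a, b *GameServer) bool

// sortForDeletion sorts the list with whichever ordering the caller
// supplies (e.g. by creation time for Distributed, or by node fullness
// for Packed), keeping the reconciliation logic itself a pure function.
func sortForDeletion(list []*GameServer, less deleteOrdering) {
	sort.Slice(list, func(i, j int) bool { return less(list[i], list[j]) })
}

func main() {
	// Hypothetical per-node counts, as a Packed caller might supply them.
	counts := map[string]int{"node-a": 3, "node-b": 1}
	list := []*GameServer{
		{Name: "gs-1", NodeName: "node-a"},
		{Name: "gs-2", NodeName: "node-b"},
	}
	// Packed ordering: GameServers on the emptiest node first.
	packed := func(a, b *GameServer) bool {
		return counts[a.NodeName] < counts[b.NodeName]
	}
	sortForDeletion(list, packed)
	fmt.Println(list[0].Name) // gs-2 sits on the emptier node
}
```

The caller decides the policy; the reconciliation function only applies it.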
When running as Distributed, we're grabbing a random set of GameServers to delete to bring us back down under the declared amount we want. We can sort by time, but it didn't seem like a very valuable operation, given the relatively random nature of the selection.
> instead of binding to "c" please pass counts as an argument. easier to test this as a pure function vs function that can freely reach into "c".
The only downside there is that we need to grab a copy of Counts on every pass through, rather than just on scale down when Packed. It's a relatively small operation, but it does involve a lock and a data structure copy. Are we happy with that? Doing it this way limited any overhead on scale up.
we're not CPU-bound at this point, and this would be happening once per GSS; I would not worry about performance here
Sounds good - I'll make the change 👍
...and done!
```go
	// track the number of pods that are being created at any given moment by the GameServerSet
	// so we can limit it at a throughput that Kubernetes can handle
	var podPendingCount int // podPending == "up" but don't have a Pod running yet

	var potentialDeletions []*v1alpha1.GameServer
	var toDelete []*v1alpha1.GameServer
```
not sure why you need this change? just sort accordingly where we were sorting before?
If we took this away, we would need to push everything that could be deleted into `toDelete`, sort that, and then trim down to `min(deletedCount, maxDeletions)` from there.
Also, if there are GameServers that are in Error or Unhealthy states, they may get pushed back / would need to be specially sorted to the front.
TL;DR - I think tracking everything in `toDelete` makes the code harder to read, and harder to keep separate and test.
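A minimal sketch of the two-list approach under discussion, using simplified stand-in types (`pickDeletions` and its fields are illustrative assumptions, not the actual controller code): unhealthy GameServers go straight into `toDelete`, healthy candidates accumulate in `potentialDeletions`, get sorted (least-full node first), and are trimmed to fill the remaining deletion budget:

```go
package main

import (
	"fmt"
	"sort"
)

// GameServer is a simplified stand-in for the real type.
type GameServer struct {
	Name      string
	Unhealthy bool
	NodeCount int // how full this GameServer's node is
}

// pickDeletions separates must-delete (Error/Unhealthy) GameServers from
// healthy candidates, sorts the candidates, and trims to the budget.
func pickDeletions(list []*GameServer, deleteCount, maxDeletions int) []*GameServer {
	var toDelete, potentialDeletions []*GameServer
	for _, gs := range list {
		if gs.Unhealthy {
			toDelete = append(toDelete, gs)
		} else {
			potentialDeletions = append(potentialDeletions, gs)
		}
	}
	// Packed-style ordering: delete from the least used node first.
	sort.Slice(potentialDeletions, func(i, j int) bool {
		return potentialDeletions[i].NodeCount < potentialDeletions[j].NodeCount
	})
	need := deleteCount - len(toDelete)
	if need > len(potentialDeletions) {
		need = len(potentialDeletions)
	}
	if need > 0 {
		toDelete = append(toDelete, potentialDeletions[:need]...)
	}
	// Cap at the per-pass deletion throughput limit.
	if len(toDelete) > maxDeletions {
		toDelete = toDelete[:maxDeletions]
	}
	return toDelete
}

func main() {
	list := []*GameServer{
		{Name: "gs-err", Unhealthy: true},
		{Name: "gs-full", NodeCount: 5},
		{Name: "gs-empty", NodeCount: 1},
	}
	for _, gs := range pickDeletions(list, 2, 64) {
		fmt.Println(gs.Name) // gs-err, then gs-empty
	}
}
```

This keeps the "unhealthy first" rule separate from the Packed ordering, which is the readability argument being made above.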
that's exactly what we do today, and I don't think performance is a problem. Sorting 10,000 things on a modern CPU is basically nothing (they can do billions of operations per second per core...)
From my understanding of the code, we were getting a random sample of GameServers to delete (since an informer list is in random order) and then sorting that random set after the fact, which doesn't guarantee a consistent order.
Whereas this is tracking everything that could be deleted, and then sorting from there, which gives us a stronger guarantee of deletion order by removing the random aspect.
Even if I did remove `potentialDeletions`, I'd just be replacing it with `toDelete`, but the code wouldn't be the same.
Sorry, and to be clear: this isn't a performance issue. It's that the current implementation provides us with a random set of GameServers to delete, so there's no way to guarantee anything close to a consistent delete ordering when running Packed.
also added the newest sort back in for Distributed 👍
Force-pushed from 406f1b6 to d8a6985
Build Failed 😱 Build Id: 07cb0907-8577-4702-8bde-221a6174d784 To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Force-pushed from d8a6985 to 75ecf99
Build Succeeded 👏 Build Id: 598eaa2b-23f1-4dfb-b10b-96dd741aa233 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
Precursor to googleforgames#638 - this moves the AllocationCounter code to a more central GameServerCounter (which I think is a better name) that tracks the number of Ready and Allocated GameServers that are available on each node. These details are useful for sorting for `Packed` scheduling strategies. Once this PR is completed, we can use the Count() values provided by this controller in the GameServerSet scale down logic, to ensure that GameServers on the least used nodes are removed first when using a Packed strategy.
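A rough sketch of what such a per-node counter could look like. The type and method names here (`GameServerCounter`, `IncReady`, `Counts`) are illustrative assumptions, not the actual Agones API:

```go
package main

import (
	"fmt"
	"sync"
)

// NodeCount mirrors the idea described above: the number of Ready and
// Allocated GameServers on a single node.
type NodeCount struct {
	Ready, Allocated int64
}

// GameServerCounter keeps a running per-node tally behind a mutex, so
// scale-down logic can ask "which nodes are least used?" without
// recomputing counts from the full GameServer list on every reconcile.
type GameServerCounter struct {
	mu     sync.Mutex
	counts map[string]NodeCount
}

func NewGameServerCounter() *GameServerCounter {
	return &GameServerCounter{counts: map[string]NodeCount{}}
}

// IncReady records one more Ready GameServer on the given node.
func (c *GameServerCounter) IncReady(node string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	nc := c.counts[node]
	nc.Ready++
	c.counts[node] = nc
}

// Counts returns a copy of the map; this is the "lock and a data
// structure copy" cost discussed in the review thread.
func (c *GameServerCounter) Counts() map[string]NodeCount {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make(map[string]NodeCount, len(c.counts))
	for k, v := range c.counts {
		out[k] = v
	}
	return out
}

func main() {
	c := NewGameServerCounter()
	c.IncReady("node-a")
	c.IncReady("node-a")
	c.IncReady("node-b")
	fmt.Println(c.Counts()["node-a"].Ready) // 2
}
```

Returning a copy rather than the live map keeps callers from racing with the informer event handlers that update the counts.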
As requested - #639 is now available - moving the refactor out from this PR, so it can be reviewed independently 👍
Force-pushed from 75ecf99 to 9ba315a
Build Succeeded 👏 Build Id: bc89c29b-2de3-4e53-ad31-b1637031735a The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
Force-pushed from b13a88d to 096e4f9
Assuming this passes tests, this should be good for re-review. Made changes based on above, and I think this should be better now.
Build Succeeded 👏 Build Id: 3c709d12-90be-4e9d-9a40-c880de796b98 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
cmd/controller/main.go
Outdated
```diff
@@ -201,7 +201,7 @@ func main() {
 	kubeClient, extClient, agonesClient, agonesInformerFactory)

 rs = append(rs,
-	httpsServer, gsCounter, gsController, gsSetController, fleetController, faController, fasController, gasController, server)
+	httpsServer, gsCounter, gsCounter, gsController, gsSetController, fleetController, faController, fasController, gasController, server)
```
Why do we need two `gsCounter` in one `rs` slice? Possibly we need to add a comment for that?
Also, I think the comment in `perNodeCounter` Run() should be updated, because it now blocks on a mutex:
https://github.com/GoogleCloudPlatform/agones/blob/b133e5280d2bec4bee98f78f59561cc2dc5a7171/pkg/gameservers/prenodecounter.go#L136
whoops - that looks like a typo on my end. Let me remove that!
Fixed. The comment was to say that the function doesn't block on `stop` (which many Run functions do) - I can change it to be clearer?
Thanks for the clarification; I think there's no need to change the comment, since this file wasn't touched here. 🙇
The new Fleet scale up/scale down performance enhancements removed the functionality that the `Packed` strategy is documented to provide -- when scaling down, removing GameServers from least used Nodes first. To maintain performance as well, the set of GameServers only gets sorted when scaling down, and only for the Packed strategy.
Force-pushed from 096e4f9 to 5768ea3
Build Succeeded 👏 Build Id: 03c00ddc-545d-4a76-84e8-7ebecd72498e The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
LGTM
Build Succeeded 👏 Build Id: fa17a804-5465-43cc-a9d5-781b43ff1781 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
The new Fleet scale up/scale down performance enhancements removed the functionality that the `Packed` strategy is documented to provide -- when scaling down, removing GameServers from least used Nodes first.
This change implements a library for GameServer sorting that doesn't use reflection, unlike sort.Slice (so it's much faster), and can be used with any Comparator, as I expect we'll likely need this again.
This also pulls out the GameServerCounter controller - which keeps a running total of GameServers per Node - to make this kind of sorting much faster and easier, as we don't have to compute the counts on every iteration.
To maintain performance as well, the set of GameServers only gets sorted when scaling down, and only for the Packed strategy.
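As a sketch of the reflection-free, comparator-driven sorting idea described above (this is not the actual library added in this PR; the stand-in `GameServer` type and the plain insertion sort are simplified illustrations of the shape of the approach):

```go
package main

import "fmt"

// GameServer is a simplified stand-in type for illustration.
type GameServer struct {
	Name     string
	NodeFill int // how full this GameServer's node is
}

// Comparator reports whether a should come before b.
type Comparator func(a, b *GameServer) bool

// sortGameServers sorts a concrete slice type with an injected
// Comparator. Because both the element type and the swap are concrete,
// no reflection is involved (sort.Slice, by contrast, uses
// reflect-based swapping).
func sortGameServers(list []*GameServer, less Comparator) {
	// Insertion sort: simple and allocation-free; a production
	// library would likely use a faster algorithm for large inputs.
	for i := 1; i < len(list); i++ {
		for j := i; j > 0 && less(list[j], list[j-1]); j-- {
			list[j], list[j-1] = list[j-1], list[j]
		}
	}
}

func main() {
	list := []*GameServer{{"a", 3}, {"b", 1}, {"c", 2}}
	// Packed-style comparator: least full node first.
	sortGameServers(list, func(x, y *GameServer) bool { return x.NodeFill < y.NodeFill })
	for _, gs := range list {
		fmt.Print(gs.Name)
	}
	fmt.Println() // bca
}
```

Swapping in a different Comparator (e.g. by creation time) reuses the same sort, which is the "any Comparator" point made in the description.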