Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Regression] Fleet scale down didn't adhere to Packed Scheduling #638

Merged
merged 2 commits into from
Mar 15, 2019

Conversation

markmandel
Copy link
Member

The new Fleet scale up/scale down performance enhancements removed the functionality that the Packed strategy is documented to provide -- when scaling down, removing GameServers from least used Nodes first.

This change implements a library for GameServer sorting that doesn't use reflection, unlike sort.Slice (much faster), and can be used with any Comparator, as I expect we'll likely need this again.

This also pulls out the GameServerCounter controller - which keeps a running total of GameServers per Node, to make this kind of sorting much faster and easier as we don't have to compute this on every iteration.

To maintain performance as well, the set of GameServers only get sorted when scaling down, and only for Packed strategy.

@markmandel markmandel added kind/bug These are bugs. area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc labels Mar 6, 2019
@markmandel markmandel added this to the 0.9.0 milestone Mar 6, 2019
@markmandel markmandel requested a review from jkowalski March 6, 2019 17:52
@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 88eec1b5-2e46-40bc-a74d-ac6d97fc618a

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/638/head:pr_638 && git checkout pr_638
  • helm install install/helm/agones --namespace agones-system --name agones --set agones.image.tag=0.9.0-406f1b6

@@ -353,33 +356,38 @@ func (c *Controller) syncGameServerSet(key string) error {

// computeReconciliationAction computes the action to take to reconcile a game server set set given
// the list of game servers that were found and target replica count.
func computeReconciliationAction(list []*v1alpha1.GameServer, targetReplicaCount int, maxCreations int, maxDeletions int, maxPending int) (int, []*v1alpha1.GameServer, bool) {
var upCount int // up == Ready or will become ready
func (c *Controller) computeReconciliationAction(gsSet *v1alpha1.GameServerSet, list []*v1alpha1.GameServer,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of binding to "c" please pass counts as an argument. easier to test this as a pure function vs function that can freely reach into "c".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one idea: pass "deleteOrdering func(a, b *v1alpha1.GameServer)" and have the caller pass either the one that sorts based on time or node fullness

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When running as Distributed, we're grabbing a random set of gameservers to delete to bring us back down under the declared amount we want. We can sort by time, but it didn't seem like very valuable an operation, given the relatively random nature of the selection.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of binding to "c" please pass counts as an argument. easier to test this as a pure function vs function that can freely reach into "c".

The only downside there, is that we need to grab a copy of Counts on every pass through, rather than just on scale down, when Packed. It's relatively small operation, but it does involve a lock and a datastructure copy. Are we happy with that? Doing it this way limited any overhead on scale up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're not CPU-bound at this point and this would be happening once per GSS, I would not worry about performance here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good - I'll make the change 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...and done!


// track the number of pods that are being created at any given moment by the GameServerSet
// so we can limit it at a throughput that Kubernetes can handle
var podPendingCount int // podPending == "up" but don't have a Pod running yet

var potentialDeletions []*v1alpha1.GameServer
var toDelete []*v1alpha1.GameServer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why you need this change? just sort accordingly where we were sorting before?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we took this away, we would need to push everything that could be deleted into toDelete, sort that and then trim down to min(deletedCount, `maxDeletions) from there.

Also, if there are GameServers that are in Error on Unhealthy states, they may get pushed back / would need to be specially sorted to the front.

TL;DR - I think it makes the code harder to read, and harder to keep separate and test to track everything in toDelete.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's exactly what we do today, I don't think performance is a problem. Sorting 10000 things on a modern CPU is basically nothing (they can do billions of things per second per core...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my understanding of the code, we were getting a random sample of gameservers to delete (since and informer list is in random order) - and then sorting that random set after the fact - which doesn't.

Whereas this is tracking everything that could be deleted, and then sorting from there, which gives us a stronger guarantee of order of deletion by removing the random aspect.

Even if I did remove potentialAllocations, I'm just replacing it with toDelete but the code wouldn't be the same.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry and to be clear - this isn't a performance issue - it's that the current implementation provides us with a random set of gameservers to delete, there's no way to guarantee anything near to consistent delete ordering when running Packed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also updated back in the newest sort for Distributed 👍

@markmandel markmandel force-pushed the bug/regression-packed branch from 406f1b6 to d8a6985 Compare March 6, 2019 22:53
@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 07cb0907-8577-4702-8bde-221a6174d784

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@markmandel markmandel force-pushed the bug/regression-packed branch from d8a6985 to 75ecf99 Compare March 6, 2019 23:03
@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 598eaa2b-23f1-4dfb-b10b-96dd741aa233

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/638/head:pr_638 && git checkout pr_638
  • helm install install/helm/agones --namespace agones-system --name agones --set agones.image.tag=0.9.0-75ecf99

markmandel added a commit to markmandel/agones that referenced this pull request Mar 7, 2019
Precursor to googleforgames#638 - this moves the AllocationCounter code
to a more central GameServerCounter (which I think is a better name)
that tracks the number of Ready and Allocated GameServers that are
available on each node.

These details are useful for sorting for `Packed` scheduling strategies.

Once this PR is completed, we can use this Count() values provided by this
controller in the GameServerSet scale down logic, to ensure that GameServers
on the least used nodes are removed first when using a Packed strategy.
@markmandel
Copy link
Member Author

As requested - #639 is now available - moving the refactor out from this PR, so it can be reviewed independently 👍

markmandel added a commit to markmandel/agones that referenced this pull request Mar 7, 2019
Precursor to googleforgames#638 - this moves the AllocationCounter code
to a more central GameServerCounter (which I think is a better name)
that tracks the number of Ready and Allocated GameServers that are
available on each node.

These details are useful for sorting for `Packed` scheduling strategies.

Once this PR is completed, we can use this Count() values provided by this
controller in the GameServerSet scale down logic, to ensure that GameServers
on the least used nodes are removed first when using a Packed strategy.
markmandel added a commit to markmandel/agones that referenced this pull request Mar 11, 2019
Precursor to googleforgames#638 - this moves the AllocationCounter code
to a more central GameServerCounter (which I think is a better name)
that tracks the number of Ready and Allocated GameServers that are
available on each node.

These details are useful for sorting for `Packed` scheduling strategies.

Once this PR is completed, we can use this Count() values provided by this
controller in the GameServerSet scale down logic, to ensure that GameServers
on the least used nodes are removed first when using a Packed strategy.
markmandel added a commit that referenced this pull request Mar 11, 2019
Precursor to #638 - this moves the AllocationCounter code
to a more central GameServerCounter (which I think is a better name)
that tracks the number of Ready and Allocated GameServers that are
available on each node.

These details are useful for sorting for `Packed` scheduling strategies.

Once this PR is completed, we can use this Count() values provided by this
controller in the GameServerSet scale down logic, to ensure that GameServers
on the least used nodes are removed first when using a Packed strategy.
@markmandel markmandel force-pushed the bug/regression-packed branch from 75ecf99 to 9ba315a Compare March 12, 2019 03:30
@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: bc89c29b-2de3-4e53-ad31-b1637031735a

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/638/head:pr_638 && git checkout pr_638
  • helm install install/helm/agones --namespace agones-system --name agones --set agones.image.tag=0.9.0-9ba315a

@markmandel markmandel force-pushed the bug/regression-packed branch 2 times, most recently from b13a88d to 096e4f9 Compare March 13, 2019 01:10
@markmandel
Copy link
Member Author

Assuming this passes tests, this should be good for re-review. Made changes based on above, and I think this should be better now.

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 3c709d12-90be-4e9d-9a40-c880de796b98

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/638/head:pr_638 && git checkout pr_638
  • helm install install/helm/agones --namespace agones-system --name agones --set agones.image.tag=0.9.0-096e4f9

@@ -201,7 +201,7 @@ func main() {
kubeClient, extClient, agonesClient, agonesInformerFactory)

rs = append(rs,
httpsServer, gsCounter, gsController, gsSetController, fleetController, faController, fasController, gasController, server)
httpsServer, gsCounter, gsCounter, gsController, gsSetController, fleetController, faController, fasController, gasController, server)
Copy link
Collaborator

@aLekSer aLekSer Mar 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need two gsCounter in one rs slice? Possibly we need to add a comment for that?
Also I think that comment in perNodeCounter Run() should be updated, because it is now blocking with mutex:
https://github.com/GoogleCloudPlatform/agones/blob/b133e5280d2bec4bee98f78f59561cc2dc5a7171/pkg/gameservers/prenodecounter.go#L136

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoops - that looks like a typo on my end. Let me remove that!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. The comment was to say that the function doesn't block on stop (which many Run functions do) - I can change it to be more clear?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification, I think no need to change comment, cause this file was not touched here. 🙇

The new Fleet scale up/scale down performance enhancements removed
the functionality that the `Packed` strategy is documented to
provide -- when scaling down, removing GameServers from least used Nodes
first.

To maintain performance as well, the set of GameServers only get sorted when
scaling down, and only for Packed strategy.
@markmandel markmandel force-pushed the bug/regression-packed branch from 096e4f9 to 5768ea3 Compare March 14, 2019 15:36
@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 03c00ddc-545d-4a76-84e8-7ebecd72498e

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/638/head:pr_638 && git checkout pr_638
  • helm install install/helm/agones --namespace agones-system --name agones --set agones.image.tag=0.9.0-5768ea3

Copy link
Collaborator

@cyriltovena cyriltovena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: fa17a804-5465-43cc-a9d5-781b43ff1781

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/638/head:pr_638 && git checkout pr_638
  • helm install install/helm/agones --namespace agones-system --name agones --set agones.image.tag=0.9.0-0da4e46

@markmandel markmandel merged commit 49d475a into googleforgames:master Mar 15, 2019
@markmandel markmandel deleted the bug/regression-packed branch March 15, 2019 20:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc kind/bug These are bugs.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants