Improve allocation performance #583

ilkercelikyilmaz
Contributor

Removed the call that synced the gameserver cache on each allocation request and replaced it with a local cache of ready gameservers.

@agones-bot
Collaborator

Build Failed 😱

Build Id: 4e103398-2b4a-45ce-b013-d4222277ccd3

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@jkowalski
Contributor

Removing the mutex has the potential to introduce very subtle bugs (leaks, double allocation, panics). Can we add an allocation stress test before making this change? Then we can be confident we're not regressing.

How about this e2e scenario:

  • make a large-ish fleet (say 100 GSs), wait for ready
  • start ~10 goroutines, each allocating game servers till we get to exhaustion
  • wait for all to complete
  • verify that:
    • all GS got allocated
    • each time we got a different GS
    • each GS was properly updated

I would perhaps add one more twist: while allocating, we should have game servers update themselves (via SDK.SetLabel()), which will test that we're handling update conflicts in the allocation routine.
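
The scenario above could be sketched roughly as follows. This is only an illustration against an in-memory stand-in: `pool`, `allocate`, `stress`, and `errExhausted` are hypothetical names, not the Agones API or the e2e framework, and the `SDK.SetLabel()` twist is left out.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errExhausted is a hypothetical sentinel for "no ready game servers left".
var errExhausted = errors.New("no ready game servers")

// pool is an in-memory stand-in for a fleet; it is not the Agones API.
type pool struct {
	mu    sync.Mutex
	ready []string
}

// newPool creates a pool with n ready game servers named gs-0..gs-(n-1).
func newPool(n int) *pool {
	p := &pool{}
	for i := 0; i < n; i++ {
		p.ready = append(p.ready, fmt.Sprintf("gs-%d", i))
	}
	return p
}

// allocate hands out one ready game server, or errExhausted.
func (p *pool) allocate() (string, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.ready) == 0 {
		return "", errExhausted
	}
	gs := p.ready[len(p.ready)-1]
	p.ready = p.ready[:len(p.ready)-1]
	return gs, nil
}

// stress runs `workers` goroutines that allocate until exhaustion and
// returns how many times each game server was handed out.
func stress(p *pool, workers int) map[string]int {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		seen = map[string]int{}
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				gs, err := p.allocate()
				if err != nil {
					return // pool exhausted
				}
				mu.Lock()
				seen[gs]++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return seen
}

func main() {
	seen := stress(newPool(100), 10)
	for gs, n := range seen {
		if n != 1 {
			fmt.Printf("double allocation: %s handed out %d times\n", gs, n)
		}
	}
	fmt.Printf("allocated %d distinct game servers\n", len(seen))
}
```

A real e2e test would replace `pool.allocate` with actual allocation requests against the cluster and would also check that each game server's state was updated. Running a sketch like this with `go run -race` is a cheap way to surface data races.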

@markmandel
Member

Maybe that is something that can be added here:
https://github.com/GoogleCloudPlatform/agones/blob/master/test/e2e/fleet_test.go#L373 ?

@ilkercelikyilmaz
Contributor Author

I agree that removing the mutex would cause unexpected behavior and bugs. That's why I am still keeping it, but reducing its scope so that it only guards the local cache.

I did a few manual tests (with parallel allocations). I wanted to get quick feedback before spending more time on tests. I will add e2e test(s).

@agones-bot
Collaborator

Build Failed 😱

Build Id: 7fd72495-83db-4989-9710-160a9230771b

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Succeeded 👏

Build Id: ba933819-1f8f-4cac-bcbf-a91db70c245e

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/583/head:pr_583 && git checkout pr_583
  • helm install install/helm/agones --namespace agones-system --name agones --set agones.image.tag=0.8.0-b412383

@markmandel added the area/user-experience, area/performance, and feature-freeze-do-not-merge labels on Feb 12, 2019
@agones-bot
Collaborator

Build Failed 😱

Build Id: b6eda206-7f70-4044-af0f-908eab190666

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😱

Build Id: 6bbd38ec-f161-416b-807e-aed5db547897

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😱

Build Id: 22a163d1-2ba1-4c75-805e-d1e1b32f5434

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Succeeded 👏

Build Id: c8d1a308-429d-4a89-a5ba-9064a97e1bda

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/583/head:pr_583 && git checkout pr_583
  • helm install install/helm/agones --namespace agones-system --name agones --set agones.image.tag=0.8.0-211f4e2

@agones-bot
Collaborator

Build Succeeded 👏

Build Id: b098a310-293a-4cc6-9caa-d351b06e8a79

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/583/head:pr_583 && git checkout pr_583
  • helm install install/helm/agones --namespace agones-system --name agones --set agones.image.tag=0.8.0-7f8c99c

@markmandel removed the feature-freeze-do-not-merge label on Feb 21, 2019
@markmandel
Member

Looks like you're still actively working on this PR? Just wanted to touch base and check?

@ilkercelikyilmaz
Contributor Author

I made quite a bit of progress. I am planning to check in once I merge my changes. I might have more improvements, but I am planning to do them in another PR.

@markmandel
Member

Okay cool - will hold off doing another review until you give the 👍

Excited to see the improvements!

@agones-bot
Collaborator

Build Failed 😱

Build Id: c0054ed9-6665-452e-a656-daada969ba80

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Succeeded 👏

Build Id: 85d579e1-c743-470c-b6f7-7eef4dad72a2

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/583/head:pr_583 && git checkout pr_583
  • helm install install/helm/agones --namespace agones-system --name agones --set agones.image.tag=0.9.0-824a20e

@ilkercelikyilmaz
Contributor Author

It would be great if you could review. I would like to check in this version. I might have further improvements in a second round.

@agones-bot
Collaborator

Build Failed 😱

Build Id: 21145968-e5ff-4848-b565-114b7159c97b

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@ilkercelikyilmaz
Contributor Author

@markmandel, should we create a GSA CRD for failed GS allocations (with UnAllocated state)?
When I was doing load tests, the system ended up creating a lot of UnAllocated GSA records, and most didn't seem to be cleaned up immediately. That probably slows down overall processing.
I started returning an error when a server can't be allocated, but that is failing some tests. I wanted to check with you before I start cleaning them up. Please let me know what you think.

@agones-bot
Collaborator

Build Failed 😱

Build Id: d4460919-1c4c-42ca-88cf-13ce2524f4af

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Succeeded 👏

Build Id: 1639f513-9474-4664-9213-fca987cfbff1

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/583/head:pr_583 && git checkout pr_583
  • helm install install/helm/agones --namespace agones-system --name agones --set agones.image.tag=0.9.0-889a40e

@agones-bot
Collaborator

Build Succeeded 👏

Build Id: 7f2a9520-4b9b-4203-b231-7390af1fe7e4

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/583/head:pr_583 && git checkout pr_583
  • helm install install/helm/agones --namespace agones-system --name agones --set agones.image.tag=0.9.0-38c3999

@markmandel
Member

#600 should solve this issue, as we totally remove storage.

@ilkercelikyilmaz
Contributor Author

Agreed. I just wanted to make sure. I have already removed the UnAllocated GSA creation from the code.

Member

@markmandel left a comment

Overall, 💯 this approach. Is awesome.

Had a few questions and thoughts on a couple of things - but this is great.

cmd/controller/main.go (outdated, resolved)
pkg/gameserverallocations/controller.go (outdated, resolved)
pkg/gameserverallocations/controller.go (resolved)
pkg/gameserverallocations/controller.go (outdated, resolved)
}

// findComparator is a comparator function specifically for the
// findReadyGameServerForAllocation method for determining
// scheduling strategy
type findComparator func(bestCount, currentCount NodeCount) bool

var allocationRetry = wait.Backoff{
Steps: 5,
Member

Should the number of retries be configurable? Just curious how we came up with the number 5. (we have 30s to work in -- although with the webhook, it is serial, so maybe less is better until we get #600 or related) 🤷‍♂️

Contributor Author

I am not sure making this configurable will be helpful without some guidance. Eventually it can be.
Once everything is stable, we can try different values and find a good number.
I got this value from here. I was trying different numbers. I agree that it should not be too high. I can set it to 3 for now.
What do you think?

Member

That sounds fair - more than anything else, I just wanted to ask the question. I'm happy with 5 - seems like a legitimate number. I feel like I'd rather aim higher, and then maybe pull it down if need be.

@Kuqd curious - at some point, should we add a metric for this somewhere? How many retries?

pkg/gameserverallocations/controller.go (outdated, resolved)
pkg/gameserverallocations/controller.go (resolved)
pkg/gameserverallocations/controller.go (outdated, resolved)
@agones-bot
Collaborator

Build Failed 😱

Build Id: 02b7541b-7f66-4399-8672-0609fbc433ef

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😱

Build Id: cb4d5660-4bfa-48fd-878d-8d7724c5cfe2

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Succeeded 👏

Build Id: 32745520-1d35-4000-9dff-cc259698c542

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/583/head:pr_583 && git checkout pr_583
  • helm install install/helm/agones --namespace agones-system --name agones --set agones.image.tag=0.9.0-be94c12

@agones-bot
Collaborator

Build Failed 😱

Build Id: 62388e8f-e589-4e3d-8380-355d2fabe185

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😱

Build Id: a17b55dd-5944-46ab-9896-88a1681e1f6c

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Succeeded 👏

Build Id: a74c7057-ad35-42aa-a18d-f5fb15fcfb60

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/583/head:pr_583 && git checkout pr_583
  • helm install install/helm/agones --namespace agones-system --name agones --set agones.image.tag=0.9.0-ef54dd6

Member

@markmandel left a comment

Minor nit things, and a thought on documentation to add.

pkg/gameserverallocations/controller.go (resolved)
pkg/gameserverallocations/controller.go (outdated, resolved)
pkg/gameserverallocations/controller_test.go (outdated, resolved)
@agones-bot
Collaborator

Build Failed 😱

Build Id: d8a6822e-a514-4c0e-afff-a07596c6a611

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😱

Build Id: 3fba764a-f0cd-47f2-b265-7b1c50e67986

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Succeeded 👏

Build Id: 281bdccd-4de1-4dfc-b51d-0c2da4e47049

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/583/head:pr_583 && git checkout pr_583
  • helm install install/helm/agones --namespace agones-system --name agones --set agones.image.tag=0.9.0-b74731e

@markmandel
Member

LGTM. Needs to be squashed to a single commit, and then it's good to go!

@ilkercelikyilmaz force-pushed the Improve_Allocation_Performance branch from b74731e to e183529 on February 27, 2019 17:21
@agones-bot
Collaborator

Build Succeeded 👏

Build Id: d04ad053-5358-42a5-9433-4747a9aa389f

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/583/head:pr_583 && git checkout pr_583
  • helm install install/helm/agones --namespace agones-system --name agones --set agones.image.tag=0.9.0-e183529

@jkowalski merged commit bb85584 into googleforgames:master on Feb 27, 2019
@markmandel added this to the 0.9.0 milestone on Mar 7, 2019
Labels
area/performance Anything to do with Agones being slow, or making it go faster. area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc
4 participants