Use batching in GameServerAllocation controller to improve throughput. #536
Comments
As a first step, I'm going to look into moving In theory, there should be no/minimal change to how GSAs work -- at least that's the plan 😄
The good news: I have an API extension working for gameserver allocations (with some cleanup still to do). I've only implemented and supported the CREATE (HTTP POST) method on the API because, without storage, it's really the only one needed. If people want to watch for create events, I could look into the Watch function as well. The annoying news: each create API call has 60s to provide a response (although it can keep processing in the background), which makes one of the long-term goals, having an SDK.Ack() function for blocking on Allocation return, a little trickier. Or at least, it leaves a shorter timeout than I may have liked. Asking the community for feedback on that aspect (slack):
Regardless, this will now allow us to batch, skip storage for the GSA, etc. It also makes it easier if we decide to provide a gRPC interface for allocation as well.
This moves the implementation of GameServerAllocation (GSA) to a [Kubernetes API Extension](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/apiserver-aggregation/) instead of using CRDs.

This was essentially done for performance reasons, but to break it down:

1. GSA is now create only, since we had no need for GSA storage and don't want the performance hit.
1. This removes the mutation and validation webhooks, which have a 30s timeout and are run in serial.
1. The API Server still cuts off a response after 60s, but the API can continue processing (60s gives us enough time, I think, for an SDK.Ack() on Allocate, which I don't think we had before).
1. Validation now happens in the request.
1. We can now do batching of requests for higher throughput (googleforgames#536), since we control the entire HTTP request.
1. Sets us up if we decide we also want to have an alternative (HTTP and/or gRPC) endpoint for allocation, based on feedback from this implementation.

The breaking changes are:

1. GameServerAllocation's group is now `allocation.agones.dev` rather than `stable.agones.dev`, because a CRD group can't overlap with an API server.
1. Since there is only the `create` verb for GSA, there are no get/list/watch options for GameServerAllocations, so no informers/listers either. But this could be added at a later date, if needed.

This also includes some libraries for building further API server extension points.
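To make the "create only, validate in the request" idea concrete, here is a minimal sketch using only Go's standard `net/http` package. It is not the Agones implementation: the `allocationRequest` and `allocationResponse` types, the `/allocate` path, and the `statusFor` helper are all hypothetical names chosen for illustration. The handler rejects every verb except POST, validates the body inline instead of through a webhook, and stores nothing.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// allocationRequest and allocationResponse are hypothetical payloads,
// not the real Agones API types.
type allocationRequest struct {
	Fleet string `json:"fleet"`
}

type allocationResponse struct {
	GameServer string `json:"gameServer"`
}

// allocateHandler supports only the create verb (HTTP POST). Validation
// happens inside the request itself, and nothing is persisted.
func allocateHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		w.Header().Set("Allow", http.MethodPost)
		http.Error(w, "only create (POST) is supported", http.StatusMethodNotAllowed)
		return
	}
	var req allocationRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil || req.Fleet == "" {
		http.Error(w, "invalid allocation request", http.StatusBadRequest)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusCreated)
	// A real handler would pick a ready game server here; this sketch
	// returns a placeholder name.
	_ = json.NewEncoder(w).Encode(allocationResponse{GameServer: req.Fleet + "-gs-0"})
}

// statusFor runs the handler in memory and returns the HTTP status code.
func statusFor(method, body string) int {
	req := httptest.NewRequest(method, "/allocate", strings.NewReader(body))
	rec := httptest.NewRecorder()
	allocateHandler(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("POST valid:", statusFor(http.MethodPost, `{"fleet":"simple-udp"}`))
	fmt.Println("POST empty:", statusFor(http.MethodPost, `{}`))
	fmt.Println("GET:", statusFor(http.MethodGet, ""))
}
```

Because the handler owns the whole HTTP request, this is also the place where requests could be queued and batched before a response is written.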
@ilkercelikyilmaz that's a huge improvement over what we had previously 🔥 ( @pm7h do you have those numbers on hand? I can't seem to find them). Once we also incorporate #600, I wonder if we might be very close to where we need to be.
Yes, it's a huge improvement. Last I ran my load tests, it took over a minute for 100 allocations. You can see those results here: #412 (comment)
I made a couple of changes after talking to Jarek: using Update instead of Patch to prevent multiple allocations, and selecting a random gs from the top N (=20) of the available list to reduce the number of collisions.
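A rough sketch of those two changes, under stated assumptions: the `gameServer` and `store` types below are in-memory stand-ins, not Agones or client-go types, and the `Version` field only imitates a Kubernetes `resourceVersion`. In Kubernetes an Update carries the `resourceVersion` the caller read, so a write based on stale data is rejected with a conflict, which is presumably why Update prevents the same game server from being allocated twice. Picking at random from the top N means concurrent allocators are less likely to all race for the first entry.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"sync"
)

var errConflict = errors.New("conflict: game server was modified")

// gameServer is a hypothetical stand-in for the real resource.
type gameServer struct {
	Name      string
	Allocated bool
	Version   int // imitates a Kubernetes resourceVersion
}

// store is an in-memory stand-in for the API server.
type store struct {
	mu      sync.Mutex
	servers map[string]gameServer
}

// update succeeds only when the caller holds the latest version, so two
// allocators that read the same game server cannot both win.
func (s *store) update(gs gameServer) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	cur, ok := s.servers[gs.Name]
	if !ok || cur.Version != gs.Version {
		return errConflict
	}
	gs.Version++
	s.servers[gs.Name] = gs
	return nil
}

// pickFromTopN returns a random entry among the first n of ready, which is
// assumed to be sorted by preference already. intn is injected so the
// random source can be swapped out (for example rand.Intn).
func pickFromTopN(ready []gameServer, n int, intn func(int) int) (gameServer, bool) {
	if len(ready) == 0 || n <= 0 {
		return gameServer{}, false
	}
	if n > len(ready) {
		n = len(ready)
	}
	return ready[intn(n)], true
}

func main() {
	s := &store{servers: map[string]gameServer{}}
	var ready []gameServer
	for i := 0; i < 30; i++ {
		gs := gameServer{Name: fmt.Sprintf("gs-%d", i)}
		s.servers[gs.Name] = gs
		ready = append(ready, gs)
	}

	gs, ok := pickFromTopN(ready, 20, rand.Intn)
	if !ok {
		return
	}
	gs.Allocated = true
	fmt.Println("first write:", s.update(gs)) // version matches, accepted
	fmt.Println("stale write:", s.update(gs)) // same stale version, rejected
}
```

On a conflict, a caller would typically pick another candidate and retry rather than fail the allocation outright.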
This both cleans up the webhook component and makes it easier to test, and also sets us up to reuse the HTTPS server with the given cert pair, which we will want to do as we work on googleforgames#536 and set up an API server extension that needs exactly the same self-signed certificate setup.
/cc @ilkercelikyilmaz @jkowalski how do we feel about closing this issue, given the performance we have now?
I think this can be a good improvement, but there is no urgency, so we should keep it open. Not a blocker for 1.X though.
Good call 👍 I've moved it off the next milestone, but leaving it open.
I did, but it's hard to determine why that is happening. It would be useful to have the performance testing suite in open source in some way, so we can all test things. Might be good to do a CPU flame graph to see where the bottlenecks are.
I will try to check in my load test in 0.11.
I think this can be closed now! If you have objections, please say so, otherwise I will close on Tuesday!
No response! Closing! 😄
To get better throughput in the GSA controller we could do batching: group together N allocation requests, assign a GS to each of them, and commit them individually in parallel.
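The batching idea above could be sketched roughly as follows. This is an illustration under assumptions, not the controller's code: `request`, `collectBatch`, `assignBatch`, and the `commit` callback are hypothetical names, game servers are plain strings, and each `reply` channel is assumed to be buffered with capacity 1 so a commit goroutine never blocks on a slow caller.

```go
package main

import (
	"fmt"
	"sync"
)

// request is a hypothetical allocation request. reply receives the name of
// the assigned game server, or "" when none could be allocated.
type request struct {
	id    int
	reply chan string // assumed buffered with capacity 1
}

// collectBatch blocks for the first request, then drains whatever else is
// already queued, up to max requests in total.
func collectBatch(in <-chan request, max int) []request {
	first, ok := <-in
	if !ok {
		return nil
	}
	batch := []request{first}
	for len(batch) < max {
		select {
		case r, ok := <-in:
			if !ok {
				return batch
			}
			batch = append(batch, r)
		default:
			return batch
		}
	}
	return batch
}

// assignBatch gives each request a distinct ready game server and commits
// every assignment in its own goroutine. It returns the unused servers.
func assignBatch(batch []request, ready []string, commit func(gs string) error) []string {
	n := len(batch)
	if len(ready) < n {
		n = len(ready)
	}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(req request, gs string) {
			defer wg.Done()
			if err := commit(gs); err != nil {
				req.reply <- ""
				return
			}
			req.reply <- gs
		}(batch[i], ready[i])
	}
	// Requests beyond the number of ready servers get an empty reply.
	for _, req := range batch[n:] {
		req.reply <- ""
	}
	wg.Wait()
	return ready[n:]
}

func main() {
	in := make(chan request, 8)
	for i := 0; i < 5; i++ {
		in <- request{id: i, reply: make(chan string, 1)}
	}
	ready := []string{"gs-a", "gs-b", "gs-c"}

	batch := collectBatch(in, 4)
	ready = assignBatch(batch, ready, func(string) error { return nil })
	for _, r := range batch {
		fmt.Printf("request %d -> %q\n", r.id, <-r.reply)
	}
	fmt.Println("ready servers left:", len(ready))
}
```

Because assignment happens once per batch from a single list, requests in the same batch cannot collide with each other; only the commits run concurrently.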