Allow fleet autoscaler buffersize to be 0 #1782
Question on this: do you want a buffersize of 0, or do you want min replicas of 0? Which are probably different problems 😄 Basically, are we talking about scaling to zero at this point? The TL;DR either way is - I don't think it's worth the extra complexity to autoscale to 0, since there are many, many complex edge cases, and you are going to have at least 1 node running in your cluster regardless - so having just 1 game server in there is not going to cost you anything extra infrastructure-wise. But would love to hear more details!
I have a use case around this 😄 TL;DR: I want a fleet per engineer who regularly tests their branch against a server build. I would like to be able to spin up an arbitrary number of fleets for internal testing purposes; game devs want to be able to spin up servers for their own branch and not affect other people. So this would be the latter - I want replicas to be able to be 0, and therefore probably only want to start buffering once at least one instance is up. For us this currently means that I am running a fleet per engineer who regularly tests their branch, so we do not need to buy or set up hardware. Currently this means there are a fair few extra nodes running servers that are just sat in Ready.
The original use case I heard was also for development. I'm wondering if there is a different way to tackle this other than changing the fleet autoscaler... You can currently create an arbitrary number of fleets set to 0 replicas without an autoscaler. So the question becomes what changes the size from 0 -> 1 and back to 0. The first answer seems to be to try and make the fleetautoscaler do it, but it's also something you could drive from your CI system. Would something like this work:
Depending on what the devs are doing (maybe they need a variable number of game servers) step 2 could be to insert a fleet autoscaler and step 3 could be to remove it and set the replicas back to zero. This would mean that as long as the developer is actively pushing changes they would have a game server to test. And if they aren't then it gets reaped automatically.
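A rough sketch of what such a CI-driven step could look like, assuming the generated Agones Go clientset; the `dev` namespace, the flag names, and the `dev-alice` fleet are placeholders for illustration, not anything from this thread:

```go
// Sketch: a CI step that parks a per-developer Fleet at 0 replicas or wakes it
// up to 1, without touching the rest of the Fleet spec.
package main

import (
	"context"
	"flag"
	"fmt"
	"log"

	"agones.dev/agones/pkg/client/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := flag.String("kubeconfig", "", "path to kubeconfig")
	fleet := flag.String("fleet", "dev-alice", "fleet to scale")
	replicas := flag.Int("replicas", 1, "desired replicas (0 to park the fleet)")
	flag.Parse()

	cfg, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	agones, err := versioned.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Merge-patch only spec.replicas, leaving everything else on the Fleet alone.
	patch := []byte(fmt.Sprintf(`{"spec":{"replicas":%d}}`, *replicas))
	if _, err := agones.AgonesV1().Fleets("dev").Patch(
		context.Background(), *fleet, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Printf("scaled fleet %q to %d replicas", *fleet, *replicas)
}
```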
☝️ I like this idea. This goes back to my original question:
And if you want min replicas of 0 - what tells the system, "Hey, I'd like a Ready GameServer now, so I can do an allocation shortly"? I think @roberthbailey's strategy above is a good one. Maybe even tie it into your dev matchmaker somehow?
Given that this has been stagnant for a long time, I'm going to close it as "won't implement" (at least for now). We can always re-open to continue the discussion if there is anything more to add later.
Hi! Could we consider re-opening this discussion?
This is exactly my use case. I'm setting up Agones for my open-source pet project, and I would definitely like to cut infrastructure costs while the game is in active development, and a player may appear maybe once a month. Thank you. |
I'm happy to re-open, but I don't know if this is something we will be able to prioritize soon.
I'll repeat my original question:
Which then leads into the questions that came after it. Without answers to those questions, I'm not sure what more we can do here to automate this. One solution several people have used is the webhook autoscaler, coordinated with your dev matchmaker to size up Fleets as needed based on the needs of the development system - since your system actually knows whether you need new GameServers to scale up from 0, and Agones has no idea.
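As a rough illustration of that webhook approach, here is a minimal sketch of a matchmaker-side scale endpoint. The structs are a hand-rolled mirror of the webhook FleetAutoscaler's review payload rather than imported Agones types, and `pendingMatches()` is a hypothetical hook into whatever demand signal your matchmaker actually has:

```go
// Sketch of a webhook FleetAutoscaler target: the matchmaker answers
// "how many replicas do I want right now?", which lets it return 0
// when nobody is testing.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

type fleetStatus struct {
	Replicas          int32 `json:"replicas"`
	ReadyReplicas     int32 `json:"readyReplicas"`
	AllocatedReplicas int32 `json:"allocatedReplicas"`
}

type autoscaleRequest struct {
	UID    string      `json:"uid"`
	Name   string      `json:"name"`
	Status fleetStatus `json:"status"`
}

type autoscaleResponse struct {
	UID      string `json:"uid"`
	Scale    bool   `json:"scale"`
	Replicas int32  `json:"replicas"`
}

type autoscaleReview struct {
	Request  *autoscaleRequest  `json:"request"`
	Response *autoscaleResponse `json:"response"`
}

// pendingMatches is a stand-in for the matchmaker's view of demand,
// e.g. players currently waiting for a server.
func pendingMatches() int32 { return 0 }

func scaleHandler(w http.ResponseWriter, r *http.Request) {
	var review autoscaleReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
		http.Error(w, "bad review payload", http.StatusBadRequest)
		return
	}
	// Keep every allocated server, plus one Ready server per pending match.
	// With no demand this returns 0, i.e. the fleet scales down to zero.
	desired := review.Request.Status.AllocatedReplicas + pendingMatches()
	review.Response = &autoscaleResponse{
		UID:      review.Request.UID,
		Scale:    true,
		Replicas: desired,
	}
	_ = json.NewEncoder(w).Encode(review)
}

func main() {
	http.HandleFunc("/scale", scaleHandler)
	log.Fatal(http.ListenAndServe(":8000", nil))
}
```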
I believe I want both min replicas and buffersize set to 0.
That's an interesting idea. I'll dive deeper into its documentation; maybe it's indeed something I could use as a solution.
Won't creating new game server allocations give Agones the idea to scale up the fleet? I was thinking about coding my matchmaker service in such a way that it would ask the Kubernetes API to create new allocations.
But we can't guarantee that a game server will spin up before the allocation request times out (I've lost track of whether it's 30s or a minute) - and that timeout is locked in by the K8s API.
Would be nice to have this:
I have many fleets; most of them sit idle and are only tested from time to time.
@dzmitry-lahoda what would that do exactly? leave the Fleet at 0? And then what happens on allocation? I'm assuming nothing. At which point, I'm wondering what is the point of the autoscaler at all? 🤔
i see that a fleet needs to run at least 1 hot server. our main gs do run at least one. but testing and debug gs are launched from time to time, not often, so not sure if these need 1 hot. why use a fleet? to reuse the same mm and devops flows for these gs. if an allocation is requested and no gs is ready, and the limit is not reached, maybe launch one gs. fine for the first allocation request to time out; the system will ask again in a loop. if the gs did not become allocated, but is stuck in ready, shut it down after a timeout - assuming 2x the allocation timeout. i would not like to operate the fleet via the api. developing a devops flow which, depending on some activity, scales to one and back seems complicated.
So ultimately, you are asking for scale to zero with Fleet auto scaling, which I'm not against, but is a fair bit of a nightmare to handle all the edge cases. I also don't think we can do a "scale to zero, but only for development". As soon as it exists, it needs to work for production and development at all times. What (I think I understood) from the above, probably works for you and your game, but may not work for everyone, so it requires a pretty thorough design, with consideration for all the race conditions that can occur. This is also noting that this system is not (mostly) imperative. It's a declarative, self-healing system with a set of decoupled systems working in concert - so it's a little bit trickier than saying "on allocation, just spin up a game server". Who has that responsibility? Is the allocator service now changing replicas in the Fleet? (which is otherwise the autoscaler's responsibility). What happens when the autoscaler collides with the allocation creating a new GameServer and removes it? Maybe we should change the min buffer on the autoscaler? But then, what tells it to scale back down? Uuurg. it gets very messy very quickly.
I am amused by this 😁 yes, this is complicated, that's why we've never really done it. But I'd love to hear if people have detailed designs in mind that cover both scale up and scale down across all the integrated components 👍🏻
my naive attempt at a code/design spec. the state of the fleet becomes a sum type:

    enum FleetSize {
        YamlSpec(buffer, replicas),
        LiftedSpec(YamlSpec, least_timeout)
    }

scenarios:
- fleet is 1
- fleet is 0
- Ready
- Allocated
- fleet is 1, but yaml spec'd to 0 - what if the spec was changed to 0, and an allocation came along to change the LiftedSpec to 1?
- fleet is 1+ always

concerns:
- would a collision be the same as if i changed the yaml manually to +1, then -1, then +1?
- LiftedSpec lease timeout: the lease timeout should be at least 2x the allocation timeout. i would prefer it could be set dynamically via the k8s API of Agones.

production: so in production i could then provide a lease to grow the buffer from 13 to 42 if a LeasedAllocatorFunction tells it to do so. the default LeasedAllocatorFunction would be ignorant of allocation requests, with an option to swap it by allocator function name - e.g. some simple linear interpolation from the last 3 minutes of allocations (the time needed to spawn a new VM and docker). not sure what the hook in Agones is to do buffer customization except changing the spec via YAML, but again - it would be nice to have some algorithm built in; it may be similar tech to what is used for near zero.
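For what it's worth, here is one possible reading of that "linear interpolation from the last 3 minutes of allocations" idea as a buffer-sizing function. Everything here - the function name, the window, the spin-up latency - is an assumption for illustration, not an existing Agones hook:

```go
// Illustrative only: size the Ready buffer from the recent allocation rate,
// projected over how long a new node + game server takes to come up.
package main

import (
	"fmt"
	"time"
)

// bufferFromRecentAllocations counts allocations inside the window, turns that
// into a rate, and keeps enough warm servers to cover the spin-up latency.
func bufferFromRecentAllocations(allocations []time.Time, window, spinUp time.Duration, now time.Time) int {
	recent := 0
	for _, t := range allocations {
		if now.Sub(t) <= window {
			recent++
		}
	}
	rate := float64(recent) / window.Seconds() // allocations per second
	buffer := int(rate*spinUp.Seconds() + 0.5) // round to nearest
	if buffer < 0 {
		buffer = 0 // allows scale-to-zero when there has been no recent demand
	}
	return buffer
}

func main() {
	now := time.Now()
	history := []time.Time{now.Add(-30 * time.Second), now.Add(-90 * time.Second)}
	// 2 allocations in the last 3 minutes, ~2 minutes of spin-up -> buffer of 1.
	fmt.Println(bufferFromRecentAllocations(history, 3*time.Minute, 2*time.Minute, now))
}
```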
For the use case of development, we thought about doing something like this, but we actually settled on an internal command (in our case it's in game, but you could do a slack command/something else all the same) that creates a one-off GameServer.
yeah, can create it in code like that.
i would either have users who are responsible (have access to the cluster and can allocate as they please, and then clean up) and others, who go only via the formalism of the fleet setup.
@markmandel oh, I didn't know about that. Is there a way to track allocations? I noticed that they always spawn a gameserver in case there wasn't an available one, but the response doesn't always include a server name (I believe it happens if a game server didn't exist and it's a newly created one). So I believe that makes it impossible to reliably correlate allocations to game servers and track their status in this case.

Also, what makes a game allocation time out? My current understanding is that it will time out if a game server doesn't get promoted to Ready in time.

And the last question I have: if an allocation request times out, will the game server get deleted as well? Or will it remain in Ready?

If a game server allocation always responded with a game server name, and game servers outlived timed-out allocations, I believe the problem you've mentioned could be mitigated by letting clients wait for game servers to finally boot up and re-sending game allocations if needed.
So I'm struggling with this for a variety of reasons:
The more I think about this too, I don't think there should be blocking operations in an allocation. We already have a limited set of retries, but allocation tends to be a hot path, and I'm not comfortable putting blockers in its way. To make things even MORE complicated: an Allocation is not tied to a Fleet. It has a set of preferential selectors, which can easily be cross-Fleet, based on arbitrary labels, or used with singular GameServers - so we can't even tie a Fleet spec/replicas/autoscalers to an Allocation either. Again, I come back to webhooks. Only you know how to scale to 0 based on your matchmaking criteria (especially if you are scaling your nodes to 0 and need to account for node scale-up time), you will likely want to do it well before allocation happens, and you are the one that knows which Fleets to scale and when. I don't think Agones can do that for you.
unsafe, but it can be coded as a pattern. the typical sum type in go is returning an error and a result value; usually when there is an error there is no result.
sure, there should not be. but could there be an extra speculative heuristic to overallocate?
so an allocation could warm up several fleets from zero? that may be fine if that is what was requested - a request which covers several fleets by selectors?
I can do that, but I am not sure what the MM should know. Example: I have a VM with 4 GS Allocated. There is room for a 5th, which is Ready, with a buffer of one. So right after that 5th is allocated, a new VM should warm up. Imagine that the buffer of VMs is zero. So the allocation request will fail in a loop (rate-limited client requests, with open match between user and allocator), until the VM boots up. By induction this may happen with any buffer size. With no blocking allocation and a non-infinite buffer of everything, requests do fail and time out regardless of fleet size. You need to account for node scale-up time. But yeah, it is clearer now which options can be used to imitate a zero-size fleet. Another option: a Fleet with a CPU limit of near zero. In this case the fleet will never be able to run an instance. When an AR comes via the MM (before the allocate call to Agones), call K8S to increase the CPU limit, and after a timeout reduce it back. Store the timeout, time, and limits in Fleet annotations, so they can always be set right. But this requires write access to K8S :( So I will need to isolate that with RBAC and even a namespace.
my 2p ... I think this isn't an Agones problem and will likely add more complexity and confusion for the majority of users ... it is more of a workflow/testing issue that could be different for all studios. Much like @theminecoder, our studio has got around this issue using a workflow that will easily spin up and down fleets/game servers via CI/CD; this achieves the same goal and keeps it out of the Agones codebase. Even running UE4 game servers does not add very much cost or complexity; there is some logic that means (the majority of) dev servers have all maps loaded rather than requiring server travel to different containers, but this is adequate for them and can easily be altered if server travel needs to be tested.
@domgreen so everybody solves the same problem and builds a workaround - why shouldn't it make it into Agones? workarounds create an additional attack vector. what if i want to allow running test GS on a live env, having
I wanted to share my results of building a webhook fleet autoscaler, as @markmandel suggested in one of the previous posts. The implementation turned out to be fairly simple. In my setup, I have a matchmaker service, which clients connect to via WebSocket when they want to find a server to play on. In order to track servers, my matchmaker subscribes to a kubernetes namespace and watches GameServer resources.
I use the assumption that
So, basically, the only complexity introduced over the initially discussed variant (where the matchmaker just calls the Agones/Kubernetes API to create an allocation) is spinning up an HTTP listener that responds with the desired replica count, instead of making the API calls yourself. With all the mentioned problems unsolved (allocation retries, blocking, etc.), the webhook indeed sounds like a more reliable solution, and it allows for more flexibility. If anyone's interested in a ready recipe for a matchmaker service written in Rust, I can share my example:
Disclaimer: my DevOps knowledge is quite limited, so read the config with caution if you decide to take inspiration from it. I can't promise the setup is effective and secure enough. :) Another important note is that my current setup assumes only 1 replica for the matchmaker service. Fixing this limitation would require a more complex solution that supports sharing state between replicas. Otherwise, different replicas will respond with different numbers of active WebSocket subscribers, and that can affect the desired fleet replica count. I hope this helps.
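For anyone who wants the GameServer-watching part of this setup without generating Agones clients, a minimal sketch using the dynamic Kubernetes client could look like the following. The namespace, the in-cluster config, and the status field paths are assumptions based on the description above:

```go
// Sketch: watch GameServer objects in a namespace and log their state and
// address, roughly what a matchmaker needs to track available servers.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes the matchmaker runs inside the cluster
	if err != nil {
		log.Fatal(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	gvr := schema.GroupVersionResource{Group: "agones.dev", Version: "v1", Resource: "gameservers"}
	w, err := client.Resource(gvr).Namespace("default").Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for ev := range w.ResultChan() {
		gs, ok := ev.Object.(*unstructured.Unstructured)
		if !ok {
			continue
		}
		state, _, _ := unstructured.NestedString(gs.Object, "status", "state")
		addr, _, _ := unstructured.NestedString(gs.Object, "status", "address")
		log.Printf("%s %s: state=%s address=%s", ev.Type, gs.GetName(), state, addr)
	}
}
```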
Looking at https://agones.dev/site/docs/reference/agones_crd_api_reference/#agones.dev/v1.FleetStatus - could you get the counts you need from the Fleet's status, rather than watching individual game servers?
@markmandel I didn't check it tbh, but I don't have any reason to believe it doesn't work. My use-case requires watching game servers anyway, as I want to list their names, IP addresses and player count.
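For reference, reading those counts off FleetStatus with the Agones clientset would look roughly like this; the namespace and fleet name are placeholders, and the generated Go client is assumed:

```go
// Sketch: read a Fleet's status counters instead of watching every GameServer.
package main

import (
	"context"
	"log"

	"agones.dev/agones/pkg/client/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	agones, err := versioned.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	fleet, err := agones.AgonesV1().Fleets("default").Get(context.Background(), "dev-fleet", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("replicas=%d ready=%d allocated=%d",
		fleet.Status.Replicas, fleet.Status.ReadyReplicas, fleet.Status.AllocatedReplicas)
}
```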
Adding another small use case to this: this could be a minor convenience when manually deploying servers. E.g. you have a playtest one morning and deploy a Fleet + Fleet Autoscaler for it. You know that you'll be testing again tomorrow with the same build, but you don't want to leave the servers up for a day (to save money, or to block access). Instead of tearing down the fleet you just scale to 0, then scale back up the next day - you don't have to keep the Fleet/Fleet Autoscaler configuration handy for the second day, which is particularly useful if it's done by a different person. The answer to
in this case is a manual operator.
This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale' please add 'awaiting-maintainer' label or add a comment. Thank you for your contributions
This issue is marked as obsolete due to inactivity for last 60 days. To avoid issue getting closed in next 30 days, please add a comment or add 'awaiting-maintainer' label. Thank you for your contributions
Bumping to unstale |
We are closing this as there was no activity in this issue for last 90 days. Please reopen if you’d like to discuss anything further. |
Manually re-opened, since this looks like a bug in the bot. |
Bug example: googleforgames#1782 Shouldn't have closed, but instead was. I assume because it was still labeled as "obsolete", and it had been 30 days - but only the stale label had been removed. I added `labels-to-remove-when-unstale` to all the operations, since I'm not sure which operation does the work (and figured it couldn't hurt).
Is your feature request related to a problem? Please describe.
Allow the fleet autoscaler to have a buffersize of 0. During development, it is a cost saving not to run idle game servers, and the startup latency of spinning up a new one is not an issue.
Describe the solution you'd like
Change the validation here to allow the buffer size to be 0.
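The linked validation isn't reproduced here; as a purely hypothetical illustration (not the actual Agones source), the requested change amounts to relaxing a lower bound on an absolute buffer size from "at least 1" to "at least 0":

```go
// Hypothetical illustration only - not the real Agones validation code.
package main

import "fmt"

func validateBufferSize(bufferSize int) error {
	// Before: bufferSize < 1 was rejected, so a fleet could never idle at zero.
	// After:  only negative values are rejected, allowing bufferSize == 0.
	if bufferSize < 0 {
		return fmt.Errorf("bufferSize must be >= 0, got %d", bufferSize)
	}
	return nil
}

func main() {
	fmt.Println(validateBufferSize(0)) // <nil> - a buffer of 0 would be accepted
}
```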
Describe alternatives you've considered
Leave it the way it is now.
Additional context
n/a