Updating UpdateList to update the values on a list #3899

Merged (15 commits) on Sep 10, 2024

Conversation

chrisfoster121
Contributor

/kind bug
What this PR does / Why we need it:
This PR updates the UpdateList functionality to update the list values alongside the given capacity. This matters because, without it, managing game server keys gets very complicated and often causes unwanted race conditions when multiple additions or removals are sent to a list in quick succession.

Which issue(s) this PR fixes:
Closes #3870

@github-actions github-actions bot added kind/bug These are bugs. size/S labels Jul 10, 2024
@agones-bot
Collaborator

Build Succeeded 👏

Build Id: ac2c0d9e-1a0b-400a-972d-b7b1e11075ce

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/3899/head:pr_3899 && git checkout pr_3899
  • helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.42.0-dev-19f69d8-amd64

@agones-bot
Collaborator

Build Succeeded 👏

Build Id: b956a631-428e-4fd6-af05-7cf4b8867f62

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/3899/head:pr_3899 && git checkout pr_3899
  • helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.42.0-dev-aa9f223-amd64

@chrisfoster121
Contributor Author

@igooch This is a duplicate of the previous pull request that I made and fixes the CLA issue. What is the process from here to get this approved?

@igooch
Collaborator

igooch commented Jul 10, 2024

I pulled down your PR and installed on a cluster. It doesn't look like the code is working as intended.

me@me:~/agones/build$ curl -H "Content-Type: application/json" -X GET http://localhost:${AGONES_SDK_HTTP_PORT}/v1beta1/lists/players
{"name":"players","capacity":"10","values":["player1","player2","player3"]}
me@me:~/agones/build$ curl -d '{"capacity": "120", "values": ["player3", "player4"]}' -H "Content-Type: application/json" -X PATCH http://localhost:${AGONES_SDK_HTTP_PORT}/v1beta1/lists/players
me@me:~/agones/build$ curl -H "Content-Type: application/json" -X GET http://localhost:${AGONES_SDK_HTTP_PORT}/v1beta1/lists/players
{"name":"players","capacity":"120","values":["player1","player2","player3"]}
me@me:~/agones/build$ curl -d '{"values": ["player5"]}' -H "Content-Type: application/json" -X PATCH http://localhost:${AGONES_SDK_HTTP_PORT}/v1beta1/lists/players
me@me:~/agones/build$ curl -H "Content-Type: application/json" -X GET http://localhost:${AGONES_SDK_HTTP_PORT}/v1beta1/lists/players
{"name":"players","capacity":"0","values":[]}
me@me:~/agones/build$ k describe gs
…
Status:
  …
  Lists:
    Players:
      Capacity:  0
      Values:
…

The UpdateList should make use of field masks to specify which field(s) are being updated. More info on that in the PR that was closed, #3897 (comment).

Please also include tests for the new behavior.

@chrisfoster121
Contributor Author

Hmm that is strange as it worked locally for me. To test this I am using the following for my gameserver.yaml:

apiVersion: "agones.dev/v1"
kind: Fleet
metadata:
  name: simple-game-server-fleet
spec:
  replicas: 1
  template:
    spec:
      sdkServer:
        logLevel: Debug
      lists:
        Players:
          capacity: 0
          values:
      container: simple-game-server
      template:
        spec:
          containers:
            - name: simple-game-server
              image: us-docker.pkg.dev/agones-images/examples/simple-game-server:0.31
            - name: alpine   
              image: alpine/curl 
              command: ["sleep"]         # Override the container entrypoint with a simple sleep command
              args: ["1000000"]  

One thing that is important to note is that when running kubectl describe gs I see the following:

Spec:
  Lists:
    Players:
      Capacity:  0
      Values:
    Players:
      Capacity:  3
      Values:
        player1
        player2
        player3

And:

Status:
  Lists:
    Players:
      Capacity:  0
      Values:
    Players:
      Capacity:  1
      Values:
        player5

I think this is because there is already a predefined list named players, and the describe output capitalizes attribute names.

Anyways, to test I copied your commands into my shell and ran them to get the following output.

/ # curl -H "Content-Type: application/json" -X GET http://localhost:${AGONES_SDK_HTTP_PORT}/v1beta1/lists/players
{"name":"players","capacity":"3","values":["player1","player2","player3"]}/ #
/ #
/ # curl -d '{"capacity": "120", "values": ["player3", "player4"]}' -H "Content-Type: application/json" -X PATCH http://localhost:${AGONES_SDK_HTTP_PORT}/v1beta1/lists/players
{"name":"","capacity":"0","values":[]}/ #
/ #
/ # curl -H "Content-Type: application/json" -X GET http://localhost:${AGONES_SDK_HTTP_PORT}/v1beta1/lists/players
{"name":"players","capacity":"120","values":["player3","player4"]}/ #
/ #
/ # curl -d '{"values": ["player5"]}' -H "Content-Type: application/json" -X PATCH http://localhost:${AGONES_SDK_HTTP_PORT}/v1beta1/lists/players
{"name":"","capacity":"0","values":[]}/ #
/ #
/ # curl -H "Content-Type: application/json" -X GET http://localhost:${AGONES_SDK_HTTP_PORT}/v1beta1/lists/players
{"name":"players","capacity":"0","values":[]}/ #
/ #
/ # curl -d '{"capacity":"1", "values": ["player5"]}' -H "Content-Type: application/json" -X PATCH http://localhost:${AGONES_SDK_HTTP_PORT}/v1beta1/lists/players
{"name":"","capacity":"0","values":[]}/ #
/ #
/ # curl -H "Content-Type: application/json" -X GET http://localhost:${AGONES_SDK_HTTP_PORT}/v1beta1/lists/players
{"name":"players","capacity":"1","values":["player5"]}/ #

@igooch Would you be able to provide more detailed steps to reproduce the issue you are seeing? If I can start replicating this on my end it will be a lot easier to diagnose the issue.


> The UpdateList should make use of the field masks for specifying which field(s) are being updated. More info on that in the PR that was closed #3897 (comment).

I am a bit confused by that. Looking at both AddListValue and RemoveListValue in sdkserver.go, neither of them uses field masks. They both queue the changes to be processed later in updateList. Are field masks specific to UpdateList? I went with that approach because it matched the pattern that was already used. I can change this, but I am trying to understand why one way is preferred over the other.

@chrisfoster121
Contributor Author

@igooch I just wanted to follow up to see if you had a chance to check this out. I was unable to repro your issue so if you can provide a bit more detailed repro steps that would be great, and then I'll hopefully have a better idea of what is going wrong!

@igooch
Collaborator

igooch commented Jul 15, 2024

Seeing two Players keys for both Status and Spec is bizarre -- Lists is a map, so the keys should be unique. The Spec should always remain the same as the original spec template, and the Status changes with the Update, Add, Remove, etc.

I'm also not sure what you mean by not being able to replicate? From what I can tell the code in the above comment is replicated exactly. The only difference being that there's an additional curl -d '{"capacity":"1", "values": ["player5"]}' -H "Content-Type: application/json" -X PATCH command at the end.

The reason the existing SDK Server doesn't currently use the protobuf field mask is that it's only changing one field, the capacity, or in the case of Add or Remove only the values field. Once multiple fields can be changed, that's when the field mask comes into use. Using the field mask allows for updates that change only specific fields and leave the other fields untouched. So the command curl -d '{"capacity": "120"}' -H "Content-Type: application/json" -X PATCH should only change the capacity field, and curl -d '{"values": ["player5"]}' -H "Content-Type: application/json" -X PATCH should only change the values field. Right now, if someone passes in the command curl -d '{"values": ["player5"]}' -H "Content-Type: application/json" -X PATCH, it uses the default value for the UpdateListRequest capacity, which is 0. That's why the Get request after this command returns an empty list. Does that make sense?
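
To illustrate the field-mask behavior described above, here is a minimal Go sketch using only the standard library. A plain slice of path strings stands in for a protobuf FieldMask, and the List type and applyListUpdate function are hypothetical names chosen for illustration, not the Agones SDK server code.

```go
package main

import "fmt"

// List is a simplified, hypothetical stand-in for the SDK's list message.
type List struct {
	Name     string
	Capacity int64
	Values   []string
}

// applyListUpdate copies only the fields named in paths from update onto
// current and leaves every other field untouched. paths plays the role of a
// field mask; an unknown path is rejected rather than silently ignored.
func applyListUpdate(current, update List, paths []string) (List, error) {
	result := current
	// Copy the slice so the returned list never aliases the caller's data.
	result.Values = append([]string(nil), current.Values...)
	for _, p := range paths {
		switch p {
		case "capacity":
			result.Capacity = update.Capacity
		case "values":
			result.Values = append([]string(nil), update.Values...)
		default:
			return List{}, fmt.Errorf("unsupported field mask path %q", p)
		}
	}
	return result, nil
}

func main() {
	current := List{Name: "players", Capacity: 10, Values: []string{"player1", "player2", "player3"}}
	// Only "values" is in the mask, so the zero-valued Capacity of the
	// update is ignored instead of overwriting the existing capacity.
	updated, err := applyListUpdate(current, List{Values: []string{"player5"}}, []string{"values"})
	if err != nil {
		panic(err)
	}
	fmt.Println(updated.Capacity, updated.Values) // prints "10 [player5]"
}
```

Without the mask, the update's zero-valued capacity would be indistinguishable from an explicit request to set the capacity to 0, which is the behavior seen in the repro above.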

@chrisfoster121
Contributor Author

chrisfoster121 commented Jul 17, 2024

@igooch I have updated the PR to use the field mask system as requested. I have not gotten a chance to test this out yet and plan to do so in the next day or so.

However, I wanted to see if I could get some feedback on this as I ran into some issues with the unit tests. When running TestSDKServerUpdateList in sdkserver_test.go I ran into issues getting the game server data to patch correctly. Specifically, when calling s.patchGameServer(ctx, gs, gsCopy) in UpdateList it appears to call s.gameServerGetter.GameServers(s.namespace) which seems to return a FakeGameServer.

Is this expected for unit tests, and if so, is there something I need to change to allow the test case to pass correctly? In the meantime I added a line that updates the game server directly so the tests pass. However, I don't think this should be submitted and would like to find the correct solution. Is this something you are familiar with?

I have also converted this PR to a draft until this is resolved.

@agones-bot
Collaborator

Build Failed 😱

Build Id: ac7d8248-4a33-4bc2-8c28-aed8a44f8330

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@chrisfoster121 chrisfoster121 marked this pull request as draft July 17, 2024 20:47
pkg/sdkserver/sdkserver.go
list.Capacity = tmpList.Capacity
list.Values = tmpList.Values
gsCopy.Status.Lists[name] = list
s.patchGameServer(ctx, gs, gsCopy)
Collaborator

We don't want to call this here. All the calls are batched, and the actual change to the gameserver then happens in updateList:

gs, err = s.patchGameServer(ctx, gs, gsCopy)

@markmandel do you want to weigh in here? It does seem a bit odd to batch an "overwrite" function?

Member

+1 to batching for a few reasons:

  1. It puts work items in the workerqueue - which for an "overwrite" operation isn't necessary for conflicts, but is necessary if the KCP goes down for any reason -- so we stay self-healing.
  2. The experience across the SDK stays the same regardless of operation.
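
As a rough sketch of why batching keeps things self-healing, the Go example below queues the latest requested state per list and only drops an entry once it has been applied successfully. Everything here (batcher, pendingUpdate, errUnavailable, the apply callback) is a hypothetical illustration under simplifying assumptions, not the Agones workerqueue; for instance, it holds the lock while applying, which a real implementation would likely avoid.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errUnavailable simulates a control plane that cannot be reached.
var errUnavailable = errors.New("control plane unavailable")

// pendingUpdate is the latest requested state for one list.
type pendingUpdate struct {
	Capacity int64
	Values   []string
}

// batcher collects updates per list name and applies them later in one pass.
type batcher struct {
	mu      sync.Mutex
	pending map[string]pendingUpdate
}

func newBatcher() *batcher {
	return &batcher{pending: make(map[string]pendingUpdate)}
}

// Enqueue records an update; a later update for the same list replaces it.
func (b *batcher) Enqueue(name string, u pendingUpdate) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pending[name] = u
}

// Flush hands each pending update to apply. Updates that fail stay queued,
// so a later Flush can retry them once the backend is reachable again.
func (b *batcher) Flush(apply func(name string, u pendingUpdate) error) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	var firstErr error
	for name, u := range b.pending {
		if err := apply(name, u); err != nil {
			if firstErr == nil {
				firstErr = fmt.Errorf("list %q: %w", name, err)
			}
			continue
		}
		delete(b.pending, name)
	}
	return firstErr
}

// Pending reports how many updates are still waiting to be applied.
func (b *batcher) Pending() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	return len(b.pending)
}

func main() {
	b := newBatcher()
	b.Enqueue("players", pendingUpdate{Capacity: 120, Values: []string{"player3", "player4"}})

	// The first attempt fails, so the update stays queued.
	err := b.Flush(func(string, pendingUpdate) error { return errUnavailable })
	fmt.Println(err != nil, b.Pending()) // prints "true 1"

	// A later attempt succeeds and drains the queue.
	err = b.Flush(func(string, pendingUpdate) error { return nil })
	fmt.Println(err == nil, b.Pending()) // prints "true 0"
}
```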

@agones-bot
Collaborator

Build Failed 😱

Build Id: 3b9b1854-9f92-4f6a-bf5f-3a04849188e9

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😱

Build Id: 858c9972-5509-4d5f-83d6-9dde95825719

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😭

Build Id: 73a0e49f-c29b-438f-b086-2261905ba356

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@github-actions github-actions bot added the size/M label Aug 9, 2024
@agones-bot
Collaborator

Build Succeeded 🥳

Build Id: 094d510f-5e3c-4429-87bc-552f22a1dbda

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/3899/head:pr_3899 && git checkout pr_3899
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.43.0-dev-10dc311

@chrisfoster121 chrisfoster121 marked this pull request as ready for review August 9, 2024 16:16
@agones-bot
Collaborator

Build Failed 😭

Build Id: a82fbd60-8bcd-49b7-bb0e-91485829ca96

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😭

Build Id: 38b0b88c-0113-46c3-9243-294b6d3b2b84

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😭

Build Id: 6af3ddb0-ba08-4478-beda-3c6961d654bf

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@chrisfoster121
Contributor Author

@igooch I appear to be failing on the submit-e2e-test-cloud-build step:

Step #25 - "submit-e2e-test-cloud-build": gke-autopilot-1.28: BUILD FAILURE: Build step failure: build step 1 "e2e-runner" failed: step exited with non-zero status: 2
Step #25 - "submit-e2e-test-cloud-build": gke-autopilot-1.28: ERROR: (gcloud.builds.submit) build 8050e9fa-86df-44f5-8c48-2a4a68048363 completed with status "FAILURE"
Step #25 - "submit-e2e-test-cloud-build": One of the e2e test child cloud build exited with nonzero status 1. Aborting.

I am not sure what this step is, would you be able to clarify this for me?

@igooch
Collaborator

igooch commented Aug 9, 2024

> @igooch I appear to be failing on the submit-e2e-test-cloud-build step:
>
> Step #25 - "submit-e2e-test-cloud-build": gke-autopilot-1.28: BUILD FAILURE: Build step failure: build step 1 "e2e-runner" failed: step exited with non-zero status: 2
> Step #25 - "submit-e2e-test-cloud-build": gke-autopilot-1.28: ERROR: (gcloud.builds.submit) build 8050e9fa-86df-44f5-8c48-2a4a68048363 completed with status "FAILURE"
> Step #25 - "submit-e2e-test-cloud-build": One of the e2e test child cloud build exited with nonzero status 1. Aborting.
>
> I am not sure what this step is, would you be able to clarify this for me?

Builds are currently broken -- @zmerlynn is investigating #3939. I'll try and take a look later today as well.

@chrisfoster121
Contributor Author

> Builds are currently broken -- @zmerlynn is investigating #3939. I'll try and take a look later today as well.

Ah I see, thank you for the heads up!

In the meantime, I do have some questions about some of the codebase as it relates to the batching/queueing system.

I may be mistaken but it seems that when a request to change a list comes in (Add, Remove, and now Update), the current value in the queue gets overwritten. SDKServer has a variable gsListUpdates map[string]listUpdateRequest which is a one to one mapping. Then in AddListValue, for example, we have the following code:

batchList := s.gsListUpdates[in.Name]
batchList.valuesToAppend = list.Values
s.gsListUpdates[in.Name] = batchList

In this case that value for how to update the list is getting overwritten. My worry is that this will create a scenario where sending too many requests at the same time will cause the system to overwrite requests. In fact, this was the initial reason I started looking into using UpdateList. I was running into a scenario where I would make too many AddListValue requests in a short amount of time resulting in some of them not getting processed. Is this intentional and if so are there plans for making the request wait to return until the list has been updated? Alternatively, are there plans to make this no longer overwrite previous requests?

@igooch
Collaborator

igooch commented Aug 9, 2024

> In the meantime, I do have some questions about some of the codebase as it relates to the batching/queueing system.
>
> I may be mistaken but it seems that when a request to change a list comes in (Add, Remove, and now Update), the current value in the queue gets overwritten. SDKServer has a variable gsListUpdates map[string]listUpdateRequest which is a one to one mapping. Then in AddListValue, for example, we have the following code:
>
> batchList := s.gsListUpdates[in.Name]
> batchList.valuesToAppend = list.Values
> s.gsListUpdates[in.Name] = batchList
>
> In this case that value for how to update the list is getting overwritten. My worry is that this will create a scenario where sending too many requests at the same time will cause the system to overwrite requests. In fact, this was the initial reason I started looking into using UpdateList.

The AddListValue locks on the resource, so multiple requests can't write at the same time.

s.gsUpdateMutex.Lock()

It then appends the new value to the list, so that all values are written back.

list.Values = append(list.Values, in.Value)

So batching complicates the UpdateList behavior. If two UpdateLists come in at the same time we could either
a) overwrite everything, although any add or remove requests that come in after the update list request but before the batch would still modify the batch
b) append all unique UpdateList values to existing list of values
c) not batch update list requests (although as @markmandel mentioned this isn't preferred behavior)
d) something else.
Out of this list I'd lean towards the option b).
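
To make option b) concrete, here is a minimal Go sketch of appending only the unique incoming values to an existing list while preserving order. The mergeUnique helper is a hypothetical name used for illustration and is not part of the SDK server.

```go
package main

import "fmt"

// mergeUnique returns the values of existing followed by any values from
// incoming that are not already present, keeping first-seen order.
func mergeUnique(existing, incoming []string) []string {
	seen := make(map[string]struct{}, len(existing)+len(incoming))
	merged := make([]string, 0, len(existing)+len(incoming))
	for _, src := range [][]string{existing, incoming} {
		for _, v := range src {
			if _, ok := seen[v]; ok {
				continue // skip values that were already added
			}
			seen[v] = struct{}{}
			merged = append(merged, v)
		}
	}
	return merged
}

func main() {
	existing := []string{"player1", "player2", "player3"}
	incoming := []string{"player3", "player4"}
	fmt.Println(mergeUnique(existing, incoming)) // prints "[player1 player2 player3 player4]"
}
```

Note that with this approach an UpdateList call can only grow the list, so removing entries would still require a separate remove request.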

> I was running into a scenario where I would make too many AddListValue requests in a short amount of time resulting in some of them not getting processed. Is this intentional and if so are there plans for making the request wait to return until the list has been updated? Alternatively, are there plans to make this no longer overwrite previous requests?

This is interesting. Have you seen any errors like error retrieving resource lock, or other debug logs like out of range or already exists, that indicate which requests are dropped?

@agones-bot
Collaborator

Build Succeeded 🥳

Build Id: a3bf987c-88a1-4ed5-9b9e-6e0e13965c0d

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/3899/head:pr_3899 && git checkout pr_3899
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.44.0-dev-9476858

@chrisfoster121
Contributor Author

@igooch I added a mutex lock and made minor adjustments to clean the code up a bit. Does this look correct?

Collaborator

@igooch igooch left a comment

Thanks for the update, one quick adjustment to the mutex. Just to confirm on this, the UpdateList will overwrite everything, although any add or remove or other update requests that come in after the initial update list request but before the batch would still modify the batch. Will this behavior work for your use case?

pkg/sdkserver/sdkserver.go
Comment on lines 1082 to 1083
s.gsUpdateMutex.RLock()
defer s.gsUpdateMutex.RUnlock()
Collaborator

Since we're writing to s.gsListUpdates[list.Name], this should use Lock() rather than RLock(), and the lock will need to be moved below the s.GetList call since that method uses an RLock().
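
The ordering matters because Go's sync.RWMutex is not reentrant and cannot be upgraded: calling Lock() while the same goroutine still holds an RLock() deadlocks. The sketch below shows the pattern of letting the read helper take and release its own read lock before the write lock is acquired. The listStore type and its methods are hypothetical names for illustration, not the actual SDK server code.

```go
package main

import (
	"fmt"
	"sync"
)

// listStore holds current lists plus queued updates behind one RWMutex.
type listStore struct {
	mu      sync.RWMutex
	lists   map[string][]string
	pending map[string][]string
}

func newListStore(lists map[string][]string) *listStore {
	return &listStore{lists: lists, pending: make(map[string][]string)}
}

// get reads under the read lock and returns a copy of the values.
func (s *listStore) get(name string) ([]string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.lists[name]
	return append([]string(nil), v...), ok
}

// queueUpdate validates that the list exists, then records the update.
func (s *listStore) queueUpdate(name string, values []string) error {
	// get acquires and releases the read lock on its own, so no lock is
	// held when we reach the write lock below.
	if _, ok := s.get(name); !ok {
		return fmt.Errorf("list %q not found", name)
	}
	// Writing to the pending map needs the exclusive lock, not RLock.
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pending[name] = append([]string(nil), values...)
	return nil
}

// pendingFor returns a copy of the queued values for a list.
func (s *listStore) pendingFor(name string) []string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return append([]string(nil), s.pending[name]...)
}

func main() {
	s := newListStore(map[string][]string{"players": {"player1"}})
	fmt.Println(s.queueUpdate("players", []string{"player3", "player4"}))
	fmt.Println(s.pendingFor("players"))
	fmt.Println(s.queueUpdate("missing", nil))
}
```

One trade-off of this pattern is the small window between the read and the write in which another goroutine could change the list, so the write path should not rely on the earlier read still being current.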

@chrisfoster121
Contributor Author

> Thanks for the update, one quick adjustment to the mutex. Just to confirm on this, the UpdateList will overwrite everything, although any add or remove or other update requests that come in after the initial update list request but before the batch would still modify the batch. Will this behavior work for your use case?

Yes, at least for us. We were expecting the list to be overwritten by UpdateList calls, which this should do. This allows us to completely change the list in a single call if needed.

I will update the code shortly. Thank you!

@agones-bot
Collaborator

Build Succeeded 🥳

Build Id: 0d4f09de-bcd7-4bd9-a791-8961e515e011

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/3899/head:pr_3899 && git checkout pr_3899
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.44.0-dev-22ef3ce

Collaborator

@igooch igooch left a comment

LGTM, thank you for your work and patience on this!

@igooch igooch merged commit 632a866 into googleforgames:main Sep 10, 2024
4 checks passed
Development

Successfully merging this pull request may close these issues.

Beta: UpdateList using Agones 1.41 rest API doesn't update the data in the selected list
5 participants