
Fix possible race condition in debounceCounts #162

Merged

merged 3 commits into rancher:master on Mar 14, 2024

Conversation

JonCrowther
Contributor

Issue: rancher/rancher#43515

Some users are experiencing a race condition in which apiserver hits a fatal error while marshalling an APIObject, causing the Cluster Agent to restart.

Based on the investigation in rancher/rancher#43515, they identified the source as the Count type. Investigating the code: the debounceCounts function sends currentCount as an APIEvent every 5 seconds (via debounceDuration). However, whenever the count changes (a resource is added or removed), it updates the map currentCount.Counts. I believe that if the JSON marshalling in APIServer takes long enough, debounceCounts will start modifying that map while the marshaller is still iterating over it, resulting in `fatal error: concurrent map iteration and map write`. To remedy this, this PR passes in a copy of currentCount instead of passing it directly.
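The debounce-and-copy pattern described above can be sketched as follows. The types and channel shapes here are simplified stand-ins for illustration, not steve's actual definitions:

```go
package main

import (
	"fmt"
	"time"
)

// Count is a simplified stand-in for steve's count type: a struct
// wrapping a map that is mutated whenever a resource is added or removed.
type Count struct {
	Counts map[string]int
}

// copyCounts returns a Count with its own map, so the receiver can
// iterate over it (e.g. while JSON-marshalling) without racing the writer.
func (c Count) copyCounts() Count {
	out := Count{Counts: make(map[string]int, len(c.Counts))}
	for k, v := range c.Counts {
		out.Counts[k] = v
	}
	return out
}

// debounceCounts forwards the latest count at most once per interval.
// The crucial detail is sending a copy rather than the live, shared map.
func debounceCounts(updates <-chan Count, result chan<- Count, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	var current *Count
	for {
		select {
		case c, ok := <-updates:
			if !ok {
				return
			}
			current = &c
		case <-ticker.C:
			if current != nil {
				result <- current.copyCounts() // copy, not the live map
				current = nil
			}
		}
	}
}

func main() {
	updates := make(chan Count, 1)
	result := make(chan Count, 1)
	go debounceCounts(updates, result, 5*time.Millisecond)

	updates <- Count{Counts: map[string]int{"configmaps": 3}}
	got := <-result
	fmt.Println(got.Counts["configmaps"]) // prints 3
}
```

Because the receiver gets an independent map, the writer can keep mutating the original between ticks without triggering the concurrent map iteration and map write panic.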

This change is completely theoretical. I can't reproduce the race condition, and I believe it requires a list of resources large enough that JSON marshalling takes long enough for the count to change mid-marshal.

@JonCrowther JonCrowther self-assigned this Mar 1, 2024
@JonCrowther JonCrowther force-pushed the count-race-condition branch from 3bd6171 to 0e74495 on March 1, 2024 15:50
eumel8 added a commit to caas-team/steve that referenced this pull request Mar 1, 2024
@puffitos

puffitos commented Mar 3, 2024

We'll test this tomorrow in our fork of steve and get back to you in a day or so after that (the crash typically happens once per day).

Contributor

@moio moio left a comment
I do not have a problem approving this once empirical evidence shows it helps. I'm keeping notifications on for this PR and the linked issues, and will change my review to approve as soon as that is confirmed.

@JonCrowther
Contributor Author

So after some trial and error, I found a way to consistently reproduce the panic locally. To do so, I did the following:

  • Reduced debounceDuration to 5 milliseconds so the debounced event fires more frequently
  • Opened a page that has the necessary websocket to receive count events (I chose the cluster page, where object counts are displayed)
  • Created a bash script that creates configmaps in a loop, and ran 4 instances of the script

The timing varies, but after around 6-10k configmaps I consistently get the concurrent map iteration and map write panic.


I tested the original changes I made for this PR, where I attempted to DeepCopy the count when it is sent as an apiEvent:

result <- toAPIEvent(*currentCount)

This still panicked, but now the panic happened during the DeepCopy itself, specifically while copying ItemCount.Namespaces. After some more investigating, I realized the underlying issue is that we pass ItemCount directly into the Count channel we hand to debounceCounts:

schema.ID: itemCount,

I believe what happens is that we register multiple Watches for the Counts schema (I'm assuming one per websocket?), and when the OnChange function is called, it passes a pointer to the same ItemCount to two different channels. One channel's consumer then edits the ItemCount.Namespaces map while the other is iterating over it.

My solution is to create a copy of ItemCount and pass that to the channel, rather than a pointer to the original object. That way each channel receives a completely separate object that it can modify without affecting the other channels. I've run this 5-10 times and created 40k+ configmaps without any error, whereas before I consistently got a panic after ~7k configmaps.
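A minimal sketch of that fix, with ItemCount reduced to a hypothetical two-field struct (the real steve type has more fields and a generated DeepCopy):

```go
package main

import "fmt"

// ItemCount is a simplified stand-in for steve's type: the Namespaces
// map is what the watchers were concurrently iterating and writing.
type ItemCount struct {
	Summary    int
	Namespaces map[string]int
}

// DeepCopy returns an independent ItemCount with a freshly allocated
// Namespaces map, so each Watch channel gets its own object.
func (c ItemCount) DeepCopy() ItemCount {
	out := ItemCount{
		Summary:    c.Summary,
		Namespaces: make(map[string]int, len(c.Namespaces)),
	}
	for ns, n := range c.Namespaces {
		out.Namespaces[ns] = n
	}
	return out
}

func main() {
	orig := ItemCount{Summary: 1, Namespaces: map[string]int{"default": 1}}

	// One channel per watcher; each receives its own copy instead of a
	// pointer to the shared object.
	watchers := []chan ItemCount{make(chan ItemCount, 1), make(chan ItemCount, 1)}
	for _, ch := range watchers {
		ch <- orig.DeepCopy()
	}

	// Mutating the original no longer affects what a watcher received.
	orig.Namespaces["kube-system"] = 5
	got := <-watchers[0]
	fmt.Println(len(got.Namespaces)) // prints 1
}
```

Copying at the send site costs one map allocation per event, but it removes all sharing between watchers, which is why the panic disappears.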

Contributor

@moio moio left a comment

Approved.

#162 (comment) sounds convincing enough to me.

@JonCrowther JonCrowther merged commit ca29f47 into rancher:master Mar 14, 2024
1 check passed
@JonCrowther JonCrowther deleted the count-race-condition branch March 14, 2024 14:52
aruiz14 pushed a commit to aruiz14/steve that referenced this pull request Mar 22, 2024