Fix possible race condition in debounceCounts #162
3bd6171 to 0e74495
We'll test this tomorrow in our fork of steve and get back to you in a day or so after that (the crash typically happens once per day).
I do not have a problem approving this once empirical evidence shows it helps. I'm keeping notifications for this PR and the issues and will change to approve as soon as that is confirmed.
So after some trial and error, I found a way to consistently reproduce the panic locally. To do so, I did the following:
It takes different amounts of time, but after around 6-10k configmaps I get the
I tested the original changes I made for this PR where I attempted to do
steve/pkg/resources/counts/buffer.go Line 52 in 7913f27
This was still causing a panic, but now it was causing a panic during the
steve/pkg/resources/counts/counts.go Line 184 in 7913f27
I believe what happens is that we register multiple
My solution is to create a copy of
Fix race condition in debounceCounts
Issue: rancher/rancher#43515
Some users are experiencing a race condition in which apiserver hits a fatal error while marshalling an APIObject. This causes the Cluster Agent to restart.
Based on the investigation in rancher/rancher#43515, they identified the source as the `Count` type. Investigating the code, the `debounceCounts` function sends `currentCount` as an `APIEvent` every 5 seconds (via `debounceDuration`). However, when the count is modified (a resource is added or removed), it updates the map `currentCount.Counts`. I believe that if the JSON marshalling in APIServer takes long enough, `debounceCounts` will start modifying that map while it is still being marshalled, resulting in `fatal error: concurrent map iteration and map write`. To remedy this, I will be passing in a copied version of `currentCount` instead of passing it directly.
This change is completely theoretical. I can't reproduce the race condition, and I believe it requires a large list of resources that causes the JSON marshalling to take a long enough time for the count to change while it's marshalling.
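To illustrate the copy-before-send idea, here is a minimal sketch. `ItemCount`, `Count`, and `DeepCopy` below are simplified, hypothetical stand-ins, not the actual definitions in steve's counts package. If the real `ItemCount` contains nested maps or slices, those would need to be copied as well.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ItemCount and Count are simplified, hypothetical stand-ins for the types
// discussed above; the real definitions may carry more fields.
type ItemCount struct {
	Count int `json:"count"`
}

type Count struct {
	ID     string               `json:"id"`
	Counts map[string]ItemCount `json:"counts"`
}

// DeepCopy (an assumed helper name) returns a Count whose map shares no
// storage with the receiver, so the receiver can keep being mutated while
// the copy is marshalled elsewhere.
func (c Count) DeepCopy() Count {
	out := Count{ID: c.ID, Counts: make(map[string]ItemCount, len(c.Counts))}
	for k, v := range c.Counts {
		// ItemCount holds only value fields in this sketch, so plain
		// assignment copies it completely.
		out.Counts[k] = v
	}
	return out
}

func main() {
	current := Count{
		ID:     "count",
		Counts: map[string]ItemCount{"configmap": {Count: 1}},
	}

	// Snapshot that would be sent as the event payload.
	snapshot := current.DeepCopy()

	// A later update by the producer touches only the original map.
	current.Counts["configmap"] = ItemCount{Count: 2}

	b, err := json.Marshal(snapshot)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

Because the consumer only ever sees the snapshot, the marshaller iterates over a map that no other goroutine writes to. The cost is one map copy per send, which should be small next to the marshalling itself.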