
Fix possible race condition in debounceCounts #162

Merged

merged 3 commits into rancher:master on Mar 14, 2024

Conversation

JonCrowther
Contributor

Issue: rancher/rancher#43515

Some users are experiencing a race condition in which apiserver hits a fatal error while marshalling an APIObject, causing the Cluster Agent to restart.

Based on the investigation in rancher/rancher#43515, they identified the source as the Count type. Investigating the code: the debounceCounts function sends currentCount as an APIEvent every 5 seconds (via debounceDuration). However, whenever the count changes (a resource is added or removed), it updates the map currentCount.Counts. I believe that if the JSON marshalling in APIServer takes long enough, debounceCounts will start modifying that map while the marshaller is still iterating over it, resulting in `fatal error: concurrent map iteration and map write`. To remedy this, this PR passes in a copy of currentCount instead of passing it directly.
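The debounce-and-copy pattern described above can be sketched as follows. The types and channel shapes here are simplified stand-ins for illustration, not steve's actual definitions:

```go
package main

import (
	"fmt"
	"time"
)

// Count is a simplified stand-in for steve's count type: a struct
// wrapping a map that is mutated whenever a resource is added or removed.
type Count struct {
	Counts map[string]int
}

// copyCounts returns a Count with its own map, so the receiver can
// iterate over it (e.g. while JSON-marshalling) without racing the writer.
func (c Count) copyCounts() Count {
	out := Count{Counts: make(map[string]int, len(c.Counts))}
	for k, v := range c.Counts {
		out.Counts[k] = v
	}
	return out
}

// debounceCounts forwards the latest count at most once per interval.
// The crucial detail is sending a copy rather than the live, shared map.
func debounceCounts(updates <-chan Count, result chan<- Count, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	var current *Count
	for {
		select {
		case c, ok := <-updates:
			if !ok {
				return
			}
			current = &c
		case <-ticker.C:
			if current != nil {
				result <- current.copyCounts() // copy, not the live map
				current = nil
			}
		}
	}
}

func main() {
	updates := make(chan Count, 1)
	result := make(chan Count, 1)
	go debounceCounts(updates, result, 5*time.Millisecond)

	updates <- Count{Counts: map[string]int{"configmaps": 3}}
	got := <-result
	fmt.Println(got.Counts["configmaps"]) // prints 3
}
```

Because the receiver gets an independent map, the writer can keep mutating the original between ticks without triggering the concurrent map iteration and map write panic.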

This change is completely theoretical. I can't reproduce the race condition, and I believe it requires a list of resources large enough that JSON marshalling takes long enough for the count to change mid-marshal.

@JonCrowther JonCrowther self-assigned this Mar 1, 2024
@JonCrowther JonCrowther force-pushed the count-race-condition branch from 3bd6171 to 0e74495 on March 1, 2024 15:50
eumel8 added a commit to caas-team/steve that referenced this pull request Mar 1, 2024
@puffitos

puffitos commented Mar 3, 2024

We'll test this tomorrow in our fork of steve and get back to you in a day or so after that (the crash typically happens once per day).

Contributor

@moio moio left a comment
I do not have a problem approving this once empirical evidence shows it helps. I'm keeping notifications on for this PR and the linked issues, and will change my review to approve as soon as that is confirmed.

@JonCrowther
Contributor Author

So after some trial and error, I found a way to consistently reproduce the panic locally. To do so, I did the following:

  • Reduced debounceDuration to 5 milliseconds so the debounced event fires more frequently
  • Opened a page that has the necessary websocket to receive count events (I chose the cluster page, where object counts are displayed)
  • Created a bash script that creates configmaps in a loop, and ran 4 instances of the script

The timing varies, but after around 6-10k configmaps I consistently get the concurrent map iteration and map write panic.


I tested the original changes I made for this PR, where I attempted to DeepCopy the count when it is sent as an apiEvent:

result <- toAPIEvent(*currentCount)

This still panicked, but now the panic happened during the DeepCopy itself, specifically while copying ItemCount.Namespaces. After some more investigating, I realized the underlying issue is that we pass ItemCount directly into the Count channel we hand to debounceCounts:

schema.ID: itemCount,

I believe what happens is that we register multiple Watches for the Counts schema (I'm assuming one per websocket?), and when the OnChange function is called, it passes a pointer to the same ItemCount to two different channels. One channel's consumer then edits the ItemCount.Namespaces map while the other is iterating over it.

My solution is to create a copy of ItemCount and pass that to the channel, rather than a pointer to the original object. That way each channel receives a completely separate object that it can modify without affecting the other channels. I've run this 5-10 times and created 40k+ configmaps without any error, whereas before I consistently got a panic after ~7k configmaps.
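A minimal sketch of that fix, with ItemCount reduced to a hypothetical two-field struct (the real steve type has more fields and a generated DeepCopy):

```go
package main

import "fmt"

// ItemCount is a simplified stand-in for steve's type: the Namespaces
// map is what the watchers were concurrently iterating and writing.
type ItemCount struct {
	Summary    int
	Namespaces map[string]int
}

// DeepCopy returns an independent ItemCount with a freshly allocated
// Namespaces map, so each Watch channel gets its own object.
func (c ItemCount) DeepCopy() ItemCount {
	out := ItemCount{
		Summary:    c.Summary,
		Namespaces: make(map[string]int, len(c.Namespaces)),
	}
	for ns, n := range c.Namespaces {
		out.Namespaces[ns] = n
	}
	return out
}

func main() {
	orig := ItemCount{Summary: 1, Namespaces: map[string]int{"default": 1}}

	// One channel per watcher; each receives its own copy instead of a
	// pointer to the shared object.
	watchers := []chan ItemCount{make(chan ItemCount, 1), make(chan ItemCount, 1)}
	for _, ch := range watchers {
		ch <- orig.DeepCopy()
	}

	// Mutating the original no longer affects what a watcher received.
	orig.Namespaces["kube-system"] = 5
	got := <-watchers[0]
	fmt.Println(len(got.Namespaces)) // prints 1
}
```

Copying at the send site costs one map allocation per event, but it removes all sharing between watchers, which is why the panic disappears.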

Contributor

@moio moio left a comment

Approved.

#162 (comment) sounds convincing enough to me.

@JonCrowther JonCrowther merged commit ca29f47 into rancher:master Mar 14, 2024
1 check passed
@JonCrowther JonCrowther deleted the count-race-condition branch March 14, 2024 14:52
aruiz14 pushed a commit to aruiz14/steve that referenced this pull request Mar 22, 2024