Reduce deadlock in controller #2349
Conversation
@michalpristas does it relate to any existing issue?
Left a couple of open questions, but in general I feel like we need to refactor the interaction among the goroutines a bit.
We are using channels for notification but we pull the data from the provider directly, and that makes the code we have to write awkward in places.
As for the PR, I didn't see any major flaw/bug, but before approving I just need someone to check that we are not changing some side effect we rely on by not synchronizing readers and writers on a channel.
// wait for all providers to stop (but its possible they still send notifications over notify
// channel, and we cannot block them sending)
emptyChan, emptyCancel := context.WithCancel(context.Background())
I am not entirely sure what the purpose of having a cancellable context is when we call cancel upon leaving the function scope.
Is it to try to consume as many notifications as possible, right up to the last possible moment, once the providers signal that they are done?
If that is the idea, wouldn't a simple channel to stop the consuming goroutine suffice instead of creating a context?
(It was already present in the previous version, so it's not critical for the fix, just food for thought.)
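To make the alternative concrete, here is a minimal, self-contained sketch of what this comment suggests: a plain done channel that stops the goroutine draining late notifications, instead of a cancellable context. The names (notifyCh, done) are illustrative and not the controller's actual identifiers.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	// notifyCh stands in for the providers' notify channel (illustrative name).
	notifyCh := make(chan bool, 1)
	done := make(chan struct{})

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for {
			select {
			case <-notifyCh:
				// discard late notifications so the providers never block on send
			case <-done:
				return
			}
		}
	}()

	// simulate a provider still sending a notification while shutting down
	notifyCh <- true
	time.Sleep(10 * time.Millisecond)

	close(done) // providers have stopped; stop the draining goroutine
	wg.Wait()
	fmt.Println("notifications drained, goroutine stopped")
}
```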
default:
	// c.ch is size of 1, nothing is reading and there's already a signal
	select {
	case <-c.ch:
This is a strange pattern: we are consuming from the same channel we are producing on because we didn't manage to write another value.
I understand why it's here, but it shows that we have some big sync issues if we have to solve them this way.
I agree with the refactoring option as a follow-up. I'd rather have notifications in the form of .NET-style events, where the producer does not need to consider readers.
As we talked about, I wanted this PR to be as minimally intrusive as possible so we can spend proper time on the refactor without time pressure from production issues.
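As a rough illustration of that refactor idea only (not code from this PR), a broadcast-style notifier could look like the sketch below: producers call Notify and never block or drain anything, and each subscriber owns its own coalescing channel. All names here are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// notifier sketches the ".NET events"-style API mentioned above.
type notifier struct {
	mu   sync.Mutex
	subs []chan struct{}
}

// Subscribe registers a new subscriber with a size-1, coalescing channel.
func (n *notifier) Subscribe() <-chan struct{} {
	ch := make(chan struct{}, 1)
	n.mu.Lock()
	n.subs = append(n.subs, ch)
	n.mu.Unlock()
	return ch
}

// Notify never blocks: a subscriber that already has a pending signal is skipped.
func (n *notifier) Notify() {
	n.mu.Lock()
	defer n.mu.Unlock()
	for _, ch := range n.subs {
		select {
		case ch <- struct{}{}:
		default: // signal already pending; coalesce
		}
	}
}

func main() {
	n := &notifier{}
	sub := n.Subscribe()
	n.Notify()
	n.Notify() // coalesces with the first; the producer never blocks
	<-sub
	fmt.Println("subscriber woke up once for both notifications")
}
```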
Create an issue for the refactor so we don't forget to do it please :)
What is the possibility that this results in a vars set being lost? I'm confused about why this is really needed. There should always be a reader, and if there is no reader, the controller should really block to ensure that all vars are read.
The idea is that when the controller is handling something else, this becomes blocking.
While it is blocked, no new vars are processed and stored in the mappings because we have a loop stuck here. So this helps the controller get the most recent mappings based on actual events.
I just worry that if we empty the channel like it is here, then when the reader of the channel starts reading again, the set of variables will already have been removed and it will not get the variables from the channel.
That should not happen: when we drain, we don't cancel the loop, and the vars will be pushed again in the next cycle.
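For context, here is a minimal sketch of the drain-and-resend pattern being discussed, under the assumption of a size-1 channel whose reader pulls the latest vars from the provider state rather than from the message itself; names are illustrative, not the controller's identifiers.

```go
package main

import "fmt"

// send sketches the coalescing pattern from the diff above: the channel has a
// buffer of one, so if nobody is reading and a stale signal is already queued,
// drop the stale one and queue a fresh one. The reader only ever learns that
// "something changed" and fetches the latest vars from the provider state, so
// replacing the queued signal loses nothing.
func send(ch chan bool) {
	select {
	case ch <- true:
		// fast path: buffer was empty
	default:
		// buffer is full: drain the stale signal, then queue the fresh one
		select {
		case <-ch:
		default:
		}
		select {
		case ch <- true:
		default:
		}
	}
}

func main() {
	ch := make(chan bool, 1)
	send(ch)
	send(ch) // second update coalesces with the first
	<-ch
	fmt.Println("reader woke up once and will read the latest vars")
}
```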
@@ -256,7 +281,8 @@ func (c *contextProviderState) Set(mapping map[string]interface{}) error {
 		return nil
 	}
 	c.mapping = mapping
-	c.signal <- true
+
+	go func() { c.signal <- true }()
Do we try to write multiple times to c.signal as well? Or are there cases when there's no reader ready?
Is that the reason we write into the channel from a goroutine?
What about returning without this synchronization point; are there any side effects?
Yes, both can be true, but mostly multiple writes; this is triggered on the OnAdd event.
What is the point of moving this into a go function? I would prefer not to create a goroutine on every change, allocating a 4k stack for every little event.
What other options do we have in this path? What is the real reason for this change?
Would the following work without creating a goroutine:
select {
case c.signal <- true:
default:
}
Moving this into a goroutine makes sure the signal is not lost somewhere.
When the controller is handling other stuff, like a config, writing to signal would fail and the signal would be lost.
This way the signal is not lost, and the goroutines are quickly destroyed when the channel is drained.
The reason for this is that when notify is not read right away, this will hang. While it hangs, the lock is not released, and even the OnAdd event receiver will be stuck acquiring the lock.
notify is read in a different loop and depends on the coordinator reading a value, so a slow coordinator will make this stuck and create a lot of hanging OnAdd handlers waiting for locks.
I can imagine a large number of hanging OnAdd handlers producing out-of-order events as well.
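For readers unfamiliar with the controller, this is a minimal sketch of the failure mode described above, with illustrative names rather than the actual provider code: a blocking send on signal while the state mutex is held leaves any other handler, such as OnAdd, stuck waiting for the lock.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// state mirrors, with illustrative names, the pre-fix shape described above:
// Set holds the mutex while doing a blocking send on signal, so while the
// coordinator is busy and not reading, other handlers pile up behind the lock.
type state struct {
	mu     sync.Mutex
	signal chan bool
}

func (s *state) Set() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.signal <- true // blocks until someone reads; the lock stays held meanwhile
}

func main() {
	s := &state{signal: make(chan bool)} // nobody is reading yet

	go s.Set() // first event: blocks on the send while holding the lock
	time.Sleep(50 * time.Millisecond)

	locked := make(chan struct{})
	go func() { // second event handler (think OnAdd): stuck acquiring the lock
		s.mu.Lock()
		close(locked)
		s.mu.Unlock()
	}()

	select {
	case <-locked:
		fmt.Println("second handler got the lock")
	case <-time.After(200 * time.Millisecond):
		fmt.Println("second handler is stuck waiting for the lock")
	}
}
```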
If c.signal has a buffer size of 1, then we can ensure that it has a true sitting on the channel, which is all we need. So having the following:
select {
case c.signal <- true:
default:
}
will ensure that at least one true is on the channel, and if another is already there it will fall through to the default and continue on. This removes the need for a goroutine.
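A self-contained sketch of this suggestion, with illustrative names (not the actual controller types): a size-1 buffered signal channel, a non-blocking send that falls through when a signal is already pending, and a reader that fetches the latest mapping from the provider state when it wakes up.

```go
package main

import (
	"fmt"
	"sync"
)

// providerState sketches the suggested approach: coalesced signals lose
// nothing because the reader always pulls the current mapping from the state.
type providerState struct {
	mu      sync.Mutex
	mapping map[string]interface{}
	signal  chan bool
}

func (c *providerState) Set(mapping map[string]interface{}) {
	c.mu.Lock()
	c.mapping = mapping
	c.mu.Unlock()

	select {
	case c.signal <- true:
	default: // a signal is already pending; the reader will see the new mapping anyway
	}
}

func (c *providerState) Current() map[string]interface{} {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.mapping
}

func main() {
	c := &providerState{signal: make(chan bool, 1)}
	c.Set(map[string]interface{}{"v": 1})
	c.Set(map[string]interface{}{"v": 2}) // coalesces with the pending signal

	<-c.signal
	fmt.Println(c.Current()) // map[v:2]: the reader always sees the latest state
}
```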
You need a changelog entry: https://github.com/elastic/elastic-agent#changelog
@@ -317,7 +343,8 @@ func (c *dynamicProviderState) AddOrUpdate(id string, priority int, mapping map[
 		mapping:    mapping,
 		processors: processors,
 	}
-	c.signal <- true
+
+	go func() { c.signal <- true }()
Same as above.
@@ -329,32 +356,37 @@ func (c *dynamicProviderState) Remove(id string) {
 	if exists {
 		// existed; remove and signal
 		delete(c.mappings, id)
-		c.signal <- true
+
+		go func() { c.signal <- true }()
Same as above.
Created an issue for the refactor: #2357
Is there any way we could reproduce the deadlock here in a unit test? The lack of test coverage here is most likely part of the reason why we have this bug.
@blakerouse can you do another review pass today? We really want this fix to be in 8.7.
Looks good, and thanks for the changes. I prefer the fall-through of the select to spawning a new goroutine on every notify, thanks!
I am not very familiar with the composable controller yet, so mostly I can determine that the code is very complex and difficult to test. We will follow up on improving that separately. Since we've confirmed that without this fix the agent can deadlock on k8s, let's go ahead and merge it. I'd rather have targeted tests, but at this point in the release cycle we are better served merging this and letting it soak test in the snapshot builds.
* Reduced deadlock in controller on fast k8s events
* changelog
* Update 1678178226-Reduced-deadlocks-in-controller.yaml
* do not signal in a goroutine

(cherry picked from commit 6664441)
@cmacknz which version can we use for this merged feature? 8.7.0?
Yes, this fix was released in v8.7.0. The merge commit "Reduce deadlock in controller" is present in the v8.7.0 tag: https://github.com/elastic/elastic-agent/commits/v8.7.0
What does this PR do?
Modifies the synchronization and debounce mechanism in the controller.
When events are flying in at a fast pace, the debounce resets its timer with each of them and eventually never hits the timer's <-C receive.
It also makes notification slightly buffered. Notification is no longer blocking, so the lock held while a notification is being performed can be released and other events processed faster.
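A minimal sketch of that debounce starvation, assuming a standard reset-on-every-event loop; the actual controller code differs, and the names and 100ms window here are illustrative only.

```go
package main

import (
	"fmt"
	"time"
)

// debounce resets its timer on every incoming event, so a fast enough event
// stream keeps pushing the deadline forward and the <-t.C case never fires,
// meaning the batched work is never flushed. This only illustrates the
// failure mode the PR addresses.
func debounce(events <-chan struct{}, flush func()) {
	const window = 100 * time.Millisecond
	t := time.NewTimer(window)
	defer t.Stop()
	for {
		select {
		case _, ok := <-events:
			if !ok {
				return
			}
			// every event postpones the flush; under a constant stream this
			// branch always wins the race against <-t.C
			if !t.Stop() {
				select {
				case <-t.C:
				default:
				}
			}
			t.Reset(window)
		case <-t.C:
			flush()
			t.Reset(window)
		}
	}
}

func main() {
	events := make(chan struct{})
	go debounce(events, func() { fmt.Println("flush") })

	// events arriving faster than the debounce window: flush never runs
	for i := 0; i < 20; i++ {
		events <- struct{}{}
		time.Sleep(20 * time.Millisecond)
	}
	close(events)
	fmt.Println("done; flush never ran while events kept arriving")
}
```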
Why is it important?
In k8s environments with many pods, the agent can hit a "deadlock" and become totally unresponsive.
Checklist

- I have added an entry in ./changelog/fragments using the changelog tool

Testing this
Please try to break it
Start cloud ES instance
Save this config:
elastic-agent-standalone-kubernetes.yaml
```
# For more information https://www.elastic.co/guide/en/fleet/current/running-on-kubernetes-standalone.html
apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-node-datastreams
  namespace: kube-system
  labels:
    k8s-app: elastic-agent-standalone
data:
  agent.yml: |-
    outputs:
      default:
        type: elasticsearch
        hosts:
          - >-
```
Build cloud image and update
image: docker.elastic.co/beats/elastic-agent:8.6.3-SNAPSHOT
to your image.
Update ES_USERNAME, ES_PASSWORD and ES_HOST
Update to your log path
Run
kubectl apply -f elastic-agent-standalone-kubernetes.yaml
Spin up some containers
for i in {1..50}; do kubectl run redis${i} --image=redis;done
Later on generate events:
Hook into agent pod
kubectl exec --stdin --tty elastic-agent-standalone-{ID} -n kube-system -- /bin/bash
Run
./elastic-agent inspect --variables --variables-wait=2s > /etc/elastic-agent/data/logs/inspect.conf
Check that 64, 66 and 76 are part of the config (they were added during event generation, not the initial config).
Issues
The other thing I'm seeing here is related to filebeat.
I see a lot of events like this:
e.g. today, 23251 events within 30s
Closes: #2269