
Reduce deadlock in controller #2349

Merged
merged 7 commits into elastic:main on Mar 10, 2023

Conversation

michalpristas
Contributor

@michalpristas michalpristas commented Mar 6, 2023

What does this PR do?

Modifies the synchronization and debounce mechanism in the controller.
When events arrive at a fast pace, the debounce resets its timer with each of them and the timer's <- .C receive is never reached.
It also makes the notification slightly buffered. Notification is no longer blocking, so the lock held while a notification is being performed can be released and other events processed faster.
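
For illustration only (this is not the PR's actual code): a minimal Go sketch of a debounce that cannot be starved, combined with a buffered, non-blocking notification. The names coalesce, events, and notify are invented for this example.

```go
package main

import (
	"fmt"
	"time"
)

// coalesce arms the debounce timer once per burst instead of resetting it on
// every event, so the <-fire receive is always reached even under a constant
// event stream. The notify send is non-blocking against a buffered channel,
// so a slow consumer never blocks the producer.
func coalesce(events <-chan struct{}, notify chan<- struct{}, window time.Duration) {
	var fire <-chan time.Time // nil disables the timer case in the select below
	for {
		select {
		case <-events:
			if fire == nil { // first event of a burst: arm the timer once
				fire = time.NewTimer(window).C
			}
		case <-fire:
			fire = nil
			select { // buffered, non-blocking notification
			case notify <- struct{}{}:
			default: // a signal is already pending; coalesce
			}
		}
	}
}

func main() {
	events := make(chan struct{})
	notify := make(chan struct{}, 1)
	go coalesce(events, notify, 100*time.Millisecond)

	// Flood with events faster than the debounce window.
	go func() {
		for {
			events <- struct{}{}
			time.Sleep(10 * time.Millisecond)
		}
	}()

	<-notify // still fires, because later events no longer reset the timer
	fmt.Println("notified despite a constant event stream")
}
```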

Why is it important?

In Kubernetes environments with many pods, the agent can hit a "deadlock" and become totally unresponsive.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Testing this

Please try to break it

Start a cloud ES instance.
Save this config:

elastic-agent-standalone-kubernetes.yaml ``` # For more information https://www.elastic.co/guide/en/fleet/current/running-on-kubernetes-standalone.html apiVersion: v1 kind: ConfigMap metadata: name: agent-node-datastreams namespace: kube-system labels: k8s-app: elastic-agent-standalone data: agent.yml: |- outputs: default: type: elasticsearch hosts: - >- ${ES_HOST} username: ${ES_USERNAME} password: ${ES_PASSWORD} agent: logging.level: info monitoring: enabled: true use_output: default logs: true metrics: false inputs: - name: container-log id: container-log-1-${kubernetes.pod.name}-${kubernetes.container.id} type: filestream use_output: default meta: package: name: kubernetes version: 1.9.0 data_stream: namespace: default streams: - data_stream: dataset: kubernetes.container_logs type: logs condition: '${kubernetes.container.image} == "redis"' prospector.scanner.symlinks: true parsers: - container: ~ paths: - /var/log/containers/*${kubernetes.container.id}.log --- # For more information refer https://www.elastic.co/guide/en/fleet/current/running-on-kubernetes-standalone.html apiVersion: apps/v1 kind: DaemonSet metadata: name: elastic-agent-standalone namespace: kube-system labels: app: elastic-agent-standalone spec: selector: matchLabels: app: elastic-agent-standalone template: metadata: labels: app: elastic-agent-standalone spec: # Tolerations are needed to run Elastic Agent on Kubernetes control-plane nodes. # Agents running on control-plane nodes collect metrics from the control plane components (scheduler, controller manager) of Kubernetes tolerations: - key: node-role.kubernetes.io/control-plane effect: NoSchedule - key: node-role.kubernetes.io/master effect: NoSchedule serviceAccountName: elastic-agent-standalone hostNetwork: true dnsPolicy: ClusterFirstWithHostNet # Uncomment if using hints feature #initContainers: # - name: k8s-templates-downloader # image: busybox:1.28 # command: ['sh'] # args: # - -c # - >- # mkdir -p /etc/elastic-agent/inputs.d && # wget -O - https://github.com/elastic/elastic-agent/archive/8.6.tar.gz | tar xz -C /etc/elastic-agent/inputs.d --strip=5 "elastic-agent-main/deploy/kubernetes/elastic-agent-standalone/templates.d" # volumeMounts: # - name: external-inputs # mountPath: /etc/elastic-agent/inputs.d containers: - name: elastic-agent-standalone image: docker.elastic.co/beats/elastic-agent:8.6.3-SNAPSHOT # docker.elastic.co/beats/elastic-agent:8.6.2 args: [ "-c", "/etc/elastic-agent/agent.yml", "-e", ] env: # The basic authentication username used to connect to Elasticsearch # This user needs the privileges required to publish events to Elasticsearch. 
- name: ES_USERNAME value: "elastic" # The basic authentication password used to connect to Elasticsearch - name: ES_PASSWORD value: "redacted" # The Elasticsearch host to communicate with - name: ES_HOST value: "redacted:443" - name: NODE_NAME valueFrom: fieldRef: fieldPath: spec.nodeName - name: POD_NAME valueFrom: fieldRef: fieldPath: metadata.name - name: STATE_PATH value: "/etc/elastic-agent" securityContext: runAsUser: 0 resources: limits: memory: 700Mi requests: cpu: 100m memory: 400Mi volumeMounts: - name: datastreams mountPath: /etc/elastic-agent/agent.yml readOnly: true subPath: agent.yml - name: agentlogs mountPath: /etc/elastic-agent/data/logs # Uncomment if using hints feature #- name: external-inputs # mountPath: /etc/elastic-agent/inputs.d - name: proc mountPath: /hostfs/proc readOnly: true - name: cgroup mountPath: /hostfs/sys/fs/cgroup readOnly: true - name: varlibdockercontainers mountPath: /var/lib/docker/containers readOnly: true - name: varlog mountPath: /var/log readOnly: true - name: etc-full mountPath: /hostfs/etc readOnly: true - name: var-lib mountPath: /hostfs/var/lib readOnly: true volumes: - name: datastreams configMap: defaultMode: 0640 name: agent-node-datastreams # Uncomment if using hints feature #- name: external-inputs # emptyDir: {} - name: agentlogs hostPath: path: /Users/michal/k8slogs - name: proc hostPath: path: /proc - name: cgroup hostPath: path: /sys/fs/cgroup - name: varlibdockercontainers hostPath: path: /var/lib/docker/containers - name: varlog hostPath: path: /var/log # The following volumes are needed for Cloud Security Posture integration (cloudbeat) # If you are not using this integration, then these volumes and the corresponding # mounts can be removed. - name: etc-full hostPath: path: /etc - name: var-lib hostPath: path: /var/lib --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding metadata: name: elastic-agent-standalone subjects: - kind: ServiceAccount name: elastic-agent-standalone namespace: kube-system roleRef: kind: ClusterRole name: elastic-agent-standalone apiGroup: rbac.authorization.k8s.io --- apiVersion: rbac.authorization.k8s.io/v1 kind: RoleBinding metadata: namespace: kube-system name: elastic-agent-standalone subjects: - kind: ServiceAccount name: elastic-agent-standalone namespace: kube-system roleRef: kind: Role name: elastic-agent-standalone apiGroup: rbac.authorization.k8s.io --- apiVersion: rbac.authorization.k8s.io/v1 kind: RoleBinding metadata: name: elastic-agent-standalone-kubeadm-config namespace: kube-system subjects: - kind: ServiceAccount name: elastic-agent-standalone namespace: kube-system roleRef: kind: Role name: elastic-agent-standalone-kubeadm-config apiGroup: rbac.authorization.k8s.io --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole metadata: name: elastic-agent-standalone labels: k8s-app: elastic-agent-standalone rules: - apiGroups: [""] resources: - nodes - namespaces - events - pods - services - configmaps # Needed for cloudbeat - serviceaccounts - persistentvolumes - persistentvolumeclaims verbs: ["get", "list", "watch"] # Enable this rule only if planing to use kubernetes_secrets provider #- apiGroups: [""] # resources: # - secrets # verbs: ["get"] - apiGroups: ["extensions"] resources: - replicasets verbs: ["get", "list", "watch"] - apiGroups: ["apps"] resources: - statefulsets - deployments - replicasets - daemonsets verbs: ["get", "list", "watch"] - apiGroups: ["batch"] resources: - jobs - cronjobs verbs: ["get", "list", "watch"] - apiGroups: - "" resources: - 
nodes/stats verbs: - get # Needed for apiserver - nonResourceURLs: - "/metrics" verbs: - get # Needed for cloudbeat - apiGroups: ["rbac.authorization.k8s.io"] resources: - clusterrolebindings - clusterroles - rolebindings - roles verbs: ["get", "list", "watch"] # Needed for cloudbeat - apiGroups: ["policy"] resources: - podsecuritypolicies verbs: ["get", "list", "watch"] - apiGroups: [ "storage.k8s.io" ] resources: - storageclasses verbs: [ "get", "list", "watch" ] --- apiVersion: rbac.authorization.k8s.io/v1 kind: Role metadata: name: elastic-agent-standalone # Should be the namespace where elastic-agent is running namespace: kube-system labels: k8s-app: elastic-agent-standalone rules: - apiGroups: - coordination.k8s.io resources: - leases verbs: ["get", "create", "update"] --- apiVersion: rbac.authorization.k8s.io/v1 kind: Role metadata: name: elastic-agent-standalone-kubeadm-config namespace: kube-system labels: k8s-app: elastic-agent-standalone rules: - apiGroups: [""] resources: - configmaps resourceNames: - kubeadm-config verbs: ["get"] --- apiVersion: v1 kind: ServiceAccount metadata: name: elastic-agent-standalone namespace: kube-system labels: k8s-app: elastic-agent-standalone --- ```

Build a cloud image and update image: docker.elastic.co/beats/elastic-agent:8.6.3-SNAPSHOT to point to your image.
Update ES_USERNAME, ES_PASSWORD, and ES_HOST.
Update the agentlogs volume to your own log path:

        - name: agentlogs
          hostPath:
            path: /Users/{USER}/k8slogs

Run kubectl apply -f elastic-agent-standalone-kubernetes.yaml
Spin up some containers: for i in {1..50}; do kubectl run redis${i} --image=redis; done
Later on, generate events:

ID=2;kubectl delete pod redis$ID;kubectl run redis$ID --image=redis; sleep .3
ID=3;kubectl delete pod redis$ID;kubectl run redis$ID --image=redis; sleep .3
ID=8;kubectl delete pod redis$ID;kubectl run redis$ID --image=redis; sleep .3
ID=22;kubectl delete pod redis$ID;kubectl run redis$ID --image=redis; sleep .3
ID=23;kubectl delete pod redis$ID;kubectl run redis$ID --image=redis; sleep .3
ID=14;kubectl delete pod redis$ID;kubectl run redis$ID --image=redis; sleep .3
ID=16;kubectl delete pod redis$ID;kubectl run redis$ID --image=redis; sleep .3
ID=32;kubectl delete pod redis$ID;kubectl run redis$ID --image=redis; sleep .3
ID=33;kubectl delete pod redis$ID;kubectl run redis$ID --image=redis; sleep .3
ID=38;kubectl delete pod redis$ID;kubectl run redis$ID --image=redis; sleep .3
ID=11;kubectl delete pod redis$ID;kubectl run redis$ID --image=redis; sleep .3
ID=66;kubectl delete pod redis$ID;kubectl run redis$ID --image=redis; sleep .3
ID=76;kubectl delete pod redis$ID;kubectl run redis$ID --image=redis; sleep .3
ID=64;kubectl delete pod redis$ID;kubectl run redis$ID --image=redis; sleep .3

Hook into the agent pod: kubectl exec --stdin --tty elastic-agent-standalone-{ID} -n kube-system -- /bin/bash
Run ./elastic-agent inspect --variables --variables-wait=2s > /etc/elastic-agent/data/logs/inspect.conf
Check that redis64, redis66, and redis76 are part of the config (they were added during event generation, not in the initial config)

Issues

The other things I'm seeing here are related to Filebeat.
I see a lot of events like this:

cannot lock 'filestream::.global::native::5510748-65025', will not update registry for it
Failed to publish event: attempt to use a closed processor

e.g. today, 23251 such events within 30s:

{"log.level":"error","@timestamp":"2023-03-06T09:48:54.695Z","message":"Failed to publish event: attempt to use a closed processor"
...
{"log.level":"error","@timestamp":"2023-03-06T09:49:14.594Z","message":"Failed to publish event: attempt to use a closed processor"

Closes: #2269

@michalpristas michalpristas added the bug, Team:Elastic-Agent-Control-Plane, Team:Elastic-Agent, backport-v8.6.0, and backport-v8.7.0 labels on Mar 6, 2023
@michalpristas michalpristas requested a review from a team as a code owner March 6, 2023 13:37
@michalpristas michalpristas self-assigned this Mar 6, 2023
@michalpristas michalpristas requested review from michel-laterman and pchila and removed request for a team March 6, 2023 13:37
@elasticmachine
Contributor

elasticmachine commented Mar 6, 2023

💚 Build Succeeded


Build stats

  • Start Time: 2023-03-10T09:38:46.869+0000

  • Duration: 20 min 5 sec

Test stats 🧪

Test Results
Failed 0
Passed 5003
Skipped 15
Total 5018

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages.

  • run integration tests : Run the Elastic Agent Integration tests.

  • run end-to-end tests : Generate the packages and run the E2E Tests.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Contributor

elasticmachine commented Mar 6, 2023

🌐 Coverage report

Name Metrics % (covered/total) Diff
Packages 98.387% (61/62) 👍
Files 68.868% (146/212) 👍
Classes 67.98% (276/406) 👍
Methods 53.426% (850/1591) 👍
Lines 38.846% (9421/24252) 👍 0.078
Conditionals 100.0% (0/0) 💚

@jlind23
Contributor

jlind23 commented Mar 6, 2023

@michalpristas does it relate to any existing issue?

Member

@pchila pchila left a comment

Left a couple of open questions, but in general I feel like we need to refactor the interaction among the goroutines a bit.
We are using channels for notification but we pull the data from the provider directly, and that makes the code we have to write awkward in places.

As for the PR, I didn't see any major flaw/bug, but before approving I just need someone to check that we are not changing some side effect we rely on by no longer synchronizing readers and writers on a channel.


// wait for all providers to stop (but its possible they still send notifications over notify
// channel, and we cannot block them sending)
emptyChan, emptyCancel := context.WithCancel(context.Background())
Member

I am not entirely sure what the purpose of having a cancellable context is when we call cancel upon leaving the function scope...
Is it to try and consume as many notifications as possible up to the last possible moment once the providers signal that they are done?
If that is the idea, wouldn't using a simple channel to stop the consuming goroutine suffice instead of creating a context?

(It was already present in the previous version so it's not critical for the fix, just food for thought)
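
For reference, a small runnable sketch of that suggestion, assuming the goal is only to keep draining notifications until the providers have stopped. The notify and stopDraining names are illustrative, not from the PR.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	notify := make(chan struct{})       // providers send notifications here
	stopDraining := make(chan struct{}) // plays the role of the cancellable context

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for {
			select {
			case <-notify: // keep consuming so providers never block on send
			case <-stopDraining:
				return
			}
		}
	}()

	// Simulate providers still emitting notifications while shutting down.
	for i := 0; i < 3; i++ {
		notify <- struct{}{}
	}
	time.Sleep(10 * time.Millisecond) // providers have stopped by now (simulated)

	close(stopDraining) // equivalent to calling emptyCancel() when leaving the function
	wg.Wait()
	fmt.Println("drain goroutine stopped")
}
```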

default:
// c.ch is size of 1, nothing is reading and there's already a signal
select {
case <-c.ch:
Member

This is a strange pattern: we are consuming from the same channel we are producing on because we didn't manage to write another value.
I understand why it's here, but it shows that we have some big sync issues if we have to solve them this way.

Contributor Author

I agree with the refactoring option as a follow-up. I'd rather have notifications in the form of .NET-style events, where the producer does not need to consider readers.
As we talked about, I wanted this PR to be as non-intrusive as possible so we can spend proper time on the refactor without time pressure from production issues.

Member

Create an issue for the refactor so we don't forget to do it please :)

Contributor

What is the possibility that this results in a vars set being lost? I'm confused about why this is really needed. There should always be a reader, and if there is not, the controller should block to ensure that all vars are read.

Contributor Author

The idea behind it is that when the controller is handling something else, this send becomes blocking.
While it is blocked, no new vars are processed and stored in mappings, because the loop is stuck here. Draining helps the controller get the most recent mappings based on actual events.

Contributor

I just worry that if we empty the channel like this and the reader then starts reading again, the set of variables will have been removed and the reader will not get the variables from the channel.

Contributor Author

That should not happen: when we drain, we don't cancel the loop, and we push the vars again in the next cycle.
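
To make the thread above concrete, here is a standalone sketch of the drain-and-replace pattern being discussed (a channel of capacity 1 where a stale value is dropped so the reader always sees the latest vars). The []string vars type and the sendLatest name are illustrative, not the controller's real API.

```go
package main

import "fmt"

// sendLatest never blocks: ch has capacity 1, and if a stale value is already
// queued it is dropped so the (possibly slow) reader always gets the most
// recent vars on its next read.
func sendLatest(ch chan []string, vars []string) {
	select {
	case ch <- vars: // reader keeps up, or the buffer slot is free
	default:
		// Buffer is full and nothing is reading: drop the stale value...
		select {
		case <-ch:
		default: // ...unless the reader grabbed it in the meantime
		}
		// ...and queue the fresh one, still without blocking, so the
		// caller's lock is never held while waiting on a slow reader.
		select {
		case ch <- vars:
		default:
		}
	}
}

func main() {
	ch := make(chan []string, 1)
	sendLatest(ch, []string{"redis1"})
	sendLatest(ch, []string{"redis1", "redis64"}) // replaces the unread value
	fmt.Println(<-ch)                             // prints the latest set only
}
```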

@@ -256,7 +281,8 @@ func (c *contextProviderState) Set(mapping map[string]interface{}) error {
return nil
}
c.mapping = mapping
c.signal <- true

go func() { c.signal <- true }()
Member

Do we try to write multiple times to c.signal as well? Or are there cases when there's no reader ready?
Is that the reason we write into the channel from a goroutine?
What about returning without this synchronization point? Are there any side effects?

Contributor Author

Yes, both can be true, but mostly multiple writes; this is triggered on the OnAdd event.

Contributor

What is the point of moving this into a goroutine? I would prefer not to create a goroutine on every change, which will allocate a 4k stack for every little event.

What other options do we have in this path? What is the real reason for this change?

Would the following work without creating a goroutine:

select {
case c.signal <- true:
default:
}

Contributor Author

Moving this into a goroutine makes sure the signal is not lost somewhere.
When the controller is handling other stuff, like a config, writing to signal would fail and the signal would be lost.
This way the signal is not lost, and the goroutines are quickly destroyed when the channel is drained.

Contributor Author

@michalpristas michalpristas Mar 8, 2023

The reason for this is that when notify is not read right away, this send hangs. While it hangs, the lock is not released, and even the OnAdd event receiver gets stuck acquiring the lock.
notify is read in a different loop and depends on the coordinator reading a value, so a slow coordinator makes this stuck and creates a lot of hanging OnAdd handlers waiting for locks.
I can imagine a large number of hanging OnAdd handlers producing out-of-order events as well.

Contributor

If c.signal has a buffer size of 1, then we can ensure that it has a true sitting on the channel, which is all we need. So having the following:

select {
case c.signal <- true:
default:
}

will ensure that at least one true is on the channel; if another is already there, it will fall through to the default and continue on. This removes the need for a goroutine.
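
A runnable sketch of this suggestion; the state struct here is an illustrative stand-in, not the PR's actual provider state:

```go
package main

import (
	"fmt"
	"sync"
)

// state is an illustrative stand-in for the provider state discussed above.
type state struct {
	mu      sync.Mutex
	mapping map[string]interface{}
	signal  chan bool // capacity 1
}

func (c *state) Set(mapping map[string]interface{}) {
	c.mu.Lock()
	c.mapping = mapping
	c.mu.Unlock()

	// Non-blocking send: with a buffer of 1, at most one "true" is ever
	// pending, which is all the consumer needs to know that something
	// changed. Extra signals during a burst fall through to default.
	select {
	case c.signal <- true:
	default:
	}
}

func main() {
	c := &state{signal: make(chan bool, 1)}
	for i := 0; i < 1000; i++ { // a burst of OnAdd-style events
		c.Set(map[string]interface{}{"i": i})
	}
	<-c.signal // exactly one pending signal; the latest mapping is in c.mapping
	c.mu.Lock()
	fmt.Println("latest:", c.mapping["i"])
	c.mu.Unlock()
}
```

The trade-off versus the goroutine version is that a burst collapses into a single wake-up, which is what the consumer needs here: it re-reads the full mapping under the lock rather than consuming one value per signal.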

@cmacknz
Member

cmacknz commented Mar 6, 2023

You need a changelog entry https://github.com/elastic/elastic-agent#changelog


@@ -317,7 +343,8 @@ func (c *dynamicProviderState) AddOrUpdate(id string, priority int, mapping map[
mapping: mapping,
processors: processors,
}
c.signal <- true

go func() { c.signal <- true }()
Contributor

Same as above.

@@ -329,32 +356,37 @@ func (c *dynamicProviderState) Remove(id string) {
if exists {
// existed; remove and signal
delete(c.mappings, id)
c.signal <- true

go func() { c.signal <- true }()
Contributor

Same as above.

@michalpristas
Contributor Author

Created an issue for the refactor: #2357

@cmacknz
Member

cmacknz commented Mar 10, 2023

Is there any way we could reproduce the deadlock here in a unit test? The lack of test coverage here is most likely part of the reason we have this bug.
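
Possibly a starting point: a hypothetical unit test that exercises the signal-coalescing pattern in isolation rather than the controller's real API (which isn't shown on this page); the package name and test name are placeholders. It pins down the "producer must never block" property.

```go
package composable_test

import (
	"testing"
	"time"
)

// TestSignalNotLostUnderBurst is a hypothetical test of the coalescing
// pattern in isolation; it does not use the controller's real types.
func TestSignalNotLostUnderBurst(t *testing.T) {
	signal := make(chan bool, 1) // buffer of 1, as suggested in the review
	done := make(chan struct{})

	// Producer: a fast burst of non-blocking signals, as OnAdd handlers would send.
	go func() {
		defer close(done)
		for i := 0; i < 10000; i++ {
			select {
			case signal <- true:
			default: // a signal is already pending; coalesce
			}
		}
	}()

	select {
	case <-done: // the producer must finish without ever blocking
	case <-time.After(5 * time.Second):
		t.Fatal("producer blocked: potential deadlock")
	}

	// At least one signal must still be pending so the reader wakes up.
	select {
	case <-signal:
	case <-time.After(time.Second):
		t.Fatal("signal lost: reader would never observe the change")
	}
}
```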

@cmacknz
Member

cmacknz commented Mar 10, 2023

@blakerouse can you do another review pass today? We really want this fix to be in 8.7.

Contributor

@blakerouse blakerouse left a comment

Looks good, and thanks for the changes. I prefer the fallthrough of the select to spawning a new goroutine on every notify, thanks!

@cmacknz
Member

cmacknz commented Mar 10, 2023

I am not very familiar with the composable controller yet, so mostly what I can determine is that the code is very complex and difficult to test. We will follow up on improving that separately.

Since we've confirmed that without this fix the agent can deadlock on k8s, let's go ahead and merge it. I'd rather have targeted tests, but at this point in the release cycle we are better served merging this and letting it soak test in the snapshot builds.

@cmacknz cmacknz merged commit 6664441 into elastic:main Mar 10, 2023
mergify bot pushed a commit that referenced this pull request Mar 10, 2023
* Reduced deadlock in controller on fast k8s events

* changelog

* Update 1678178226-Reduced-deadlocks-in-controller.yaml

* do not signal in a goroutine

(cherry picked from commit 6664441)
mergify bot pushed a commit that referenced this pull request Mar 10, 2023
* Reduced deadlock in controller on fast k8s events

* changelog

* Update 1678178226-Reduced-deadlocks-in-controller.yaml

* do not signal in a goroutine

(cherry picked from commit 6664441)
cmacknz pushed a commit that referenced this pull request Mar 10, 2023
cmacknz pushed a commit that referenced this pull request Mar 10, 2023
* Reduced deadlock in controller on fast k8s events

* changelog

* Update 1678178226-Reduced-deadlocks-in-controller.yaml

* do not signal in a goroutine

(cherry picked from commit 6664441)

Co-authored-by: Michal Pristas <michal.pristas@gmail.com>
@insukcho
Contributor

insukcho commented Apr 5, 2023

@cmacknz which version can we use for this merged fix? 8.7.0?

@cmacknz
Member

cmacknz commented Apr 5, 2023

Yes, this fix was released in v8.7.0. The merge commit "Reduce deadlock in controller" is present in the v8.7.0 tag: https://github.com/elastic/elastic-agent/commits/v8.7.0

Labels
backport-v8.6.0, backport-v8.7.0, bug, Team:Elastic-Agent, Team:Elastic-Agent-Control-Plane
Development

Successfully merging this pull request may close these issues.

Elastic Agent 8.6.x standalone deployment in Kubernetes doesn't start monitoring new pods until agent restart
7 participants