
Fix deadlock when accessing dirtyRules in fqdn controller #5566

Merged: 1 commit merged into antrea-io:main on Oct 16, 2023

Conversation

@Dyanngg (Contributor) commented Oct 10, 2023

Fixes #5565

As described in the issue above, a deadlock could occur when:

  1. Rule sync fails due to a realization error in the dataplane, and the controller
     retries it later even though there are no IP changes for the FQDN.
  2. A single Antrea-native policy rule refers to multiple FQDNs, and the
     fqdn controller receives updated records for these FQDNs at around
     the same time. One of the FQDNs has address updates, so the rule
     needs to be re-synced and is thus marked dirty. Another FQDN does not
     have address updates, but the controller still reads the dirtyRules
     to make sure that a rule which previously failed to sync gets
     re-queued even when there are no address updates.

This PR addresses the deadlock by explicitly scoping the lock around the
get-dirty-rules operation itself (see the illustrative sketch below).
Additional UTs for these potential rule sync scenarios are also added.
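
To make the pattern concrete, below is a minimal Go sketch of the locking bug and the fix. The names (fakeController, onFQDNUpdateBuggy, onFQDNUpdateFixed, getDirtyRules) are hypothetical; this illustrates the non-reentrant mutex self-deadlock described above, not the actual Antrea fqdnController code.

package main

import (
	"fmt"
	"sync"
)

// Hypothetical stand-in for the controller; illustration only.
type fakeController struct {
	mutex      sync.RWMutex
	dirtyRules map[string]struct{}
}

// Buggy pattern: the caller already holds c.mutex when it calls a helper
// that locks c.mutex again. sync.RWMutex is not reentrant, so the second
// acquisition blocks forever and the goroutine deadlocks with itself.
func (c *fakeController) onFQDNUpdateBuggy(rule string, addrChanged bool) {
	c.mutex.Lock()
	defer c.mutex.Unlock()
	if addrChanged {
		c.dirtyRules[rule] = struct{}{}
	}
	_ = c.getDirtyRules() // deadlock: getDirtyRules RLocks the mutex we already hold
}

// Fixed pattern: the lock is scoped to each access of dirtyRules, so no
// lock is held across the call that reads the dirty rules.
func (c *fakeController) onFQDNUpdateFixed(rule string, addrChanged bool) {
	if addrChanged {
		c.mutex.Lock()
		c.dirtyRules[rule] = struct{}{}
		c.mutex.Unlock()
	}
	fmt.Println("dirty rules:", c.getDirtyRules())
}

// getDirtyRules takes the lock only around the read and returns a copy.
func (c *fakeController) getDirtyRules() []string {
	c.mutex.RLock()
	defer c.mutex.RUnlock()
	rules := make([]string, 0, len(c.dirtyRules))
	for r := range c.dirtyRules {
		rules = append(rules, r)
	}
	return rules
}

func main() {
	c := &fakeController{dirtyRules: map[string]struct{}{}}
	c.onFQDNUpdateFixed("rule1", true) // prints: dirty rules: [rule1]
}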

@Dyanngg added the kind/bug (Categorizes issue or PR as related to a bug.) and action/backport (Indicates a PR that requires backports.) labels on Oct 10, 2023
@Dyanngg added this to the Antrea v1.14 release milestone on Oct 10, 2023
@tnqn (Member) commented Oct 11, 2023

@Dyanngg have you managed to reproduce the issue? You mentioned "a deadlock could occur when rule sync fails due to a realization error in the dataplane", but I didn't see such an error when investigating the issue, and I thought it could be triggered even when everything works fine.

@Dyanngg (Contributor, Author) commented Oct 11, 2023

> @Dyanngg have you managed to reproduce the issue? You mentioned "a deadlock could occur when rule sync fails due to a realization error in the dataplane", but I didn't see such an error when investigating the issue, and I thought it could be triggered even when everything works fine.

Thanks for the reminder. I reproduced the deadlock in UT when an OVS error is simulated. Now that you mention it, I realized that the same issue could also happen if a single rule has multiple FQDNs: during the handling of the proactive record updates for these FQDNs, one FQDN response could mark the rule dirty (since the agent has not finished rule sync yet), while the other FQDN response tries to add a subscriber for the same rule, causing a deadlock. I will add a new UT testcase validating this theory, and verify that the deadlock occurs before the fix and does not after the fix.

@Dyanngg force-pushed the fix-fqdn-rule-sync-deadlock branch from 7a4571c to 8393f87 on October 11, 2023 03:08
@Dyanngg (Contributor, Author) commented Oct 11, 2023

@tnqn Please check the updated PR description and the latest added testcase. While the specific scenario is tricky to reproduce in a real setup (it needs two concurrent DNS record updates, one with an address change and one without), the last UT testcase reproduces the deadlock reliably: without the change, it hangs every single time after the syncDirtyRule() calls due to the deadlock.
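
For context on how such a hang can be caught in a unit test, here is a hedged, self-contained sketch of a timeout-based check. stubController and handleFQDNUpdate are hypothetical stand-ins; this is not the actual testcase added to fqdn_test.go in this PR.

package fqdn

import (
	"sync"
	"testing"
	"time"
)

// Minimal stand-in for the controller under test; illustration only.
type stubController struct {
	mutex      sync.RWMutex
	dirtyRules map[string]struct{}
}

func (c *stubController) handleFQDNUpdate(rule string, addrChanged bool) {
	if addrChanged {
		c.mutex.Lock()
		c.dirtyRules[rule] = struct{}{}
		c.mutex.Unlock()
	}
	// Read the dirty rules under a short, separately scoped lock.
	c.mutex.RLock()
	_, isDirty := c.dirtyRules[rule]
	c.mutex.RUnlock()
	_ = isDirty
}

// Run the scenario in a goroutine and fail the test after a timeout; with a
// deadlocked implementation the done channel is never closed, so the test
// reports a failure instead of hanging the whole test binary.
func TestFQDNUpdateDoesNotDeadlock(t *testing.T) {
	c := &stubController{dirtyRules: map[string]struct{}{}}
	done := make(chan struct{})
	go func() {
		// Two record updates for FQDNs behind the same rule: one with an
		// address change, one without, mirroring the scenario above.
		c.handleFQDNUpdate("rule1", true)
		c.handleFQDNUpdate("rule1", false)
		close(done)
	}()
	select {
	case <-done:
		// Completed without hanging.
	case <-time.After(5 * time.Second):
		t.Fatal("FQDN update handling appears to be deadlocked")
	}
}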

@Dyanngg requested a review from GraysonWu on October 11, 2023 03:49
@tnqn added the action/release-note (Indicates a PR that should be included in release notes.) label on Oct 11, 2023
@tnqn (Member) commented Oct 11, 2023

@Dyanngg the code and unit test look good to me. However, could we still try to reproduce the issue in a cluster and see if this completely fixes it, in case there is something else preventing it from working? We have hit several issues related to NetworkPolicy (especially when FQDN rules are used) in the last few months, so we need to be more cautious before delivering one more patch release. I think in theory it can be reproduced by generating two concurrent DNS resolutions of the same stable FQDN: the first resolution's addressUpdate will be true while the second one's will be false.

And I just got an update from the reporter: restarting a workload pod alone could fix the issue, and I haven't figured out why.

@luolanzone (Contributor) left a comment:

a nit

Review comment on pkg/agent/controller/networkpolicy/fqdn_test.go (outdated, resolved)
@tnqn (Member) commented Oct 11, 2023

> And I just got an update from the reporter: restarting a workload pod alone could fix the issue, and I haven't figured out why.

I figured this out: it didn't really recover. The reason the new Pod can resolve domains should be that the realization of the FQDN rule for the new Pod was stuck in r.fqdnController.addFQDNRule due to the same deadlock.

@Dyanngg (Contributor, Author) commented Oct 11, 2023

Trying to reproduce the issue, but no luck so far. Domains that I remember having dynamic IP ranges seem to return fairly stable resolved DNS addresses.

Test YAMLs used:

apiVersion: v1
kind: ReplicationController
metadata:
  name: client
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: client
    spec:
      containers:
      - name: client-worker
        image: praqma/network-multitool
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 443
        command:
        - /bin/sh
        - -c
        - |
          while true; do
          dig bay.camera;
          done
---
apiVersion: v1
kind: ReplicationController
metadata:
  name: client2
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: client
    spec:
      containers:
      - name: client-worker2
        image: praqma/network-multitool
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 443
        command:
        - /bin/sh
        - -c
        - |
          while true; do
          dig google.com;
          done
---
apiVersion: crd.antrea.io/v1alpha1
kind: ClusterNetworkPolicy
metadata:
  name: acnp-multiple-fqdn-per-rule
spec:
  priority: 1
  appliedTo:
  - podSelector:
      matchLabels:
        app: client
  egress:
  - action: Allow
    to:
    - fqdn: "wayfair.com"
    - fqdn: "google.com"
    - fqdn: "bay.camera"
    - fqdn: "amazon.com"
    - fqdn: "medium.com"

@Dyanngg closed this on Oct 11, 2023
@Dyanngg reopened this on Oct 11, 2023
Signed-off-by: Dyanngg <dingyang@vmware.com>
@Dyanngg force-pushed the fix-fqdn-rule-sync-deadlock branch from 8393f87 to 3a28b93 on October 11, 2023 21:07
@Dyanngg (Contributor, Author) commented Oct 11, 2023

/test-all

@tnqn (Member) commented Oct 12, 2023

> Trying to reproduce the issue, but no luck so far. Domains that I remember having dynamic IP ranges seem to return fairly stable resolved DNS addresses.

The test commands you used trigger DNS lookups sequentially, so I think they can't lead to the deadlock.

While trying to construct a test case to reproduce the issue, I found that even concurrent DNS lookups can't trigger it, because packetin events are processed sequentially. And given that there is no error about NetworkPolicy realization, I think the issue is probably more complicated and this patch will likely not fix it completely. I will share my hypothesis in #5565.

@tnqn (Member) left a comment:

LGTM

@tnqn merged commit 62a440a into antrea-io:main on Oct 16, 2023
44 checks passed
@tnqn (Member) commented Oct 16, 2023

@Dyanngg please backport it to 1.11-1.13.

@tnqn (Member) commented Oct 16, 2023

This commit introduces a data race. #5583 will fix it. The cherry-picking PRs need to include the latter commit as well.

Labels
action/backport: Indicates a PR that requires backports.
action/release-note: Indicates a PR that should be included in release notes.
kind/bug: Categorizes issue or PR as related to a bug.
Development

Successfully merging this pull request may close these issues.

A dead lock in DNS response handler
4 participants