Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlock caused by acquiring priorityMutex twice #1186

Merged
merged 1 commit into from
Sep 2, 2020

Conversation

Dyanngg
Copy link
Contributor

@Dyanngg Dyanngg commented Sep 1, 2020

This PR fixes #1181

In NetworkPolicy reconciler, uninstallOFRule() is called in two circumstances:

  1. A NetworkPolicy rule is deleted by user, in which case it is called thru Forget().
  2. There has been updates to the original appliedTo pods of the NetworkPolicy rules, specifically delete events. If such delete event causes a ofRule's appliedTo to become empty, that ofRule should be removed following the logic
    Reconcile(), L199 --> update(), L483 --> uninstallOFRule(), L589.
    The priorityMutex of a CNP Table, which prevents concurrent priority updates within a single table, should be held at the Reconcile() and Forget() function level. Before this fix, the mutex is acquired at both Reconcile() and uninstallOFRule() in the second scenario mentioned above, causing a deadlock for that specific reconciler.

UT is added to test this second scenario:
1). Create a NP ingress rule on port "http" that applied to the following pods:

  • "pod1", "ns1", v1beta1.NamedPort{Name: "http", Protocol: v1beta1.ProtocolTCP, Port: 80}
  • "pod3", "ns1", v1beta1.NamedPort{Name: "http", Protocol: v1beta1.ProtocolTCP, Port: 443}
    as a result, two ofRules should be installed based on the port name resolution.
  1. Update appliedTo group in the ingress rule to simulate pod3 deletion event.
  2. Reconciler reconciles the rule. It should call uninstallOFRule() for the ofRule with service 443 without hitting deadlock.

@antrea-bot
Copy link
Collaborator

Thanks for your PR.
Unit tests and code linters are run automatically every time the PR is updated.
E2e, conformance and network policy tests can only be triggered by a member of the vmware-tanzu organization. Regular contributors to the project should join the org.

The following commands are available:

  • /test-e2e: to trigger e2e tests.
  • /skip-e2e: to skip e2e tests.
  • /test-conformance: to trigger conformance tests.
  • /skip-conformance: to skip conformance tests.
  • /test-whole-conformance: to trigger all conformance tests on linux.
  • /skip-whole-conformance: to skip all conformance tests on linux.
  • /test-networkpolicy: to trigger networkpolicy tests.
  • /skip-networkpolicy: to skip networkpolicy tests.
  • /test-windows-conformance: to trigger windows conformance tests.
  • /skip-windows-conformance: to skip windows conformance tests.
  • /test-windows-networkpolicy: to trigger windows networkpolicy tests.
  • /skip-windows-networkpolicy: to skip windows networkpolicy tests.
  • /test-hw-offload: to trigger ovs hardware offload test.
  • /skip-hw-offload: to skip ovs hardware offload test.
  • /test-all: to trigger all tests (except whole conformance).
  • /skip-all: to skip all tests (except whole conformance).

@codecov-commenter
Copy link

codecov-commenter commented Sep 1, 2020

Codecov Report

Merging #1186 into master will increase coverage by 0.19%.
The diff coverage is 80.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1186      +/-   ##
==========================================
+ Coverage   56.07%   56.26%   +0.19%     
==========================================
  Files         106      106              
  Lines       11539    11536       -3     
==========================================
+ Hits         6470     6491      +21     
+ Misses       4501     4478      -23     
+ Partials      568      567       -1     
Flag Coverage Δ
#integration-tests 47.44% <ø> (+0.03%) ⬆️
#unit-tests 41.67% <80.00%> (+0.20%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/agent/controller/networkpolicy/reconciler.go 72.22% <80.00%> (+4.29%) ⬆️
pkg/agent/util/iptables/iptables.go 0.00% <0.00%> (-5.27%) ⬇️
pkg/agent/util/sysctl/sysctl_linux.go 0.00% <0.00%> (ø)
pkg/ovs/openflow/ofctrl_bridge.go 72.90% <0.00%> (+0.66%) ⬆️
pkg/agent/controller/networkpolicy/priority.go 89.86% <0.00%> (+1.35%) ⬆️

@Dyanngg Dyanngg force-pushed the fix-deadlock branch 2 times, most recently from f1bf917 to f3be484 Compare September 1, 2020 00:55
@Dyanngg
Copy link
Contributor Author

Dyanngg commented Sep 1, 2020

/test-all

@Dyanngg
Copy link
Contributor Author

Dyanngg commented Sep 1, 2020

/test-hw-offload

Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @Dyanngg for the quick fix.

@Dyanngg
Copy link
Contributor Author

Dyanngg commented Sep 1, 2020

/test-conformance /test-e2e /test-networkpolicy

@abhiraut
Copy link
Contributor

abhiraut commented Sep 1, 2020

thanks for the detailed issue report ..

@Dyanngg
Copy link
Contributor Author

Dyanngg commented Sep 2, 2020

/test-conformance /test-e2e /test-networkpolicy

1 similar comment
@lzhecheng
Copy link
Contributor

/test-conformance /test-e2e /test-networkpolicy

@antoninbas antoninbas merged commit 8117fe7 into antrea-io:master Sep 2, 2020
antoninbas pushed a commit to antoninbas/antrea that referenced this pull request Sep 3, 2020
antoninbas pushed a commit to antoninbas/antrea that referenced this pull request Sep 3, 2020
antoninbas pushed a commit to antoninbas/antrea that referenced this pull request Sep 3, 2020
GraysonWu pushed a commit to GraysonWu/antrea that referenced this pull request Sep 22, 2020
@Dyanngg Dyanngg deleted the fix-deadlock branch September 30, 2020 23:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

network policy doesn't work (even deny all)
8 participants