Fix single rule deletion for NodePortLocal on Linux #6284

Merged
merged 2 commits into from
May 7, 2024

Conversation

antoninbas
Contributor

The logic for deleting an individual NPL mapping was broken. It incorrectly believed that the protocol socket was still in use, and the mapping could never be deleted, putting the NPL controller in an endless error loop.

The State field in ProtocolSocketData was left over from before Antrea v1.7, back when we would always use the same port number for multiple protocols, for a given Pod IP + port. With the current version of the NPL implementation, this field is not needed and should be removed. By removing the field, we avoid the deletion issue.

This patch also ensures that if a rule is only partially cleaned up, we can attempt to delete it again, by making DeleteRule idempotent. To identify that a prior deletion attempt failed, we introduce a "defunct" field in the NPL rule data. If this field is set, the controller knows that the rule has been partially deleted and that deletion needs to be attempted again. Without this, it would be possible for the controller (with the right sequence of updates) to assume that a partially-deleted rule is still valid, which would break the datapath. I plan on improving the NPL code further with a follow-up patch, but in order to keep this patch small (for backporting), I went with the simplest solution I could think of.
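
For readers who want a concrete picture of the idempotent deletion described above, here is a minimal sketch. It is not the Antrea implementation: the `rule`, `fakeDatapath` and `portTable` types, the `Usable` helper, and the port and address values are all hypothetical and only illustrate the idea of marking a rule as defunct before touching the datapath, so that a failed attempt can be detected and retried.

```go
package main

import (
	"errors"
	"fmt"
)

// rule is a simplified, hypothetical stand-in for the NPL rule data.
type rule struct {
	nodePort int
	podAddr  string // "IP:port" of the Pod endpoint
	protocol string
	defunct  bool // true once a deletion attempt has started
}

// fakeDatapath simulates iptables programming: it fails the first
// failuresLeft deletions and succeeds afterwards.
type fakeDatapath struct {
	failuresLeft int
	installed    map[int]bool
}

func (d *fakeDatapath) deleteRule(nodePort int) error {
	if d.failuresLeft > 0 {
		d.failuresLeft--
		return errors.New("simulated iptables failure")
	}
	delete(d.installed, nodePort) // deleting a missing entry is a no-op
	return nil
}

// portTable is a hypothetical cache of rules keyed by node port.
type portTable struct {
	rules map[int]*rule
	dp    *fakeDatapath
}

// DeleteRule can be called repeatedly for the same port. The rule is marked
// defunct before any datapath change, so a failed attempt leaves a visible
// trace in the cache and a later call can finish the job.
func (t *portTable) DeleteRule(nodePort int) error {
	r, ok := t.rules[nodePort]
	if !ok {
		return nil // nothing left to delete
	}
	r.defunct = true
	if err := t.dp.deleteRule(nodePort); err != nil {
		return fmt.Errorf("deleting rule for port %d: %w", nodePort, err)
	}
	delete(t.rules, nodePort)
	return nil
}

// Usable reports whether a rule exists and has not been partially deleted.
func (t *portTable) Usable(nodePort int) bool {
	r, ok := t.rules[nodePort]
	return ok && !r.defunct
}

// newPortTable returns a table with one installed rule and a datapath that
// fails the given number of deletions.
func newPortTable(failures int) *portTable {
	t := &portTable{
		rules: map[int]*rule{},
		dp:    &fakeDatapath{failuresLeft: failures, installed: map[int]bool{}},
	}
	t.rules[61000] = &rule{nodePort: 61000, podAddr: "10.0.0.5:8080", protocol: "tcp"}
	t.dp.installed[61000] = true
	return t
}

func main() {
	t := newPortTable(1)
	fmt.Println(t.DeleteRule(61000) != nil, t.Usable(61000)) // first attempt fails
	fmt.Println(t.DeleteRule(61000), t.Usable(61000))        // retry succeeds
	fmt.Println(t.DeleteRule(61000))                         // further calls are no-ops
}
```

In this sketch the defunct rule stays in the cache after the failed attempt, which is what lets a caller tell "partially deleted" apart from "still valid".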

Fixes #6281

@antoninbas antoninbas requested review from jianjuns and tnqn May 3, 2024 22:11
tnqn
tnqn previously approved these changes May 6, 2024
Member

@tnqn tnqn left a comment

LGTM

}

// If conditionFn is nil, we will assume you are looking for a non-existing annotation.
// If you want to match all, conditionMatchAll as the conditionFn.
Member

Suggested change
- // If you want to match all, conditionMatchAll as the conditionFn.
+ // If you want to match all, use conditionMatchAll as the conditionFn.

antoninbas added 2 commits May 6, 2024 09:55
Signed-off-by: Antonin Bas <antonin.bas@broadcom.com>
Comment on lines +880 to +901
// This function will be executed synchronously when DeleteRule is called for the first time
// and we simulate a failure. It restores the second target port for the Service, which was
// deleted previously, and waits for the change to be reflected in the informer's
// store. After that, we know that the next time the NPL controller processes the test Pod,
// it will need to ensure that both NPL mappings are configured correctly. Because one of
// the rules will be marked as "defunct", it will first need to delete the rule properly
// before adding it back.
restoreServiceTargetPorts := func() {
	testSvc.Spec.Ports = ports
	_, err := testData.k8sClient.CoreV1().Services(defaultNS).Update(context.TODO(), testSvc, metav1.UpdateOptions{})
	if !assert.NoError(t, err) {
		return
	}
	assert.EventuallyWithT(t, func(c *assert.CollectT) {
		obj, exists, err := testData.svcInformer.GetIndexer().GetByKey(testSvc.Namespace + "/" + testSvc.Name)
		if !assert.NoError(t, err) || !assert.True(t, exists) {
			return
		}
		svc := obj.(*corev1.Service)
		assert.Len(t, svc.Spec.Ports, 2)
	}, 2*time.Second, 50*time.Millisecond)
}
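
The failure-injection pattern that the comment above describes can be sketched in isolation. The snippet below is an illustration under stated assumptions, not code from this PR: `flakyDeleter` and its `onFirstFailure` hook are hypothetical names, and the hook simply flips a local variable where the real test restores the Service ports. It shows that the hook runs synchronously, before the simulated error is returned to the caller.

```go
package main

import (
	"errors"
	"fmt"
)

// flakyDeleter is a hypothetical test double: its first Delete call fails,
// after running an optional hook synchronously; later calls succeed.
type flakyDeleter struct {
	calls          int
	onFirstFailure func()
}

func (f *flakyDeleter) Delete(key string) error {
	f.calls++
	if f.calls == 1 {
		if f.onFirstFailure != nil {
			f.onFirstFailure() // runs before the error reaches the caller
		}
		return errors.New("simulated deletion failure")
	}
	return nil
}

func main() {
	desiredPorts := 1 // the second target port was removed earlier
	d := &flakyDeleter{onFirstFailure: func() { desiredPorts = 2 }}

	// The first call fails, but the desired state has already changed by the
	// time the caller sees the error.
	err := d.Delete("10.0.0.5:8080/tcp")
	fmt.Println(err != nil, desiredPorts)

	// A retry succeeds and does not run the hook again.
	err = d.Delete("10.0.0.5:8080/tcp")
	fmt.Println(err == nil, d.calls)
}
```

Because the hook completes before the failure is reported, whatever retries the operation is guaranteed to observe the updated desired state, which is the property the test relies on.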
Contributor Author

@tnqn do you think this is ok? I couldn't think of an alternative, besides moving these tests to pkg/agent/nodeportlocal/k8s and calling handleAddUpdatePod directly, which would have been a bigger change and I wanted to keep this patch small. I am planning to re-organize some of the NPL code in a subsequent PR (that won't be backported), so maybe I can improve this test as part of the re-org PR.

Member

The current test works for me. The only way I can think of to make it simpler is to construct a defunct entry in advance, then verify that it won't be reused and will eventually be deleted once iptables succeeds.
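
A rough sketch of that alternative test shape, assuming a simplified cache: the `entry` and `cache` types and the `ensure` method are hypothetical, the removal step is assumed to succeed, and allocating a fresh port number for the replacement entry is a choice made for this sketch only (the thread does not say whether the real controller reuses the port).

```go
package main

import "fmt"

// entry is a hypothetical cached mapping; defunct marks a partial deletion.
type entry struct {
	nodePort int
	defunct  bool
}

// cache maps a Pod endpoint key to its entry. The removed slice records the
// node ports whose rules were cleaned up, standing in for iptables calls.
type cache struct {
	entries  map[string]*entry
	nextPort int
	removed  []int
}

// ensure returns the node port for key. A defunct entry is never reused: it
// is removed first (assumed to succeed here) and a fresh entry is allocated.
func (c *cache) ensure(key string) int {
	if e, ok := c.entries[key]; ok {
		if !e.defunct {
			return e.nodePort
		}
		c.removed = append(c.removed, e.nodePort)
		delete(c.entries, key)
	}
	e := &entry{nodePort: c.nextPort}
	c.nextPort++
	c.entries[key] = e
	return e.nodePort
}

func main() {
	// Seed the cache with an entry that is already defunct.
	c := &cache{
		entries:  map[string]*entry{"10.0.0.5:8080/tcp": {nodePort: 61000, defunct: true}},
		nextPort: 61001,
	}
	port := c.ensure("10.0.0.5:8080/tcp")
	fmt.Println(port, c.removed) // prints "61001 [61000]"
}
```

Seeding the defunct entry directly removes the need for a failure hook, at the cost of requiring some way for the test to write into the cache.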

Contributor Author

That's a good idea, but it would require a test-specific function (exported) to mark the entry as defunct. I'll keep that in mind for the refactoring.

@antoninbas
Contributor Author

/test-all
/test-windows-all

@antoninbas antoninbas requested a review from tnqn May 7, 2024 03:12
Member

@tnqn tnqn left a comment

LGTM

@antoninbas antoninbas merged commit 78eda7a into antrea-io:main May 7, 2024
58 of 61 checks passed
@antoninbas antoninbas deleted the fix-npl-rule-deletion branch May 7, 2024 17:06
antoninbas added a commit to antoninbas/antrea that referenced this pull request May 7, 2024
antoninbas added a commit to antoninbas/antrea that referenced this pull request May 7, 2024
@antoninbas antoninbas added action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes. labels May 7, 2024
antoninbas added a commit to antoninbas/antrea that referenced this pull request May 7, 2024
antoninbas added a commit to antoninbas/antrea that referenced this pull request May 7, 2024
antoninbas added a commit that referenced this pull request May 8, 2024
antoninbas added a commit that referenced this pull request May 8, 2024
antoninbas added a commit that referenced this pull request May 11, 2024
antoninbas added a commit that referenced this pull request May 13, 2024
@antoninbas antoninbas mentioned this pull request Jun 20, 2024
Successfully merging this pull request may close these issues.

Deletion of individual NodePortLocal (NPL) mapping rules not working correctly