Deletion of individual NodePortLocal (NPL) mapping rules not working correctly #6281
Labels
area/proxy/nodeportlocal
Issues or PRs related to the NodePortLocal feature
kind/bug
Categorizes issue or PR as related to a bug.
priority/important-soon
Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Comments
antoninbas added the kind/bug, priority/important-soon, and area/proxy/nodeportlocal labels on May 2, 2024
antoninbas added a commit to antoninbas/antrea that referenced this issue on May 3, 2024

The logic for deleting an individual NPL mapping was broken. It incorrectly believed that the protocol socket was still in use, so the mapping could never be deleted, putting the NPL controller in an endless error loop.

The State field in ProtocolSocketData was left over from before Antrea v1.7, back when we would always use the same port number for multiple protocols, for a given Pod IP + port. With the current version of the NPL implementation, this field is not needed and should be removed. By removing the field, we avoid the deletion issue.

This patch also ensures that if a rule is only partially cleaned up, we can attempt to delete it again, by making DeleteRule idempotent. To identify that a prior deletion attempt failed, we introduce a "defunct" field in the NPL rule data. If this field is set, the controller knows that the rule has been partially deleted and that deletion needs to be attempted again. Without this, it would be possible for the controller (with the right sequence of updates) to assume that a partially deleted rule is still valid, which would break the datapath.

I plan on improving the NPL code further with a follow-up patch, but in order to keep this patch small (for back-porting), I went with the simplest solution I could think of.

Fixes antrea-io#6281

Signed-off-by: Antonin Bas <antonin.bas@broadcom.com>
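The retry behaviour described in this commit message can be sketched roughly as follows. This is a minimal illustration and not the code from the patch: only the DeleteRule name and the "defunct" flag come from the commit message, while the `rule`, `table`, `datapath`, and `flakyDatapath` types and the `GetValid` helper are hypothetical. It also assumes that each cleanup step is itself safe to repeat.

```go
package main

import (
	"errors"
	"fmt"
)

// rule is a hypothetical stand-in for one NPL rule entry (not Antrea's type).
type rule struct {
	podIP    string
	podPort  int
	nodePort int
	defunct  bool // set once deletion has started; the rule is no longer valid
}

// datapath is a hypothetical interface for the two cleanup steps.
// Both steps are assumed to be safe to call more than once.
type datapath interface {
	deleteDNAT(r *rule) error
	closeSocket(r *rule) error
}

type table struct {
	rules map[string]*rule
	dp    datapath
}

func key(podIP string, podPort int) string {
	return fmt.Sprintf("%s:%d", podIP, podPort)
}

// DeleteRule is idempotent: a missing rule is not an error, and a rule whose
// earlier deletion failed partway stays in the table, marked defunct, so that
// the next call repeats the cleanup.
func (t *table) DeleteRule(podIP string, podPort int) error {
	k := key(podIP, podPort)
	r, ok := t.rules[k]
	if !ok {
		return nil // already deleted
	}
	r.defunct = true // mark before touching the datapath
	if err := t.dp.deleteDNAT(r); err != nil {
		return err
	}
	if err := t.dp.closeSocket(r); err != nil {
		return err
	}
	delete(t.rules, k)
	return nil
}

// GetValid returns a rule only if it exists and is not defunct, so callers
// never treat a partially deleted rule as usable.
func (t *table) GetValid(podIP string, podPort int) (*rule, bool) {
	r, ok := t.rules[key(podIP, podPort)]
	if !ok || r.defunct {
		return nil, false
	}
	return r, true
}

// flakyDatapath fails closeSocket a fixed number of times, then succeeds.
type flakyDatapath struct {
	socketFailures int
}

func (f *flakyDatapath) deleteDNAT(r *rule) error { return nil }

func (f *flakyDatapath) closeSocket(r *rule) error {
	if f.socketFailures > 0 {
		f.socketFailures--
		return errors.New("socket close failed")
	}
	return nil
}

func main() {
	tbl := &table{rules: map[string]*rule{}, dp: &flakyDatapath{socketFailures: 1}}
	tbl.rules[key("10.0.0.5", 8080)] = &rule{podIP: "10.0.0.5", podPort: 8080, nodePort: 61000}

	fmt.Println("first attempt:", tbl.DeleteRule("10.0.0.5", 8080))
	_, valid := tbl.GetValid("10.0.0.5", 8080)
	fmt.Println("still valid:", valid)
	fmt.Println("retry:", tbl.DeleteRule("10.0.0.5", 8080))
	fmt.Println("repeat:", tbl.DeleteRule("10.0.0.5", 8080))
}
```

Marking the rule before any cleanup step runs means a failure at any point leaves a record that the rule must not be reused, which is the property the commit message calls out.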
antoninbas added a commit to antoninbas/antrea that referenced this issue on May 6, 2024
antoninbas added a commit that referenced this issue on May 7, 2024
antoninbas added a commit to antoninbas/antrea that referenced this issue on May 7, 2024
antoninbas added a commit to antoninbas/antrea that referenced this issue on May 7, 2024
antoninbas added a commit to antoninbas/antrea that referenced this issue on May 7, 2024
antoninbas added a commit to antoninbas/antrea that referenced this issue on May 7, 2024
antoninbas added a commit that referenced this issue on May 8, 2024
antoninbas added a commit that referenced this issue on May 8, 2024
antoninbas added a commit that referenced this issue on May 11, 2024
antoninbas added a commit that referenced this issue on May 13, 2024
Describe the bug
It is possible to create a situation where the NPL controller in the Antrea Agent cannot delete NPL rules appropriately when needed (on Linux Nodes). The DNAT rule gets deleted correctly from iptables, but the Pod annotation is not updated, and we get stuck in an error loop when trying to handle Pod updates.
To Reproduce
On a cluster running Antrea with NPL enabled, do the following:
Expected
After a short while, the Pod's NPL annotation should be updated to show a single mapping.
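As a rough illustration of the expected result, the sketch below removes one mapping from an annotation value that holds a JSON list of mappings. The JSON field names (`podPort`, `nodeIP`, `nodePort`, `protocol`) and the `mapping` and `removeMapping` names are assumptions made for this example, not taken from this issue, and the code is not Antrea's.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mapping models one entry of the annotation value.
// The JSON field names are assumptions made for this sketch.
type mapping struct {
	PodPort  int    `json:"podPort"`
	NodeIP   string `json:"nodeIP"`
	NodePort int    `json:"nodePort"`
	Protocol string `json:"protocol"`
}

// removeMapping returns the annotation value with the entry matching
// (podPort, protocol) removed; all other entries keep their order.
func removeMapping(value string, podPort int, protocol string) (string, error) {
	var in []mapping
	if err := json.Unmarshal([]byte(value), &in); err != nil {
		return "", err
	}
	out := make([]mapping, 0, len(in))
	for _, m := range in {
		if m.PodPort == podPort && m.Protocol == protocol {
			continue // drop the mapping that is no longer needed
		}
		out = append(out, m)
	}
	b, err := json.Marshal(out)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	before := `[{"podPort":80,"nodeIP":"10.10.10.10","nodePort":61000,"protocol":"tcp"},` +
		`{"podPort":8080,"nodeIP":"10.10.10.10","nodePort":61001,"protocol":"tcp"}]`
	after, err := removeMapping(before, 8080, "tcp")
	if err != nil {
		panic(err)
	}
	fmt.Println(after)
}
```

With two mappings in the input and one removed, the resulting value holds a single mapping, which is the state the annotation is expected to reach.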
Actual behavior
The Pod's NPL annotation is never updated, and the antrea-agent logs (on the Node where the Pod is running) show the following errors:
This issue only arises when we (the NPL controller in the Agent) need to edit an existing NPL Pod annotation to remove one of the rules / mappings. When the Pod is deleted, or when NPL is completely "disabled" for a Pod (all mappings are removed), the issue does not arise. This is because we use different functions to handle these two situations:
antrea/pkg/agent/nodeportlocal/portcache/port_table_others.go, lines 155 to 170 in f9345da
antrea/pkg/agent/nodeportlocal/portcache/port_table_others.go, lines 136 to 153 in f9345da
Versions:
All "recent" Antrea versions.
This issue was reported for Antrea v1.11.3, and I confirmed that the issue is still present in Antrea v2.0.0.