[caclmgrd] Prevent unnecessary iptables updates #5312
Conversation
Hi Joe. When I run the old caclmgrd that I verified with, the install takes ~5 sec for 345 rules. Where is the long delay introduced? This creates a security issue, as every time you run config reload the switch is vulnerable for many seconds. I supplied the old caclmgrd config I am using today when I submitted the issue. Please advise. /Ezrick
Retest vsimage please
The fix has a major performance issue, as can be seen from the comments by Ezrickd. We don't know in which version it started, but as noted, the working/not-working versions of the ACL manager have been provided. The solution should be something a user can use in a reasonable time; 45 seconds for 350 ACLs is not something we should have.
@liat-grozovik: According to @Ezrickd, this patch will help eliminate fluctuation in the time. This fix helps prevent unnecessary iptables updates, so I believe it should be merged. However, there is apparently another underlying cause for the delay that @Ezrickd reported. It appears that notifications are "trickling in" to the application from the underlying Redis connection (~10 per second), rather than being received in a large batch. The root cause of the delay appears to be at a lower layer here.
Do we have this issue opened somewhere?
@lguohan: I have opened an issue here: sonic-net/sonic-swss-common#393
# A brief sleep appears necessary in this loop or any spawned
# update threads will get stuck. Appears to be due to the sel.select() call.
# TODO: Eliminate the need for this sleep.
time.sleep(0.1)
@jleveque Do we still need this? Can we update SELECT_TIMEOUT_MS to take care of this sleep?
I cannot explain the root cause of the need for this sleep. I wish I could. Adjusting SELECT_TIMEOUT_MS doesn't make a difference. It seems to be related to some internal Python context switching in conjunction with the sel.select() call. If we don't sleep a tiny bit here, any spawned update threads will not make progress. It's very strange. By commenting out the sel.select() call and mocking it up, the problem subsided, which is why I feel the sel.select() call is somehow triggering the problem. I was unable to find any similar complaints on the web. This will require further, deep investigation in order to eliminate this sleep.
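For context, here is a minimal sketch of the shape of the notification loop being discussed, assuming the swsscommon Select/SubscriberStateTable Python bindings; the table name, timeout value, and surrounding structure are illustrative, not the exact caclmgrd code:

# Illustrative sketch only -- not the exact caclmgrd loop.
import time
from swsscommon import swsscommon

SELECT_TIMEOUT_MS = 1000

config_db = swsscommon.DBConnector("CONFIG_DB", 0)
acl_rule_table = swsscommon.SubscriberStateTable(config_db, "ACL_RULE")
sel = swsscommon.Select()
sel.addSelectable(acl_rule_table)

while True:
    # Block for up to SELECT_TIMEOUT_MS waiting for a change notification
    (state, selectable) = sel.select(SELECT_TIMEOUT_MS)
    if state == swsscommon.Select.TIMEOUT:
        continue
    if state != swsscommon.Select.OBJECT:
        continue

    (key, op, fvp) = acl_rule_table.pop()
    # ... record the change and notify/spawn the per-namespace update thread ...

    # The brief sleep discussed above: without it, spawned update threads
    # appear not to make progress.
    time.sleep(0.1)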
.format(namespace, new_changes, self.UPDATE_DELAY_SECS))
num_changes = self.num_changes[namespace]
else:
if num_changes == self.num_changes[namespace] and num_changes > 0:
@jleveque Do we need the num_changes > 0 check?
It's basically a safeguard, in case num_changes == 0 and self.num_changes[namespace] == 0. We shouldn't get here in this situation, but if we do, we will hit the else case and log an error message. It could be helpful for debugging.
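As a self-contained illustration of that safeguard, here is a small sketch; the function and parameter names are assumptions made for this sketch, not the exact caclmgrd code:

# Sketch of the safeguard described above; names and logging are assumptions.
def apply_if_quiescent(num_changes, tracked_changes, namespace, apply_rules, log_error):
    # Apply the rules only if no new notifications arrived while sleeping
    # (the tracked counter is unchanged) and at least one change was seen.
    if num_changes == tracked_changes[namespace] and num_changes > 0:
        apply_rules(namespace)
        tracked_changes[namespace] = 0
    else:
        # Should not happen (e.g. both counters are 0); log it to aid debugging.
        log_error("Unexpected state: num_changes={}, tracked_changes[{}]={}"
                  .format(num_changes, namespace, tracked_changes[namespace]))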
LGTM
- Why I did it
When a large number of changes occur to the ACL table of Config DB, caclmgrd gets flooded with notifications. Previously, it would regenerate and apply the iptables rules for each change, which is unnecessary; the iptables rules only need to be applied once, after the last change notification is received. If the ACL table contains a large number of control plane ACL rules, this could cause a large delay before caclmgrd finishes applying the rules.
This patch causes caclmgrd to delay updating the iptables rules until it has not received a change notification for at least 0.5 seconds.
Resolves #5275
- How I did it
When an ACL table change notification is received for a particular namespace, we spawn an update thread for that namespace. The thread sits in a loop: it sleeps 0.5 seconds, then checks whether any new change notifications were received while it was asleep. If new notifications were received, it goes back to sleep and repeats the process until no new notifications arrive during a sleep interval. It then generates and applies the iptables rules. A rough sketch of this logic follows.
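The sketch below illustrates the debounce behavior described above; the class layout, method names, and the update_control_plane_acls placeholder are assumptions for illustration, not the exact caclmgrd implementation (which also includes locking and error handling omitted here):

# Illustrative debounce sketch only; synchronization is omitted for brevity.
import threading
import time

UPDATE_DELAY_SECS = 0.5

class ControlPlaneAclDebouncer:

    def __init__(self):
        self.num_changes = {}      # per-namespace count of received notifications
        self.update_thread = {}    # per-namespace update thread

    def on_acl_change(self, namespace):
        # Called once per ACL table change notification.
        self.num_changes[namespace] = self.num_changes.get(namespace, 0) + 1
        thread = self.update_thread.get(namespace)
        if thread is None or not thread.is_alive():
            thread = threading.Thread(target=self._delayed_update, args=(namespace,))
            self.update_thread[namespace] = thread
            thread.start()

    def _delayed_update(self, namespace):
        # Sleep UPDATE_DELAY_SECS at a time until no new notifications arrive
        # during a sleep interval, then apply the rules exactly once.
        num_changes = 0
        while num_changes != self.num_changes[namespace]:
            num_changes = self.num_changes[namespace]
            time.sleep(UPDATE_DELAY_SECS)
        self.num_changes[namespace] = 0
        self.update_control_plane_acls(namespace)

    def update_control_plane_acls(self, namespace):
        # Placeholder: regenerate and apply the iptables rules for this namespace.
        pass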
- How to verify it
Load a large set of control plane ACL rules (see #5275 for an example)
- Which release branch to backport (provide reason below if selected)
This PR will not cleanly cherry-pick backwards. Separate PRs will be necessary.