-
Notifications
You must be signed in to change notification settings - Fork 525
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
swss: Fixing race condition for rif counters #2488
swss: Fixing race condition for rif counters #2488
Conversation
* Fixing issue #11621 * The cleanup code for stale rif counters are now moved to syncd . Earlier as part of fix for issue sonic-net#2193 the cleanup for stale rif counters was added. * But it could create a race condition between orchagent removes RIF rate counters from DB and lua script fetching them. * So as a fix all such cleanup has been moved to syncd. Signed-off-by: Suman Kumar <suman.kumar@broadcom.com>
|
@Junchao-Mellanox , @kcudnik |
/easycla |
Added @prsunny and @liat-grozovik for merge/related-approval. |
@prsunny @liat-grozovik @kcudnik |
@Junchao-Mellanox @sumanbrcm should we also tagged it for 202205? |
yes. I added tag |
The cleanup code for stale rif counters are now moved to syncd . Earlier as part of fix for issue #2193 the cleanup for stale rif counters was added. But it could create a race condition between orchagent removes RIF rate counters from DB and lua script fetching them. So as a fix all such cleanup has been moved to syncd. - What I did Fix for sonic-net/sonic-buildimage#11621 As a past fix which aimed at removing stale rif counters (#2199) , there is a chance of race condition and it leads to lua script reporting error. To handle this , the rif counters cleanup code(handled in cleanUpRifFromCounterDb) is now called from syncd ( removeCounter ) to avoid such race condition. - Why I did it The operations in Orchagent and syncd is not synchronous, so while Orchagent deletes the rif counters from Counters Db, the syncd could still access it. In race conditions the lua script trying to fetch rif counters will have errors syslog for such access as it was already deleted by orchagent. The cleanup code is removed from orchagent is added in syncd - it will make sure no such race condition would get hit. - How I verified it Followed the steps in (sonic-net/sonic-buildimage#11621) : Create RIF in SONiC, wait till RIF rates are populated in COUNTERS DB Remove RIF Repeat the steps multiple times and check if any error syslog is seen (No error syslog is seen) Also checked cleanup for rif counters. After RIF creation derived info of oid for RIF from "COUNTERS_RIF_NAME_MAP" 127) "Vlan100" 128) "oid:0x6000000000aa5" Checked all the tabled in COUNTER_DB which has same OID in keys 127.0.0.1:6379[2]> keys 6000000000aa5 1) "RATES:oid:0x6000000000aa5:RIF" 2) "COUNTERS:oid:0x6000000000aa5" 3) "RATES:oid:0x6000000000aa5" 127.0.0.1:6379[2]> Deleted the RIF by removing the ip on the intf. Checked COUNTER_DB again with same OID if there are stale entries or not. No stale entries exist now. 127.0.0.1:6379[2]> keys 6000000000aa5 (empty array) 127.0.0.1:6379[2]> Signed-off-by: Suman Kumar <suman.kumar@broadcom.com>
The cleanup code for stale rif counters are now moved to syncd . Earlier as part of fix for issue #2193 the cleanup for stale rif counters was added.
But it could create a race condition between orchagent removes RIF rate counters from DB and lua script fetching them.
So as a fix all such cleanup has been moved to syncd.
What I did
Fix for sonic-net/sonic-buildimage#11621
As a past fix which aimed at removing stale rif counters (#2199) , there is a chance of race condition and it leads to lua script reporting error.
To handle this , the rif counters cleanup code(handled in cleanUpRifFromCounterDb) is now called from syncd ( removeCounter ) to avoid such race condition.
Why I did it
The operations in Orchagent and syncd is not synchronous, so while Orchagent deletes the rif counters from Counters Db , the syncd could still access it. In race conditions the lua script trying to fetch rif counters will have errors syslog for such access as it was already deleted by orchagent. The cleanup code is removed from orchagent is added in syncd - it will make sure no such race condition would get hit.
How I verified it
Followed the steps in (sonic-net/sonic-buildimage#11621) :
Also checked cleanup for rif counters.
After RIF creation derived info of oid for RIF from "COUNTERS_RIF_NAME_MAP"
127) "Vlan100"
128) "oid:0x6000000000aa5"
Checked all the tabled in COUNTER_DB which has same OID in keys
127.0.0.1:6379[2]> keys 6000000000aa5
1) "RATES:oid:0x6000000000aa5:RIF"
2) "COUNTERS:oid:0x6000000000aa5"
3) "RATES:oid:0x6000000000aa5"
127.0.0.1:6379[2]>
Deleted the RIF by removing the ip on the intf.
Checked COUNTER_DB again with same OID if there are stale entries or not. No stale entries exist now.
127.0.0.1:6379[2]> keys 6000000000aa5
(empty array)
127.0.0.1:6379[2]>
Details if related