-
Notifications
You must be signed in to change notification settings - Fork 531
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Teamd :: fix for cleaning up the teamd processes correctly on teamd docker stop #1159
Conversation
When the teamd docker receives a stop signal, only the processes started by supervisord gets the SIGTERM, so this fix is to propogate the signal to teamd processes via the signal handler in teamsyncd process.
@judyjoseph - this change would address bunch of issues we've been seeing in teamsyncd during config-reload and docker-restart. Thanks again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please check comments
I did testing with config reload and config loadminigraph and seen things are ok. These commands does docker stop and docker start explicitly in a sequence starting with swss docker. What is the use case of docker restart ? are you referring to "docker restart teamd"? Is this command supported ? Currently I see swss, syncd, teamd tightly bound and think it will go to an inconsistent state if we just restart teamd alone. |
I was referring to "systemctl restart teamd". Teamd-docker upgrade (sonic_installer upgrade_docker teamd docker-teamd.gz --cleanup_image -y) issues this command internally. |
… and checking for the flag in the main loop. This way the cleanUp handler is not run in the signal Handler context and can add more Logs as we need not care for signal safety now.
Teammgrd tracks the PID's of the teamd processes and sents the SIGTERM signal when the teamd docker is stopped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please answer on my questions?
cfgmgr/teammgr.cpp
Outdated
} | ||
else | ||
{ | ||
SWSS_LOG_WARN("Cound not find PID corresponding to LaG %s ", it.c_str()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please make it as an error?
Also It's better to say "Can't send TERM signal to LAG %s. PID wasn't found"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good for me
@@ -17,13 +18,24 @@ bool gLogRotate = false; | |||
ofstream gRecordOfs; | |||
string gRecordFile; | |||
|
|||
bool received_sigterm = false; | |||
|
|||
void sig_handler(int signo) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
signo [](start = 21, length = 5)
check signo == SIGTERM first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Signal handler is only registered only for SIGTERM now. The handler will be called only on SIGTERM signal.
@@ -9,6 +10,14 @@ | |||
using namespace std; | |||
using namespace swss; | |||
|
|||
bool received_sigterm = false; | |||
|
|||
void sig_handler(int signo) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
signo [](start = 21, length = 5)
the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Signal handler is only registered only for SIGTERM now. The handler will be called only on SIGTERM signal.
@@ -33,6 +45,12 @@ int main(int argc, char **argv) | |||
s.addSelectable(&netlink); | |||
while (true) | |||
{ | |||
if(received_sigterm) | |||
{ | |||
sync.cleanTeamSync(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[](start = 16, length = 2)
Inconsistent indentation.
…ocker stop (#1159) * Send explicit signal to the teamd processes whenthe teamd docker exits. When the teamd docker receives a stop signal, only the processes started by supervisord gets the SIGTERM, so this fix is to propogate the signal to teamd processes via the signal handler in teamsyncd process. * Updates to take care of boundary conditions in the teamsyncd signal handler. * Better way of signal Handling by setting a flag in the signal handler and checking for the flag in the main loop. This way the cleanUp handler is not run in the signal Handler context and can add more Logs as we need not care for signal safety now. * Updated the logic so that teammgrd controls the lifecycle of teamd. Teammgrd tracks the PID's of the teamd processes and sents the SIGTERM signal when the teamd docker is stopped. * Minor change in the function defenition * Updates based on the comments * Minor update in teammgr.cpp
…ocker stop (#1159) * Send explicit signal to the teamd processes whenthe teamd docker exits. When the teamd docker receives a stop signal, only the processes started by supervisord gets the SIGTERM, so this fix is to propogate the signal to teamd processes via the signal handler in teamsyncd process. * Updates to take care of boundary conditions in the teamsyncd signal handler. * Better way of signal Handling by setting a flag in the signal handler and checking for the flag in the main loop. This way the cleanUp handler is not run in the signal Handler context and can add more Logs as we need not care for signal safety now. * Updated the logic so that teammgrd controls the lifecycle of teamd. Teammgrd tracks the PID's of the teamd processes and sents the SIGTERM signal when the teamd docker is stopped. * Minor change in the function defenition * Updates based on the comments * Minor update in teammgr.cpp
…ocker stop (#1159) * Send explicit signal to the teamd processes whenthe teamd docker exits. When the teamd docker receives a stop signal, only the processes started by supervisord gets the SIGTERM, so this fix is to propogate the signal to teamd processes via the signal handler in teamsyncd process. * Updates to take care of boundary conditions in the teamsyncd signal handler. * Better way of signal Handling by setting a flag in the signal handler and checking for the flag in the main loop. This way the cleanUp handler is not run in the signal Handler context and can add more Logs as we need not care for signal safety now. * Updated the logic so that teammgrd controls the lifecycle of teamd. Teammgrd tracks the PID's of the teamd processes and sents the SIGTERM signal when the teamd docker is stopped. * Minor change in the function defenition * Updates based on the comments * Minor update in teammgr.cpp
Test make sure cleanup happens of Port-channel Kernel devices. This test case track the fixes done by PR: sonic-net/sonic-swss#1407 sonic-net/sonic-swss#1159 Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
* Added the test case for Port Channel cleanup. Test make sure cleanup happens of Port-channel Kernel devices. This test case track the fixes done by PR: sonic-net/sonic-swss#1407 sonic-net/sonic-swss#1159 Signed-off-by: Abhishek Dosi <abdosi@microsoft.com> * Address Review Comments Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
sonic-net#1159) * Global and Interface commands for IPv6 Link local feature * SONiC CLI per interface configuration command to enable and disable the IPv6 link-local address mode when addresses are not configured manually. Signed-off-by: Akhilesh Samineni <akhilesh.samineni@broadcom.com>
What I did
The changes implemented in the final fix are,
(1) In teammgrd: While adding LAG, save the mapping of lag <-> teamd PID
(2) In teammgrd: While removing LAG, remove the mapping of lag <-> teamd PID
(3) Introduce SIGTERM handlers in both teammgrd and teamsyncd
Why I did it
(i) Teamsyncd segfault
(ii) Teamsyncd getting the NETLINK messages with the older IFINDEXs
(iii) "TeamPortSync: Unable to init team socket" error messages
Root cause
“docker stop” sends two signals
SIGTERM to the ENTRYPOINT process in our case it is supervisord process
After 10 sec send SIGKILL to all processes to stop the container.
So when docker stop is issued, supervisord send SIGTERM to all processes ( teamsyncd/temmgrd gets the signal correctly – but I find none of the teamd processes get the signal either because they are daemonized and running in background OR because teamd is not started directly by supervisord).
The teamd processes are killed later with SIGKILL send as a last resort to stop docker. SIGKILL cannot be handled and hence no cleanup – portchanel interfaces remains in kernel.
How I verified it
Tried the config reload back to back multiple times -- verified that the Portchannel interface are cleaned up in kernel, teamd processes are killed correctly.
When the teamd docker is started again, I don't see random NETLINK create/delete messages with the older Portchannel ifindices. ( which was seen earlier without this fix )
Other possible fixes which I tried out, but decided not to take
The other possible fix options which was tried were,
(1) Use the fork/exec way to spawn the teamd processes
(a) Seeing instabilities in all teamd instances NOT coming up when we do “config reload” back to back.
(b) Even with teamd staying in foreground, those processes were not getting the SIGTERM on “docker stop”. So doesn’t serve the purpose of trying NOT to daemonize the teamd.
(2) In the teammgrd Signal handler directly invoke an existing API teammgrd:removeLag() which
cleans up the teamd processes. This was working fine, but the cleanup was taking longer
duration like around 5sec for cleaning up all the teamd processes, which very high
considering it is a signal handler.