Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Teamd :: fix for cleaning up the teamd processes correctly on teamd docker stop #1159

Merged
merged 7 commits into from
Jan 15, 2020

Conversation

judyjoseph
Copy link
Contributor

@judyjoseph judyjoseph commented Dec 23, 2019

What I did

The changes implemented in the final fix are,
(1) In teammgrd: While adding LAG, save the mapping of lag <-> teamd PID
(2) In teammgrd: While removing LAG, remove the mapping of lag <-> teamd PID
(3) Introduce SIGTERM handlers in both teammgrd and teamsyncd

  • (a) In teammgrd, send the SIGTERM to all the teamd processes. It takes just a second for teamd processes + port channel interfaces to be cleaned up
  • (b) In teamsyncd, clean up the teamd resources like nl_socket etc.

Why I did it

There were multiple Issues seen, in random on doing a config reload/ config loadminigraph

(i) Teamsyncd segfault
(ii) Teamsyncd getting the NETLINK messages with the older IFINDEXs
(iii) "TeamPortSync: Unable to init team socket" error messages

Root cause

 With config reload command , the docker teamd stop and start is issued. I find that after "docker stop" the port channel interfaces remains in the linux kernel – it is not getting cleaned. There are indeed signal handlers in the teamd processes. But in this case teamd is started as a “daemon” with the “-d” option. The supervisord is not able to send the SIGTERM to the processes started in a daemon mode (http://supervisord.org/subprocess.html?highlight=daemon)

“docker stop” sends two signals

  1. SIGTERM to the ENTRYPOINT process in our case it is supervisord process

  2. After 10 sec send SIGKILL to all processes to stop the container.

    So when docker stop is issued, supervisord send SIGTERM to all processes ( teamsyncd/temmgrd gets the signal correctly – but I find none of the teamd processes get the signal either because they are daemonized and running in background OR because teamd is not started directly by supervisord).

The teamd processes are killed later with SIGKILL send as a last resort to stop docker. SIGKILL cannot be handled and hence no cleanup – portchanel interfaces remains in kernel.

How I verified it

Tried the config reload back to back multiple times -- verified that the Portchannel interface are cleaned up in kernel, teamd processes are killed correctly.
When the teamd docker is started again, I don't see random NETLINK create/delete messages with the older Portchannel ifindices. ( which was seen earlier without this fix )

Other possible fixes which I tried out, but decided not to take

The other possible fix options which was tried were,

(1) Use the fork/exec way to spawn the teamd processes

  • (a) Seeing instabilities in all teamd instances NOT coming up when we do “config reload” back to back.

  • (b) Even with teamd staying in foreground, those processes were not getting the SIGTERM on “docker stop”. So doesn’t serve the purpose of trying NOT to daemonize the teamd.

(2) In the teammgrd Signal handler directly invoke an existing API teammgrd:removeLag() which
cleans up the teamd processes. This was working fine, but the cleanup was taking longer
duration like around 5sec for cleaning up all the teamd processes, which very high
considering it is a signal handler.

When the teamd docker receives a stop signal, only the processes started by supervisord gets
the SIGTERM, so this fix is to propogate the signal to teamd processes
via the signal handler in teamsyncd process.
@madhukar-kamarapu
Copy link

madhukar-kamarapu commented Dec 24, 2019

@judyjoseph - this change would address bunch of issues we've been seeing in teamsyncd during config-reload and docker-restart. Thanks again.
One minor (log related) comment - can you please look into it.

Copy link
Contributor

@pavel-shirshov pavel-shirshov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please check comments

teamsyncd/teamsync.cpp Outdated Show resolved Hide resolved
teamsyncd/teamsync.cpp Outdated Show resolved Hide resolved
teamsyncd/teamsync.cpp Outdated Show resolved Hide resolved
teamsyncd/teamsync.cpp Outdated Show resolved Hide resolved
teamsyncd/teamsync.cpp Outdated Show resolved Hide resolved
teamsyncd/teamsync.cpp Outdated Show resolved Hide resolved
@judyjoseph
Copy link
Contributor Author

@judyjoseph - this change would address bunch of issues we've been seeing in teamsyncd during config-reload and docker-restart. Thanks again.
I have no major comments/concerns on this PR. One minor (log related) comment - can you please look into it.

@judyjoseph - this change would address bunch of issues we've been seeing in teamsyncd during config-reload and docker-restart. Thanks again.
I have no major comments/concerns on this PR. One minor (log related) comment - can you please look into it.

I did testing with config reload and config loadminigraph and seen things are ok. These commands does docker stop and docker start explicitly in a sequence starting with swss docker.

What is the use case of docker restart ? are you referring to "docker restart teamd"? Is this command supported ? Currently I see swss, syncd, teamd tightly bound and think it will go to an inconsistent state if we just restart teamd alone.

@madhukar-kamarapu
Copy link

madhukar-kamarapu commented Dec 24, 2019

@judyjoseph - this change would address bunch of issues we've been seeing in teamsyncd during config-reload and docker-restart. Thanks again.
I have no major comments/concerns on this PR. One minor (log related) comment - can you please look into it.

@judyjoseph - this change would address bunch of issues we've been seeing in teamsyncd during config-reload and docker-restart. Thanks again.
I have no major comments/concerns on this PR. One minor (log related) comment - can you please look into it.

I did testing with config reload and config loadminigraph and seen things are ok. These commands does docker stop and docker start explicitly in a sequence starting with swss docker.

What is the use case of docker restart ? are you referring to "docker restart teamd"? Is this command supported ? Currently I see swss, syncd, teamd tightly bound and think it will go to an inconsistent state if we just restart teamd alone.

I was referring to "systemctl restart teamd". Teamd-docker upgrade (sonic_installer upgrade_docker teamd docker-teamd.gz --cleanup_image -y) issues this command internally.
Say, there are NO configurations on the port-channel (VLAN, IP, etc) => docker upgrade can be done (it wouldn't impact applications).
Note: If there is any application configuration - all applications need to register for port-channel events like (create/delete/up/down/etc) to handle the teamd-docker-upgrade OR teamd-service-restart. Reason - new port-channel netdevices (with new ifindex) are created in the kernel.

… and checking for the flag in the main loop.

This way the cleanUp handler is not run in the signal Handler context and can add more Logs as we
need not care for signal safety now.
Teammgrd tracks the PID's of the teamd processes and sents the SIGTERM
signal when the teamd docker is stopped.
Copy link
Contributor

@pavel-shirshov pavel-shirshov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please answer on my questions?

cfgmgr/teammgr.h Outdated Show resolved Hide resolved
cfgmgr/teammgr.cpp Outdated Show resolved Hide resolved
cfgmgr/teammgr.cpp Show resolved Hide resolved
cfgmgr/teammgrd.cpp Show resolved Hide resolved
cfgmgr/teammgrd.cpp Show resolved Hide resolved
teamsyncd/teamsyncd.cpp Show resolved Hide resolved
}
else
{
SWSS_LOG_WARN("Cound not find PID corresponding to LaG %s ", it.c_str());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please make it as an error?
Also It's better to say "Can't send TERM signal to LAG %s. PID wasn't found"

Copy link
Contributor

@pavel-shirshov pavel-shirshov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good for me

@@ -17,13 +18,24 @@ bool gLogRotate = false;
ofstream gRecordOfs;
string gRecordFile;

bool received_sigterm = false;

void sig_handler(int signo)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

signo [](start = 21, length = 5)

check signo == SIGTERM first?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Signal handler is only registered only for SIGTERM now. The handler will be called only on SIGTERM signal.

@@ -9,6 +10,14 @@
using namespace std;
using namespace swss;

bool received_sigterm = false;

void sig_handler(int signo)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

signo [](start = 21, length = 5)

the same

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Signal handler is only registered only for SIGTERM now. The handler will be called only on SIGTERM signal.

@@ -33,6 +45,12 @@ int main(int argc, char **argv)
s.addSelectable(&netlink);
while (true)
{
if(received_sigterm)
{
sync.cleanTeamSync();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[](start = 16, length = 2)

Inconsistent indentation.

yxieca pushed a commit that referenced this pull request Jan 16, 2020
…ocker stop (#1159)

* Send explicit signal to the teamd processes whenthe teamd docker exits.

When the teamd docker receives a stop signal, only the processes started by supervisord gets
the SIGTERM, so this fix is to propogate the signal to teamd processes
via the signal handler in teamsyncd process.

* Updates to take care of boundary conditions in the teamsyncd signal handler.

* Better way of signal Handling by setting a flag in the signal handler and checking for the flag in the main loop.
This way the cleanUp handler is not run in the signal Handler context and can add more Logs as we
need not care for signal safety now.

* Updated the logic so that teammgrd controls the lifecycle of teamd.
Teammgrd tracks the PID's of the teamd processes and sents the SIGTERM
signal when the teamd docker is stopped.

* Minor change in the function defenition

* Updates based on the comments

* Minor update in teammgr.cpp
abdosi pushed a commit that referenced this pull request Jan 21, 2020
…ocker stop (#1159)

* Send explicit signal to the teamd processes whenthe teamd docker exits.

When the teamd docker receives a stop signal, only the processes started by supervisord gets
the SIGTERM, so this fix is to propogate the signal to teamd processes
via the signal handler in teamsyncd process.

* Updates to take care of boundary conditions in the teamsyncd signal handler.

* Better way of signal Handling by setting a flag in the signal handler and checking for the flag in the main loop.
This way the cleanUp handler is not run in the signal Handler context and can add more Logs as we
need not care for signal safety now.

* Updated the logic so that teammgrd controls the lifecycle of teamd.
Teammgrd tracks the PID's of the teamd processes and sents the SIGTERM
signal when the teamd docker is stopped.

* Minor change in the function defenition

* Updates based on the comments

* Minor update in teammgr.cpp
prsunny pushed a commit that referenced this pull request Jan 28, 2020
…ocker stop (#1159)

* Send explicit signal to the teamd processes whenthe teamd docker exits.

When the teamd docker receives a stop signal, only the processes started by supervisord gets
the SIGTERM, so this fix is to propogate the signal to teamd processes
via the signal handler in teamsyncd process.

* Updates to take care of boundary conditions in the teamsyncd signal handler.

* Better way of signal Handling by setting a flag in the signal handler and checking for the flag in the main loop.
This way the cleanUp handler is not run in the signal Handler context and can add more Logs as we
need not care for signal safety now.

* Updated the logic so that teammgrd controls the lifecycle of teamd.
Teammgrd tracks the PID's of the teamd processes and sents the SIGTERM
signal when the teamd docker is stopped.

* Minor change in the function defenition

* Updates based on the comments

* Minor update in teammgr.cpp
@judyjoseph judyjoseph deleted the teamd_fix branch January 30, 2020 15:07
abdosi added a commit to abdosi/sonic-build-tools that referenced this pull request Sep 8, 2020
Test make sure cleanup happens of Port-channel Kernel devices.
This test case track the fixes done by PR:
sonic-net/sonic-swss#1407
sonic-net/sonic-swss#1159

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
abdosi added a commit to Azure/sonic-build-tools that referenced this pull request Sep 9, 2020
* Added the test case for Port Channel cleanup.
Test make sure cleanup happens of Port-channel Kernel devices.
This test case track the fixes done by PR:
sonic-net/sonic-swss#1407
sonic-net/sonic-swss#1159

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>

* Address Review Comments

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
EdenGri pushed a commit to EdenGri/sonic-swss that referenced this pull request Feb 28, 2022
sonic-net#1159)

* Global and Interface commands for IPv6 Link local feature
* SONiC CLI per interface configuration command to enable and disable the IPv6 link-local address mode when addresses are not configured manually.
Signed-off-by: Akhilesh Samineni <akhilesh.samineni@broadcom.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants