Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some of Netdevice created by teamd are not cleaned up when teamd service is disabled. #1450

Merged
merged 1 commit into from
Sep 28, 2020

Conversation

abdosi
Copy link
Contributor

@abdosi abdosi commented Sep 27, 2020

Why I did:

The Netdevice created by teamd process not getting clean when Teamd feature/docker/service is disabled/stop
as observed and fixed by 2PR's -

#1159
#1407

is still being observed on Multi-asic platforms consistently and also on KVM VS testbed ( it is not consistent but happens every few runs (sonic-net/sonic-buildimage#5432)

Issue seems to be timing issue where
SIGTERM (generated by docker stop) and send via kill() of teammgrd to teamd was not cleaning
all teamd process netdevice resources.

How I did:
So fix is Instead of sending explicit SIGTERM via kill() system call to teamd we are calling
teamd -k. Using this teamd itself generate SIGTERM (https://github.com/jpirko/libteam/blob/master/teamd/teamd.c#L1861) and handle the processing.

With this change reverted (Lag alias <-> to pid mapping) done in PR #1159 as these is not needed as we are not using using kill() call.

How I verified:
a) Executed the below script 50 times on Multu-asic platforms and things were fine.

   #!/bin/bash

    i=0
    while [ $i -lt 50 ]; do
        echo "iteration $i"
        sudo config feature state teamd disabled
        sleep 60
        rc=0
        (sudo ip -n asic0 link show | grep -qc PortChannel) && rc=$((rc+1))
        (sudo ip -n asic1 link show | grep -qc PortChannel) && rc=$((rc+1))
        (sudo ip -n asic2 link show | grep -qc PortChannel) && rc=$((rc+1))
        (sudo ip -n asic3 link show | grep -qc PortChannel) && rc=$((rc+1))
        (sudo ip -n asic4 link show | grep -qc PortChannel) && rc=$((rc+1))
        (sudo ip -n asic5 link show | grep -qc PortChannel) && rc=$((rc+1))
        if [ $rc -ne 0 ]; then
           echo "Failed iteration $i for portchannel cleanup"
           break
        fi
        sudo config feature state teamd enabled
        sleep 60
        (sudo ip -n asic0 link show | grep -qc PortChannel) && rc=$((rc+1))
        (sudo ip -n asic1 link show | grep -qc PortChannel) && rc=$((rc+1))
        (sudo ip -n asic2 link show | grep -qc PortChannel) && rc=$((rc+1))
        (sudo ip -n asic3 link show | grep -qc PortChannel) && rc=$((rc+1))
        (sudo ip -n asic4 link show | grep -qc PortChannel) && rc=$((rc+1))
        (sudo ip -n asic5 link show | grep -qc PortChannel) && rc=$((rc+1))
        if [ $rc != 6 ]; then
           echo "Failed iteration $i for portchannel bringup"
           break
        fi
        i=$((i+1))
    done
    echo "Completed $i iteration sucessfully"

b) on KVM VS

    #!/bin/bash

    i=0
    while [ $i -lt 50 ]; do
        echo "iteration $i"
        sudo config feature state teamd disabled
        sleep 60
        rc=0
        (sudo ip link show | grep -qc PortChannel) && rc=$((rc+1))
        if [ $rc -ne 0 ]; then
           echo "Failed iteration $i for portchannel cleanup"
           break
        fi
        sudo config feature state teamd enabled
        sleep 60
        (sudo ip link show | grep -qc PortChannel) && rc=$((rc+1))
        if [ $rc != 1 ]; then
           echo "Failed iteration $i for portchannel bringup"
           break
        fi
        i=$((i+1))
    done
    echo "Completed $i iteration sucessfully"

a) Before fix:
admin@vlab-01:~$ bash teamd_cleanup.sh
iteration 0
iteration 1
iteration 2
iteration 3
Failed iteration 3 for portchannel cleanup
Completed 3 iteration sucessfully

admin@vlab-01:~$ docker ps | grep -c teamd
0
admin@vlab-01:~$ ip link show | grep PortChannel
330: PortChannel0004: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 9100 qdisc noqueue state DOWN mode DEFAULT 
group default qlen 1000

After fix looks fine.

c) Run script to make sure BGP is fine after config reload on multi-asic platforms

    #!/bin/bash

    i=0
    while [ $i -lt 25 ]; do
        echo "iteration $i"
        sudo config reload -y
        sleep 180
        rc=0
        (show ip bgp summary -n asic0 -d all | grep -qic Active) && rc=$((rc+1))
        (show ip bgp summary -n asic0 -d all | grep -qic Connect) && rc=$((rc+1))
        (show ip bgp summary -n asic1 -d all | grep -qic Active) && rc=$((rc+1))
        (show ip bgp summary -n asic1 -d all | grep -qic Connect) && rc=$((rc+1))
        (show ip bgp summary -n asic2 -d all | grep -qic Active) && rc=$((rc+1))
        (show ip bgp summary -n asic2 -d all | grep -qic Connect) && rc=$((rc+1))
        (show ip bgp summary -n asic3 -d all | grep -qic Active) && rc=$((rc+1))
        (show ip bgp summary -n asic3 -d all | grep -qic Connect) && rc=$((rc+1))
        (show ip bgp summary -n asic4 -d all | grep -qic Active) && rc=$((rc+1))
        (show ip bgp summary -n asic4 -d all | grep -qic Connect) && rc=$((rc+1))
        (show ip bgp summary -n asic5 -d all | grep -qic Active) && rc=$((rc+1))
        (show ip bgp summary -n asic5 -d all | grep -qic Connect) && rc=$((rc+1))
        if [ $rc -ne 0 ]; then
           echo "Failed iteration $i all BGP not up"
           break
        fi
        i=$((i+1))
    done
    echo "Completed $ iteration sucessfully"

@abdosi abdosi changed the title Some of Netdevice created by teamd are not cleaned up when teamd servie is disabled. Some of Netdevice created by teamd are not cleaned up when teamd service is disabled. Sep 27, 2020
not cleaned up.

Issue was seen in Multi-asic platform and seems to be timing issue where
SIGTERM send via kill systemcall of teammgrd to teamd was not cleaning
all teamd process.

Sp fix is Instead  of sending explicit SIGTERM to teamd we are calling
teamd -k. Using this teamd itself generate SIGTERM and handle the
processing.

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
Copy link
Contributor

@pavel-shirshov pavel-shirshov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@judyjoseph
Copy link
Contributor

judyjoseph commented Sep 28, 2020

LGTM.
As an additional trial, in the test script .. could you try "sudo config reload -y" instead of "sudo config feature state teamd disabled" ? This is just to make sure .. when we have more CPU usage with more processes going down-- the cleanup still happens correctly.

@judyjoseph Running test to make sure all BGP session are up after config reload on Multi-asic platforms. This way it will verify end -to end as portchannel gets cleaned up and ip interface are recreated correct on those port-channel and BGP is up.

On Multi-asic platforms we have seen once netdev create failed then assigning ip address to those interface also fail and then finally bgp also does not come up.

@judyjoseph Updated config reload result with BGP summary and looks good.

@abdosi abdosi merged commit 76e2251 into sonic-net:master Sep 28, 2020
@abdosi abdosi deleted the teamd_kill branch September 28, 2020 21:06
abdosi added a commit that referenced this pull request Sep 28, 2020
…ere (#1450)

not cleaned up.

Issue was seen in Multi-asic platform and seems to be timing issue where
SIGTERM send via kill systemcall of teammgrd to teamd was not cleaning
all teamd process.

Sp fix is Instead  of sending explicit SIGTERM to teamd we are calling
teamd -k. Using this teamd itself generate SIGTERM and handle the
processing.

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
theasianpianist pushed a commit to theasianpianist/sonic-swss that referenced this pull request Oct 2, 2020
…ere (sonic-net#1450)

not cleaned up.

Issue was seen in Multi-asic platform and seems to be timing issue where
SIGTERM send via kill systemcall of teammgrd to teamd was not cleaning
all teamd process.

Sp fix is Instead  of sending explicit SIGTERM to teamd we are calling
teamd -k. Using this teamd itself generate SIGTERM and handle the
processing.

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
theasianpianist added a commit to theasianpianist/sonic-swss that referenced this pull request Oct 2, 2020
EdenGri pushed a commit to EdenGri/sonic-swss that referenced this pull request Feb 28, 2022
…1450)

Migrate from using the `imp` module to using the `importlib` module. As of Python 3, the `imp` module has been deprecated in favor of the `importlib` module.

Place logic in a new function, `load_module_from_source()` in a new file, `utilities_common/general.py`

Also fix some formatting
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants