Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[201911][caclmgrd] core dump is generated during WR ARP test #11150

Closed
nazariig opened this issue Jun 15, 2022 · 3 comments
Closed

[201911][caclmgrd] core dump is generated during WR ARP test #11150

nazariig opened this issue Jun 15, 2022 · 3 comments
Labels
Bug 🐛 Triaged this issue has been triaged warmboot 👢

Comments

@nazariig
Copy link
Collaborator

Description

The issue is caused by DB unavailability during warm-reboot flow: the container is being stopped before the caclmgrd.
The same situation is also relevant for hostcfgd

Steps to reproduce the issue:

  1. Stop all docker services
root@sonic:/home/admin# docker ps
CONTAINER ID        IMAGE                    COMMAND                  CREATED             STATUS              PORTS               NAMES
1f3115178b5e        docker-database:latest   "/usr/local/bin/dock…"   4 hours ago         Up About an hour                        database
  1. Stop database container
root@sonic:/home/admin# docker stop database
  1. Observe coredump
root@sonic:/home/admin# grep 'cacl' /var/log/syslog
Jun 14 14:17:44.024436 sonic INFO caclmgrd[2892]: terminate called after throwing an instance of 'std::runtime_error'
Jun 14 14:17:44.025580 sonic INFO caclmgrd[2892]:   what():  Unable to read redis reply
Jun 14 14:17:44.303228 sonic WARNING systemd[1]: caclmgrd.service: Main process exited, code=dumped, status=6/ABRT
Jun 14 14:17:44.305897 sonic NOTICE systemd[1]: caclmgrd.service: Unit entered failed state.
Jun 14 14:17:44.306827 sonic WARNING systemd[1]: caclmgrd.service: Failed with result 'core-dump'.

Note: The bug is reproducible also with warm-reboot command

Describe the results you received:

Test:

==================================== ERRORS ====================================
___________________ ERROR at teardown of TestWrArp.testWrArp ___________________

self = <allure_commons._allure.fixture object at 0x7fe46f6dc6d0>, args = ()
kwargs = {}

    def __call__(self, *args, **kwargs):
        self.parameters = func_parameters(self._fixture_function, *args, **kwargs)

        with self:
>           return self._fixture_function(*args, **kwargs)

args       = ()
kwargs     = {}
self       = <allure_commons._allure.fixture object at 0x7fe46f6dc6d0>

/usr/local/lib/python2.7/dist-packages/allure_commons/_allure.py:197:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

duthost = <MultiAsicSonicHost r-tigris-25>

    @pytest.fixture(scope="module", autouse=True)
    def verify_new_core_dumps(duthost):
        if "20191130" in duthost.os_version:
            pre_existing_cores = duthost.shell('ls /var/core/ | grep -v python')['stdout'].split()
        else:
            pre_existing_cores = duthost.shell('ls /var/core/')['stdout'].split()

        yield
        if "20191130" in duthost.os_version:
            cur_cores = duthost.shell('ls /var/core/ | grep -v python')['stdout'].split()
        else:
            cur_cores = duthost.shell('ls /var/core/')['stdout'].split()

        new_core_dumps = set(cur_cores) - set(pre_existing_cores)
        # convert to list so print msg will not contain "set()"
        new_core_dumps = list(new_core_dumps)

        if new_core_dumps:
            pytest.fail("Core dumps found. Expected: %s Found: %s. Test failed. New core dumps: %s" % (len(pre_existing_cores),\
>               len(cur_cores), new_core_dumps))
E           Failed: Core dumps found. Expected: 0 Found: 1. Test failed. New core dumps: [u'python.1655216264.2892.core.gz']

cur_cores  = ['python.1655216264.2892.core.gz']
duthost    = <MultiAsicSonicHost r-tigris-25>
new_core_dumps = ['python.1655216264.2892.core.gz']
pre_existing_cores = []

conftest.py:1596: Failed
=========================== short test summary info ============================
ERROR arp/test_wr_arp.py::TestWrArp::testWrArp - Failed: Core dumps found. Ex...
===================== 1 passed, 1 error in 571.80 seconds ======================

Log:

root@sonic:/home/admin# tail -F /var/log/syslog
Jun 14 14:17:44.013819 sonic INFO hostcfgd[10289]: Traceback (most recent call last):
Jun 14 14:17:44.015133 sonic INFO hostcfgd[10289]:   File "/usr/bin/hostcfgd", line 576, in <module>
Jun 14 14:17:44.016480 sonic INFO hostcfgd[10289]:     main()
Jun 14 14:17:44.017509 sonic INFO hostcfgd[10289]:   File "/usr/bin/hostcfgd", line 572, in main
Jun 14 14:17:44.018512 sonic INFO hostcfgd[10289]:     daemon.start()
Jun 14 14:17:44.019508 sonic INFO hostcfgd[10289]:   File "/usr/bin/hostcfgd", line 567, in start
Jun 14 14:17:44.020506 sonic INFO hostcfgd[10289]:     self.config_db.listen()
Jun 14 14:17:44.021436 sonic INFO hostcfgd[10289]:   File "/usr/local/lib/python2.7/dist-packages/swsssdk/configdb.py", line 104, in listen
Jun 14 14:17:44.022222 sonic INFO hostcfgd[10289]:     for item in self.pubsub.listen():
Jun 14 14:17:44.023141 sonic INFO hostcfgd[10289]:   File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 2501, in listen
Jun 14 14:17:44.024436 sonic INFO caclmgrd[2892]: terminate called after throwing an instance of 'std::runtime_error'
Jun 14 14:17:44.025580 sonic INFO caclmgrd[2892]:   what():  Unable to read redis reply
Jun 14 14:17:44.026602 sonic INFO hostcfgd[10289]:     response = self.handle_message(self.parse_response(block=True))
Jun 14 14:17:44.027340 sonic INFO hostcfgd[10289]:   File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 2430, in parse_response
Jun 14 14:17:44.028092 sonic INFO hostcfgd[10289]:     return self._execute(connection, connection.read_response)
Jun 14 14:17:44.028735 sonic INFO hostcfgd[10289]:   File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 2415, in _execute
Jun 14 14:17:44.029399 sonic INFO hostcfgd[10289]:     connection.connect()
Jun 14 14:17:44.030036 sonic INFO hostcfgd[10289]:   File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 489, in connect
Jun 14 14:17:44.030746 sonic INFO hostcfgd[10289]:     raise ConnectionError(self._error_message(e))
Jun 14 14:17:44.031382 sonic INFO hostcfgd[10289]: redis.exceptions.ConnectionError: Error 111 connecting to 127.0.0.1:6379. Connection refused.
Jun 14 14:17:44.042218 sonic NOTICE systemd[1]: hostcfgd.service: Main process exited, code=exited, status=1/FAILURE
Jun 14 14:17:44.044502 sonic NOTICE systemd[1]: hostcfgd.service: Unit entered failed state.
Jun 14 14:17:44.045415 sonic WARNING systemd[1]: hostcfgd.service: Failed with result 'exit-code'.
Jun 14 14:17:44.084041 sonic INFO containerd[537]: time="2022-06-14T14:17:44.083869385Z" level=info msg="shim reaped" id=1f3115178b5e42ffe023c847fb34184abc8b9f5d
8322d7704e0bfe595acad871
Jun 14 14:17:44.094007 sonic INFO dockerd[779]: time="2022-06-14T14:17:44.093886551Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/
tasks/delete type="*events.TaskDelete"
Jun 14 14:17:44.165933 sonic INFO database.sh[2829]: 137
Jun 14 14:17:44.173126 sonic INFO systemd[1]: Stopped BGP container.
Jun 14 14:17:44.173570 sonic INFO systemd[1]: Stopping Process and docker CPU/memory utilization data export daemon...
Jun 14 14:17:44.173964 sonic INFO procdockerstatsd[2895]: DaemonBase: Caught SIGTERM - exiting...
Jun 14 14:17:44.174379 sonic INFO systemd[1]: Stopped TACACS application.
Jun 14 14:17:44.174767 sonic INFO systemd[1]: Stopped Delays tacacs apply until SONiC has started.
Jun 14 14:17:44.175137 sonic INFO systemd[1]: Stopping Control Plane ACL configuration daemon...
Jun 14 14:17:44.175709 sonic INFO systemd[1]: Stopped Generate Monit configuration file from template.
Jun 14 14:17:44.176557 sonic INFO systemd[1]: Stopped switch state service.
Jun 14 14:17:44.184609 sonic NOTICE systemd[1]: procdockerstatsd.service: Main process exited, code=exited, status=143/n/a
Jun 14 14:17:44.185701 sonic INFO systemd[1]: Stopped Process and docker CPU/memory utilization data export daemon.
Jun 14 14:17:44.186104 sonic NOTICE systemd[1]: procdockerstatsd.service: Unit entered failed state.
Jun 14 14:17:44.186510 sonic WARNING systemd[1]: procdockerstatsd.service: Failed with result 'exit-code'.
Jun 14 14:17:44.219804 sonic INFO database.sh[18430]: database
Jun 14 14:17:44.224088 sonic INFO systemd[1]: Stopped Database container.
Jun 14 14:17:44.226831 sonic NOTICE admin: Stopped all remaining containers ...
Jun 14 14:17:44.303228 sonic WARNING systemd[1]: caclmgrd.service: Main process exited, code=dumped, status=6/ABRT
Jun 14 14:17:44.304883 sonic INFO systemd[1]: Stopped Control Plane ACL configuration daemon.
Jun 14 14:17:44.305897 sonic NOTICE systemd[1]: caclmgrd.service: Unit entered failed state.
Jun 14 14:17:44.306827 sonic WARNING systemd[1]: caclmgrd.service: Failed with result 'core-dump'.
Jun 14 14:17:44.307810 sonic INFO systemd[1]: Stopped Update minigraph and set configuration based on minigraph.
Jun 14 14:17:44.309059 sonic INFO systemd[1]: Stopped Config initialization and migration service.
Jun 14 14:17:44.310044 sonic INFO systemd[1]: Stopped Database container.

Coredump:

root@sonic:/home/admin# cgdb python python.1655219505.27120.core
GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New LWP 27120]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `python /usr/bin/caclmgrd'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f9bf9e4bfff in
---Type <return> to continue, or q <return> to quit---
(gdb) bt
#0  0x00007f9bf9e4bfff in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f9bf9e4d42a in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f9bf64a60ad in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007f9bf64a4066 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f9bf64a40b1 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007f9bf64a42c9 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f9bf78609ea in swss::SubscriberStateTable::readData() () from /usr/lib/x86_64-linux-gnu/libswsscommon.so.0
#7  0x00007f9bf7850017 in swss::Select::poll_descriptors(swss::Selectable**, unsigned int) () from /usr/lib/x86_64-linux-gnu/libswsscommon.so.0
#8  0x00007f9bf78501db in swss::Select::select(swss::Selectable**, int) () from /usr/lib/x86_64-linux-gnu/libswsscommon.so.0
#9  0x00007f9bf7a98deb in _wrap_Select_select__SWIG_0 (args=<optimized out>) at swsscommon_wrap.cpp:13058
#10 _wrap_Select_select (self=<optimized out>, args=<optimized out>) at swsscommon_wrap.cpp:13145
#11 0x000055c450b39eea in PyEval_EvalFrameEx ()
#12 0x000055c450b38005 in PyEval_EvalCodeEx ()
#13 0x000055c450b3fa79 in PyEval_EvalFrameEx ()
#14 0x000055c450b3f77f in PyEval_EvalFrameEx ()
#15 0x000055c450b3f77f in PyEval_EvalFrameEx ()
#16 0x000055c450b38005 in PyEval_EvalCodeEx ()
#17 0x000055c450b37da9 in PyEval_EvalCode ()
#18 0x000055c450b6843f in ?? ()
#19 0x000055c450b633d2 in PyRun_FileExFlags ()
#20 0x000055c450b62f1e in PyRun_SimpleFileExFlags ()
#21 0x000055c450b13d01 in Py_Main ()
#22 0x00007f9bf9e392e1 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#23 0x000055c450b1352a in _start ()

Describe the results you expected:

No coredump is expected

Output of show version:

(paste your output here)

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

  • N/A
@prsunny
Copy link
Contributor

prsunny commented Jun 15, 2022

@sumukhatv , @vaibhavhd for viz

@vaibhavhd
Copy link
Contributor

This is a known issue on 201911. The issue was fixed on 202012 as part of sonic-net/sonic-utilities#2133 and #10511.

This issue is so far not known to cause any functional issues in the down or up path. The fix #10511 touches a lot of files and changes shutdown path for them. Hence, this was not ported to 201911.

@yxieca yxieca added warmboot 👢 Triaged this issue has been triaged labels Jun 22, 2022
@yxieca
Copy link
Contributor

yxieca commented Jun 22, 2022

Close according to the discussion on the issue.

@yxieca yxieca closed this as completed Jun 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug 🐛 Triaged this issue has been triaged warmboot 👢
Projects
None yet
Development

No branches or pull requests

4 participants