PMON container crashes in latest SONiC images #5986

Open
ciju-juniper opened this issue Nov 20, 2020 · 22 comments

@ciju-juniper
Contributor

Description
The 'pmon' container fails to start, with the following errors in the syslog. All the platform monitoring daemons are killed, and pmon itself stops after a few restart attempts.

Nov 19 18:48:15.770882 sonic INFO pmon#supervisord: start sonic-platform package not installed, attempting to install...
Nov 19 18:48:15.770882 sonic INFO pmon#supervisord: start Error: Unable to locate /usr/share/sonic/platform/sonic_platform-1.0-py2-none-any.whl
Nov 19 18:48:16.060439 sonic INFO pmon#supervisord: start sonic-platform package not installed, attempting to install...
Nov 19 18:48:16.060439 sonic INFO pmon#supervisord: start Error: Unable to locate /usr/share/sonic/platform/sonic_platform-1.0-py3-none-any.whl
Nov 19 18:48:16.133494 sonic INFO pmon#supervisord: xcvrd Traceback (most recent call last):
Nov 19 18:48:16.133494 sonic INFO pmon#supervisord: xcvrd   File "/usr/local/bin/xcvrd", line 5, in <module>
Nov 19 18:48:16.133494 sonic INFO pmon#supervisord: xcvrd     from src.xcvrd import main
Nov 19 18:48:16.133494 sonic INFO pmon#supervisord: xcvrd ImportError: No module named src.xcvrd
Nov 19 18:48:16.142902 sonic INFO pmon#supervisor-proc-exit-listener: Process xcvrd exited unxepectedly. Terminating supervisor...

Initial Triage
The last good build on the master branch was build# 481, and pmon crashes are seen from build# 482 onwards. Of the few commits between the two builds, I suspect the following one introduced the breakage:

[submodule]: update sonic-platform-daemons (#5868) (detail / githubweb)

Platform details
This problem likely affects most platforms. I have tested it on Juniper QFX5210 & QFX5200 platforms.

root@sonic:~# show version

SONiC Software Version: SONiC.master.482-aee389e4
Distribution: Debian 10.6
Kernel: 4.19.0-9-2-amd64
Build commit: aee389e4
Build date: Wed Nov 11 06:51:47 UTC 2020
Built by: johnar@jenkins-worker-8

Platform: x86_64-juniper_qfx5200-r0
HwSKU: Juniper-QFX5200-32C-S
ASIC: broadcom
Serial Number: ZA0220160436
Uptime: 17:52:40 up 2 min,  1 user,  load average: 1.76, 0.66, 0.24

show techsupport
Here is the techsupport archive:
sonic_dump_sonic_20201120_175417.tar.gz

@vdahiya12 Could you please take a look? Please let me know if you need any further details. Also, please let me know whether any platform-side changes are required after PR #5868.

@vdahiya12
Contributor

PR #5924 should fix pmon and is already in master.
Is this on the latest master?

@ciju-juniper
Contributor Author

ciju-juniper commented Nov 20, 2020

@vdahiya12 The last image I tried was build# 493. Pmon was crashing on this.

@vdahiya12
Contributor

Can you please try build# 496? It has also been built.

@ciju-juniper
Contributor Author

ciju-juniper commented Nov 20, 2020

@vdahiya12 Downloading build# 496. Once it's complete, I will load it and share the results.

@ciju-juniper
Contributor Author

@vdahiya12 PMON crash is seen in build# 496 also. Please let me know which image will have your fix.

@yxieca
Contributor

yxieca commented Nov 21, 2020

While this issue might have been fixed, the latest master image has a different issue with xcvrd, as described in issue #5994.

@ciju-juniper
Contributor Author

@vdahiya12 @lguohan @jleveque @yxieca I tried build# 500. Most of the services are down, and I am still seeing the following error:

Nov 23 13:23:42.710367 sonic INFO pmon#/supervisord: start sonic-platform package not installed, attempting to install...
Nov 23 13:23:42.710367 sonic INFO pmon#/supervisord: start Error: Unable to locate /usr/share/sonic/platform/sonic_platform-1.0-py2-none-any.whl
Nov 23 13:23:43.041006 sonic INFO pmon#/supervisord: start sonic-platform package not installed, attempting to install...
Nov 23 13:23:43.041156 sonic INFO pmon#/supervisord: start Error: Unable to locate /usr/share/sonic/platform/sonic_platform-1.0-py3-none-any.whl
Nov 23 13:23:43.314650 sonic INFO pmon#/supervisord: xcvrd Traceback (most recent call last):
Nov 23 13:23:43.314650 sonic INFO pmon#/supervisord: xcvrd   File "/usr/local/bin/xcvrd", line 5, in <module>
Nov 23 13:23:43.314650 sonic INFO pmon#/supervisord: xcvrd     from xcvrd.xcvrd import main
Nov 23 13:23:43.314650 sonic INFO pmon#/supervisord: xcvrd   File "/usr/local/lib/python2.7/dist-packages/xcvrd/xcvrd.py", line 26, in <module>
Nov 23 13:23:43.315162 sonic INFO pmon#/supervisord: xcvrd     raise ImportError (str(e) + " - required module not found")
Nov 23 13:23:43.315162 sonic INFO pmon#/supervisord: xcvrd ImportError: No module named sonic_platform.platform - required module not found - required module not found - required module not found
Also seeing errors from syseepromd & psud:
Nov 23 13:24:22.052834 sonic WARNING pmon#syseepromd[29]: Failed to load platform-specific eeprom from sonic_platform package due to ImportError('No module named sonic_platform',)
Nov 23 13:24:22.058782 sonic WARNING pmon#psud[28]: Failed to load chassis due to ImportError('No module named sonic_platform.platform',)

Also seeing an orchagent crash:

Nov 23 13:29:58.513505 sonic ERR swss#orchagent: :- wait: SELECT operation result: TIMEOUT on notify
Nov 23 13:29:58.513505 sonic ERR swss#orchagent: :- wait: failed to get response for notify
Nov 23 13:29:58.513505 sonic ERR swss#orchagent: :- initSaiRedis: Failed to notify syncd INIT_VIEW, rv:-1 gSwitchId 0

@ciju-juniper
Contributor Author

Here is the dump from build# 500:
sonic_dump_sonic_20201123_133333.tar.gz

@jleveque
Contributor

@ciju-juniper: It appears as though your sonic_platform wheel package was not found when the PMon container started up, so it was not installed in the container. Can you please investigate why?
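One way to start that investigation is to check, from inside the pmon container, whether the wheel files named in the syslog exist and whether `sonic_platform` can be imported at all. The following is a minimal sketch that assumes a Python 3 interpreter; the directory and wheel names are copied from the log lines in this issue, and the helper names `missing_wheels` and `is_importable` are hypothetical.

```python
import importlib.util
import os

# Directory and file names taken from the syslog lines in this issue.
PLATFORM_DIR = "/usr/share/sonic/platform"
WHEEL_NAMES = (
    "sonic_platform-1.0-py2-none-any.whl",
    "sonic_platform-1.0-py3-none-any.whl",
)


def missing_wheels(platform_dir=PLATFORM_DIR, wheel_names=WHEEL_NAMES):
    """Return the expected wheel paths that do not exist under platform_dir."""
    paths = [os.path.join(platform_dir, name) for name in wheel_names]
    return [path for path in paths if not os.path.isfile(path)]


def is_importable(module_name="sonic_platform"):
    """Return True if the current interpreter can locate module_name."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    for path in missing_wheels():
        print("missing wheel:", path)
    print("sonic_platform importable:", is_importable())
```

If both wheels are reported missing, the next thing to check is whether the platform's build produces a wheel at all, or only the older plugin files.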

@lguohan
Collaborator

lguohan commented Nov 23, 2020

@jleveque
Contributor

@jleveque, which package are you referring to? I do not see them building their own wheel package. https://sonic-jenkins.westus2.cloudapp.azure.com/job/broadcom/job/buildimage-brcm-all/lastSuccessfulBuild/artifact/target/debs/buster/sonic-platform-juniper-qfx5210_1.1_amd64.deb.log

I see references to files in the sonic_platform package in the log, but I don't see a sonic_platform-1.0-py2-none-any.whl package being built.

@ciju-juniper
Contributor Author

@jleveque I started seeing that error from build# 482. No platform changes are involved.

@ciju-juniper
Contributor Author

@jleveque Moreover, the images are built by jenkins jobs. Were there any changes in the package selection / build rules from build# 482 onwards?

@jleveque
Contributor

@ciju-juniper: You can see the commits which went into build # 482 here: https://sonic-jenkins.westus2.cloudapp.azure.com/job/broadcom/job/buildimage-brcm-all/482/

There are no changes to package selection/build rules that I can see.

@daall
Contributor

daall commented Dec 16, 2020

@ciju-juniper are you still encountering this issue?

@ciju-juniper
Contributor Author

@daall Let me start downloading the latest master image. I will update this issue shortly.

@ciju-juniper
Contributor Author

@daall I tried build# 522. The problem is still seen: PMON exits after several startup attempts.

Dec 16 19:05:30 sonic determine-reboot-cause: sonic_platform package not installed. Unable to detect hardware reboot causes.
Dec 16 19:05:30 sonic determine-reboot-cause: sonic_platform package not installed. Unable to detect hardware reboot causes.
Dec 16 19:05:40.453829 sonic INFO watchdog-control.sh[2100]:     import sonic_platform
Dec 16 19:05:40.453977 sonic INFO watchdog-control.sh[2100]: ModuleNotFoundError: No module named 'sonic_platform'
Dec 16 19:05:40.455009 sonic INFO watchdog-control.sh[2100]: ImportError: No module named 'sonic_platform' - required module not found
Dec 16 19:05:43.362070 sonic INFO pmon#/supervisord: start Error: Unable to locate /usr/share/sonic/platform/sonic_platform-1.0-py2-none-any.whl
Dec 16 19:05:43.771003 sonic INFO pmon#/supervisord: start Error: Unable to locate /usr/share/sonic/platform/sonic_platform-1.0-py3-none-any.whl
Dec 16 19:05:44.316541 sonic INFO pmon#/supervisord: xcvrd ImportError: No module named sonic_platform.platform - required module not found - required module not found - required module not found
Dec 16 19:05:44.319433 sonic WARNING pmon#syseepromd[29]: Failed to load platform-specific eeprom from sonic_platform package due to ImportError('No module named sonic_platform',)
Dec 16 19:05:44.357641 sonic WARNING pmon#psud[28]: Failed to load chassis due to ImportError('No module named sonic_platform.platform',)
Dec 16 19:06:24.575275 sonic INFO pmon#/supervisord: start Error: Unable to locate /usr/share/sonic/platform/sonic_platform-1.0-py2-none-any.whl
Dec 16 19:06:24.860707 sonic INFO pmon#/supervisord: start Error: Unable to locate /usr/share/sonic/platform/sonic_platform-1.0-py3-none-any.whl
Dec 16 19:06:25.147903 sonic WARNING pmon#psud[28]: Failed to load chassis due to ImportError('No module named sonic_platform.platform',)
Dec 16 19:06:25.198652 sonic WARNING pmon#syseepromd[29]: Failed to load platform-specific eeprom from sonic_platform package due to ImportError('No module named sonic_platform',)

Please let me know if you have seen similar issues or have any suggestions to rectify it.

@ciju-juniper
Contributor Author

ciju-juniper commented Dec 16, 2020

@daall I did some debugging by selectively enabling syseepromd and psud. With xcvrd disabled, pmon starts up, although the 'No module named sonic_platform' error is still there. The problem appears only when xcvrd is started.

This is what I get when I manually start xcvrd from pmon:

root@sonic:/var/log/supervisor# xcvrd 
Traceback (most recent call last):
  File "/usr/local/bin/xcvrd", line 5, in <module>
    from xcvrd.xcvrd import main
  File "/usr/local/lib/python2.7/dist-packages/xcvrd/xcvrd.py", line 26, in <module>
    raise ImportError(str(e) + " - required module not found")
ImportError: No module named sonic_platform.platform - required module not found - required module not found - required module not found

This sonic_platform is imported in the xcvrd daemon init:

    # Initialize daemon
    def init(self):
        global platform_sfputil
        global platform_chassis

        self.log_info("Start daemon init...")

        # Load new platform api class
        try:
            import sonic_platform.platform

It's clear that the sonic_platform library is not available for xcvrd to start.

In the last good build# 481 (in master), I see that xcvrd is available at /usr/bin/xcvrd.
In the latest master images, I see that xcvrd has moved to a library: /usr/local/lib/python2.7/dist-packages/xcvrd/xcvrd.py

It looks like something is broken in the library packaging / initialization.
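The traceback above ends with " - required module not found" repeated three times. One plausible reading, which is an inference and not confirmed anywhere in this thread, is that the original ImportError passed through three nested handlers that each re-raise with the same suffix, as the `raise ImportError(str(e) + " - required module not found")` line in xcvrd.py does. The sketch below reproduces only that accumulation; `wrap_import_error`, `innermost`, and `nested_message` are hypothetical names, and the failing import is simulated.

```python
SUFFIX = " - required module not found"


def wrap_import_error(load):
    """Call load(); on ImportError, re-raise with SUFFIX appended."""
    try:
        return load()
    except ImportError as e:
        raise ImportError(str(e) + SUFFIX)


def innermost():
    # Stand-in for the failing 'import sonic_platform.platform'.
    raise ImportError("No module named sonic_platform.platform")


def nested_message(depth):
    """Return the final error message after `depth` nested wrappers."""
    load = innermost
    for _ in range(depth):
        # Bind the current loader so each layer wraps the previous one.
        load = (lambda inner: lambda: wrap_import_error(inner))(load)
    try:
        load()
    except ImportError as e:
        return str(e)


if __name__ == "__main__":
    print(nested_message(3))
```

With a depth of three, the message matches the last line of the traceback, so counting the repeated suffixes gives a rough idea of how many import layers sit between the entry point and the missing module.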

@ciju-juniper
Contributor Author

I found out what's happening. After the xcvrd code moved into a Python library, it started conflicting with the platform's 'sonic_platform' library implementation (which contains chassis.py & platform.py).

As an experiment, I removed the 'sonic_platform' package from platform directory and built an image. There were no crashes of xcvrd and PMON docker is fine.

I do see these messages:

  1. Dec 21 04:56:46.996169 sonic WARNING determine-reboot-cause: sonic_platform package not installed. Unable to detect hardware reboot causes.

  2. Dec 21 04:56:51.297405 sonic WARNING healthd[1092]: sonic_platform package not installed. Cannot start system-health daemon

  3. Dec 21 04:57:00.336922 sonic INFO watchdog-control.sh[1914]: import sonic_platform
    Dec 21 04:57:00.337071 sonic INFO watchdog-control.sh[1914]: ModuleNotFoundError: No module named 'sonic_platform'
    Dec 21 04:57:00.338122 sonic INFO watchdog-control.sh[1914]: ImportError: No module named 'sonic_platform' - required module not found

  4. Dec 21 04:57:35.408520 sonic INFO pmon#/supervisord: start Error: Unable to locate /usr/share/sonic/platform/sonic_platform-1.0-py2-none-any.whl
    Dec 21 04:57:35.743375 sonic INFO pmon#/supervisord: start Error: Unable to locate /usr/share/sonic/platform/sonic_platform-1.0-py3-none-any.whl

  5. Dec 21 04:57:36.091278 sonic WARNING pmon#syseepromd[29]: Failed to load platform-specific eeprom from sonic_platform package due to ImportError('No module named sonic_platform',)

  6. Dec 21 04:57:36.096212 sonic WARNING pmon#psud[28]: Failed to load chassis due to ImportError('No module named sonic_platform.platform',)

  7. Dec 21 04:57:36.097218 sonic WARNING pmon#sonic_y_cable: Failed to load chassis due to NameError("name 'sonic_platform' is not defined",)

Of these errors, the 'reboot_cause' error is expected, as the platform implementation was in chassis.py.

Despite all these errors, psud and syseepromd are running fine. I'm not sure about the ImportError('No module named sonic_platform',).

@daall @jleveque @vdahiya12 What would be your suggestion to get rid of these errors?

@ciju-juniper
Contributor Author

@lguohan @jleveque @vdahiya12 @daall I took a deeper look at error 6 listed in the above comment. It is due to the changes introduced to support chassis-based systems. Similar errors occur in syseepromd, watchdog, xcvrd, etc.

    # Run daemon
    def run(self):
        global platform_psuutil
        global platform_chassis
        self.log_info("Starting up...")
        # Load new platform api class
        try:
            import sonic_platform.platform
            # This will fail for existing pizza box types
            platform_chassis = sonic_platform.platform.Platform().get_chassis()
        except Exception as e:
            # This is the message in error 6
            self.log_warning("Failed to load chassis due to {}".format(repr(e)))

This mandates the implementation of 'sonic_platform.platform.Platform().get_chassis()'. Why was the new implementation done without backward compatibility?

The subsequent code block ensures that psud remains functional for the pizza box types. This explains how the PMON daemons keep running even after getting a 'No module' error.

        # Load platform-specific psuutil class
        if platform_chassis is None:
            try:
                platform_psuutil = self.load_platform_util(PLATFORM_SPECIFIC_MODULE_NAME, PLATFORM_SPECIFIC_CLASS_NAME)
            except Exception as e:
                self.log_error("Failed to load psuutil: %s" % (str(e)), True)
                sys.exit(PSUUTIL_LOAD_ERROR)
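The two psud excerpts together form a "try the new API, fall back to the legacy plugin" pattern. Below is a self-contained sketch of that control flow, not the psud source itself: `load_chassis` and `select_backend` are hypothetical helpers, and the legacy loader is passed in as a plain callable so the flow can be exercised without SONiC installed.

```python
import importlib


def load_chassis(module_name="sonic_platform.platform", log_warning=print):
    """Try the new platform API; return None if it is unavailable."""
    try:
        platform_module = importlib.import_module(module_name)
        return platform_module.Platform().get_chassis()
    except Exception as e:
        # Same wording as the warning quoted as error 6 in this thread.
        log_warning("Failed to load chassis due to {}".format(repr(e)))
        return None


def select_backend(chassis, load_legacy_plugin):
    """Prefer the chassis object; otherwise fall back to the legacy plugin."""
    if chassis is not None:
        return ("platform_api", chassis)
    return ("legacy_plugin", load_legacy_plugin())


if __name__ == "__main__":
    chassis = load_chassis()
    backend, _ = select_backend(chassis, lambda: "legacy psuutil placeholder")
    print("selected backend:", backend)
```

On a platform without the `sonic_platform` package, the warning is logged once and the legacy path is taken, which is consistent with psud continuing to run despite the 'No module' message.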

I'm OK to make any changes in the platform scripts to get rid of this error, but IMHO, the platform common implementation could have been better. Thoughts?

@jleveque
Contributor

@ciju-juniper: The old platform plugins are deprecated, and all vendors should move to the new sonic_platform platform API. All daemons which previously worked with the old plugins were written with backwards-compatibility, but new daemons only reference the new sonic_platform package. In the near future, we will remove support for the old plugins entirely.

Please see https://github.com/Azure/SONiC/wiki/Porting-Guide
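For orientation, the only call chain the daemons quoted in this thread rely on is `sonic_platform.platform.Platform().get_chassis()`. The sketch below shows the smallest shape that satisfies that chain. It is illustrative only: the `Chassis` class, its `get_name` method, and the "example-chassis" value are hypothetical, both classes are collapsed into one file for brevity, and a real package would follow the porting guide rather than this stub.

```python
class Chassis(object):
    """Hypothetical stand-in for a vendor chassis implementation."""

    def get_name(self):
        # Placeholder value; a real chassis would report hardware details.
        return "example-chassis"


class Platform(object):
    """Hypothetical stand-in for the Platform class in platform.py."""

    def __init__(self):
        self._chassis = Chassis()

    def get_chassis(self):
        """Return the chassis object the daemons ask for."""
        return self._chassis


if __name__ == "__main__":
    chassis = Platform().get_chassis()
    print("chassis name:", chassis.get_name())
```

A package shaped like this still has to be built into the `sonic_platform-1.0-py2-none-any.whl` / `py3` wheels that the pmon startup script looks for.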

@abhiranjeet

Hi, I am using an Edgecore switch running SONiC, and the same problem appears here too: PMON crashes. I built this image from the master branch. Is there any branch that has a fix for it?
Here are the docker logs for the crashed container:

sonic-platform package not installed, attempting to install...
Error: Unable to locate /usr/share/sonic/platform/sonic_platform-1.0-py2-none-any.whl
sonic-platform package not installed, attempting to install...
Error: Unable to locate /usr/share/sonic/platform/sonic_platform-1.0-py3-none-any.whl
/usr/local/lib/python3.7/dist-packages/supervisor/options.py:474: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
  'Supervisord is running as root and it is searching '
2021-07-13 13:18:53,444 INFO Included extra file "/etc/supervisor/conf.d/supervisord.conf" during parsing
2021-07-13 13:18:53,444 INFO Set uid to user 0 succeeded
2021-07-13 13:18:53,454 INFO RPC interface 'supervisor' initialized
2021-07-13 13:18:53,454 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2021-07-13 13:18:53,455 INFO supervisord started with pid 1
2021-07-13 13:18:54,459 INFO spawned: 'dependent-startup' with pid 29
2021-07-13 13:18:54,463 INFO spawned: 'supervisor-proc-exit-listener' with pid 30
2021-07-13 13:18:55,983 INFO success: dependent-startup entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2021-07-13 13:18:55,984 INFO success: supervisor-proc-exit-listener entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2021-07-13 13:18:56,007 INFO spawned: 'rsyslogd' with pid 33
2021-07-13 13:18:57,178 INFO success: rsyslogd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2021-07-13 13:18:58,232 INFO spawned: 'chassis_db_init' with pid 37
2021-07-13 13:18:58,234 INFO success: chassis_db_init entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2021-07-13 13:18:58,246 INFO spawned: 'lm-sensors' with pid 38
2021-07-13 13:18:58,249 INFO success: lm-sensors entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2021-07-13 13:18:58,276 INFO spawned: 'fancontrol' with pid 41
2021-07-13 13:18:58,307 INFO spawned: 'xcvrd' with pid 43
2021-07-13 13:18:58,311 INFO success: xcvrd entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2021-07-13 13:18:58,358 INFO spawned: 'psud' with pid 56
2021-07-13 13:18:58,359 INFO success: psud entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2021-07-13 13:18:58,400 INFO spawned: 'syseepromd' with pid 62
2021-07-13 13:18:58,449 INFO spawned: 'pcied' with pid 66
2021-07-13 13:18:58,529 INFO exited: psud (exit status 1; not expected)
2021-07-13 13:18:58,588 INFO spawned: 'psud' with pid 76
2021-07-13 13:18:58,589 INFO success: psud entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2021-07-13 13:18:58,653 WARN received SIGTERM indicating exit request
2021-07-13 13:18:58,653 INFO waiting for dependent-startup, supervisor-proc-exit-listener, rsyslogd, chassis_db_init, lm-sensors, fancontrol, xcvrd, psud, syseepromd, pcied to die
2021-07-13 13:18:58,671 INFO stopped: pcied (terminated by SIGTERM)
2021-07-13 13:18:58,679 INFO stopped: syseepromd (terminated by SIGTERM)
2021-07-13 13:18:58,684 INFO stopped: psud (terminated by SIGTERM)
2021-07-13 13:18:58,688 INFO stopped: xcvrd (terminated by SIGTERM)
2021-07-13 13:18:58,690 INFO stopped: fancontrol (terminated by SIGTERM)
2021-07-13 13:18:58,692 INFO stopped: lm-sensors (terminated by SIGTERM)
2021-07-13 13:18:58,695 INFO reaped unknown pid 85 (exit status 1)
2021-07-13 13:18:58,733 INFO exited: dependent-startup (exit status 3; expected)
2021-07-13 13:18:58,734 INFO stopped: chassis_db_init (terminated by SIGTERM)
2021-07-13 13:18:58,734 INFO reaped unknown pid 80 (exit status 0)
2021-07-13 13:18:58,756 INFO stopped: rsyslogd (exit status 0)
2021-07-13 13:18:58,760 INFO stopped: supervisor-proc-exit-listener (terminated by SIGTERM)
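In the log above, psud exits with status 1 marked "not expected" shortly before supervisord receives SIGTERM, which may be related to the earlier "Terminating supervisor..." message from supervisor-proc-exit-listener. A small sketch for pulling such lines out of a longer supervisord log follows; the regular expression is written against the line format shown here, and `unexpected_exits` is a hypothetical helper name.

```python
import re

# Matches lines such as:
#   2021-07-13 13:18:58,529 INFO exited: psud (exit status 1; not expected)
UNEXPECTED_EXIT = re.compile(
    r"^(?P<ts>\S+ \S+) INFO exited: (?P<name>\S+) "
    r"\(exit status (?P<status>\d+); not expected\)"
)


def unexpected_exits(log_lines):
    """Yield (timestamp, process, exit_status) for each unexpected exit."""
    for line in log_lines:
        match = UNEXPECTED_EXIT.match(line.strip())
        if match:
            yield (
                match.group("ts"),
                match.group("name"),
                int(match.group("status")),
            )


if __name__ == "__main__":
    sample = [
        "2021-07-13 13:18:58,529 INFO exited: psud (exit status 1; not expected)",
        "2021-07-13 13:18:58,733 INFO exited: dependent-startup (exit status 3; expected)",
    ]
    for timestamp, name, status in unexpected_exits(sample):
        print(timestamp, name, status)
```

Exits marked "expected", such as the dependent-startup line, are deliberately skipped so that only the process that triggered the shutdown stands out.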
