[Mellanox]SysfsNotExistError: /var/run/hw-management/system/reset_from_asic not exist #7181

Closed
ZhaohuiS opened this issue Jan 5, 2023 · 6 comments


ZhaohuiS commented Jan 5, 2023

Description
platform_tests.mellanox.test_reboot_cause.test_reboot_cause failed on 2700.

platform_tests/mellanox/test_reboot_cause.py::test_reboot_cause[str-msn2700-02-cpu] 
-------------------------------- live log call ---------------------------------
17:36:28 __init__.pytest_runtest_call             L0040 ERROR  | Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/_pytest/python.py", line 1464, in runtest
    self.ihook.pytest_pyfunc_call(pyfuncitem=self)
  File "/usr/local/lib/python2.7/dist-packages/pluggy/hooks.py", line 286, in __call__
    return self._hookexec(self, self.get_hookimpls(), kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pluggy/manager.py", line 93, in _hookexec
    return self._inner_hookexec(hook, methods, kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pluggy/manager.py", line 87, in <lambda>
    firstresult=hook.spec.opts.get("firstresult") if hook.spec else False,
  File "/usr/local/lib/python2.7/dist-packages/pluggy/callers.py", line 208, in _multicall
    return outcome.get_result()
  File "/usr/local/lib/python2.7/dist-packages/pluggy/callers.py", line 81, in get_result
    _reraise(*ex)  # noqa
  File "/usr/local/lib/python2.7/dist-packages/pluggy/callers.py", line 187, in _multicall
    res = hook_impl.function(*args)
  File "/usr/local/lib/python2.7/dist-packages/_pytest/python.py", line 174, in pytest_pyfunc_call
    testfunction(**testargs)
  File "/azp/_work/46/s/tests/platform_tests/mellanox/test_reboot_cause.py", line 32, in test_reboot_cause
    mocker.mock_reset_from_comex()
  File "/azp/_work/46/s/tests/platform_tests/mellanox/mellanox_thermal_control_test_helper.py", line 1308, in mock_reset_from_comex
    self.mock_helper.mock_value(self.RESET_FROM_COMEX, 1)
  File "/azp/_work/46/s/tests/platform_tests/mellanox/mellanox_thermal_control_test_helper.py", line 225, in mock_value
    raise SysfsNotExistError('{} not exist'.format(file_path))
SysfsNotExistError: /var/run/hw-management/system/reset_from_comex not exist

These 3 files don't exist on the 2700, which causes platform_tests.mellanox.test_reboot_cause.test_reboot_cause to fail.

class RebootCauseMocker(object):
    RESET_RELOAD_BIOS = '/var/run/hw-management/system/reset_reload_bios'
    RESET_FROM_COMEX = '/var/run/hw-management/system/reset_from_comex'
    RESET_FROM_ASIC = '/var/run/hw-management/system/reset_from_asic'
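
A pre-check along these lines would make the missing files explicit before any mocking starts. This is only a minimal sketch, not the helper used in the test suite: `missing_sysfs_files` and `RESET_FILES` are hypothetical names, and the check runs against the local filesystem unless a different `exists` callable is passed in (for example one that queries the device under test).

```python
import os

# Paths copied from RebootCauseMocker as quoted in this issue.
RESET_FILES = [
    '/var/run/hw-management/system/reset_reload_bios',
    '/var/run/hw-management/system/reset_from_comex',
    '/var/run/hw-management/system/reset_from_asic',
]


def missing_sysfs_files(paths, exists=os.path.exists):
    """Return the subset of paths that do not exist.

    `exists` is injectable (an assumption of this sketch) so the check
    can be pointed at a remote device or a fake filesystem; by default
    it checks the local filesystem.
    """
    return [p for p in paths if not exists(p)]


if __name__ == '__main__':
    missing = missing_sysfs_files(RESET_FILES)
    if missing:
        print('missing reset-cause files: {}'.format(', '.join(missing)))
    else:
        print('all reset-cause files present')
```

Reporting all missing paths at once avoids discovering them one failure at a time, as happened with reset_from_comex in the traceback above.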

Another test case, test_system_health, failed for a similar reason:
SysfsNotExistError: /run/hw-management/thermal/psu1_temp_max not exist

system_health/test_system_health.py::test_device_checker[str2-msn4600c-acs-04] 
-------------------------------- live log call ---------------------------------
15:55:58 __init__.pytest_runtest_call             L0040 ERROR  | Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/_pytest/python.py", line 1464, in runtest
    self.ihook.pytest_pyfunc_call(pyfuncitem=self)
  File "/usr/local/lib/python2.7/dist-packages/pluggy/hooks.py", line 286, in __call__
    return self._hookexec(self, self.get_hookimpls(), kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pluggy/manager.py", line 93, in _hookexec
    return self._inner_hookexec(hook, methods, kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pluggy/manager.py", line 87, in <lambda>
    firstresult=hook.spec.opts.get("firstresult") if hook.spec else False,
  File "/usr/local/lib/python2.7/dist-packages/pluggy/callers.py", line 208, in _multicall
    return outcome.get_result()
  File "/usr/local/lib/python2.7/dist-packages/pluggy/callers.py", line 81, in get_result
    _reraise(*ex)  # noqa
  File "/usr/local/lib/python2.7/dist-packages/pluggy/callers.py", line 187, in _multicall
    res = hook_impl.function(*args)
  File "/usr/local/lib/python2.7/dist-packages/_pytest/python.py", line 174, in pytest_pyfunc_call
    testfunction(**testargs)
  File "/azp/_work/48/s/tests/system_health/test_system_health.py", line 226, in test_device_checker
    psu_mock_result, psu_name = device_mocker.mock_psu_temperature(False)
  File "/azp/_work/48/s/tests/system_health/mellanox/mellanox_device_mocker.py", line 121, in mock_psu_temperature
    threshold = self.psu_data.get_psu_temperature_threshold()
  File "/azp/_work/48/s/tests/system_health/mellanox/mellanox_device_mocker.py", line 55, in get_psu_temperature_threshold
    value = self.helper.read_value(threshold_file)
  File "/azp/_work/48/s/tests/platform_tests/mellanox/mellanox_thermal_control_test_helper.py", line 247, in read_value
    raise SysfsNotExistError('{} not exist'.format(file_path))
SysfsNotExistError: /run/hw-management/thermal/psu1_temp_max not exist

Steps to reproduce the issue:

  1. Run platform_tests.mellanox.test_reboot_cause.test_reboot_cause on a 2700.
  2. Some 4600 devices have these system files, but on some 2700 devices they are not found. How can these files be generated on such devices?
  3. Or should this test case be skipped on 2700?

Describe the results you received:

admin@str-msn2700-02:/var/run/hw-management/system$ ls -al | grep reset_from_asic
admin@str-msn2700-02:/var/run/hw-management/system$ 

Describe the results you expected:

admin@str2-msn4600c-acs-04:/var/run/hw-management/system$ ls -al | grep reset_from_asic
lrwxrwxrwx  1 root root  68 Jan  5 06:08 reset_from_asic -> /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon1/reset_from_asic


ZhaohuiS commented Jan 5, 2023

@echuawu
These 3 files were added in your PR #6944.
Could you please take a look at it? Thanks.


echuawu commented Jan 5, 2023

You should skip these platforms; they do not support this feature:
'x86_64-mlnx_msn2010-r0',
'x86_64-mlnx_msn2700-r0',
'x86_64-mlnx_msn2100-r0',
'x86_64-mlnx_msn2410-r0',
'x86_64-nvidia_sn2201-r0'
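
One way such a skip could be expressed is sketched below. It is an illustration under assumptions, not necessarily what the eventual fix does: `UNSUPPORTED_PLATFORMS` and `reboot_cause_mock_supported` are hypothetical names, and how the test obtains the platform string is left out.

```python
# Platforms listed in the comment above as not supporting this feature.
UNSUPPORTED_PLATFORMS = frozenset([
    'x86_64-mlnx_msn2010-r0',
    'x86_64-mlnx_msn2700-r0',
    'x86_64-mlnx_msn2100-r0',
    'x86_64-mlnx_msn2410-r0',
    'x86_64-nvidia_sn2201-r0',
])


def reboot_cause_mock_supported(platform):
    """Return True unless the platform is in the unsupported list."""
    return platform not in UNSUPPORTED_PLATFORMS


# Inside a pytest test body the check could gate the test, for example:
#     if not reboot_cause_mock_supported(platform):
#         pytest.skip('{} has no reset-cause sysfs files'.format(platform))
```

Keeping the list in one constant means a newly reported platform needs a one-line change rather than another conditional in the test.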


ZhaohuiS commented Jan 5, 2023

@echuawu Thank you for your suggestion.
I skipped these platforms in #7183, please review it, thanks.

BTW, another test case, test_system_health::test_device_checker, failed for a similar reason:
SysfsNotExistError: /run/hw-management/thermal/psu1_temp_max not exist

There are 2 x86_64-mlnx_msn4600c-r0 testbeds; one has psu1_temp_max but the other doesn't.
Do you know how to generate this file? Their firmware status is the same.

admin@str2-msn4600c-acs-04:/run/hw-management/thermal$ ls -al | grep psu1_temp_max 
admin@str2-msn4600c-acs-04:/run/hw-management/thermal$ show platform sum
Platform: x86_64-mlnx_msn4600c-r0
HwSKU: ACS-MSN4600C
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2115X26247
Model Number: MSN4600-CS2RO
Hardware Revision: A1
admin@str2-msn4600c-acs-01:/run/hw-management/thermal$ ls -al | grep psu1_temp_max 
lrwxrwxrwx  1 root root   85 Jan  5 04:14 psu1_temp_max -> /sys/devices/platform/mlxplat/i2c_mlxcpld.1/i2c-1/i2c-4/4-0059/hwmon/hwmon4/temp1_max
lrwxrwxrwx  1 root root   91 Jan  5 04:14 psu1_temp_max_alarm -> /sys/devices/platform/mlxplat/i2c_mlxcpld.1/i2c-1/i2c-4/4-0059/hwmon/hwmon4/temp1_max_alarm
admin@str2-msn4600c-acs-01:/run/hw-management/thermal$ show platform sum
Platform: x86_64-mlnx_msn4600c-r0
HwSKU: ACS-MSN4600C
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2115X26246
Model Number: MSN4600-CS2RO
Hardware Revision: A1
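
The two listings differ only in whether the psu1_temp_max symlink is present. When comparing testbeds like this, it can help to tell a link that was never created apart from a link whose target has disappeared. The sketch below is a hypothetical helper (`describe_sysfs_entry` is not part of the test suite) and is meant to be run on the device itself.

```python
import os


def describe_sysfs_entry(path):
    """Classify a hw-management entry as 'missing', 'dangling', or 'ok'.

    'missing'  : nothing exists at path, not even a symlink.
    'dangling' : a symlink exists but its target does not.
    'ok'       : the path resolves to an existing file.
    """
    if not os.path.lexists(path):
        return 'missing'
    if not os.path.exists(path):
        return 'dangling'
    return 'ok'


if __name__ == '__main__':
    path = '/run/hw-management/thermal/psu1_temp_max'
    print('{}: {}'.format(path, describe_sysfs_entry(path)))
```

A 'missing' result matches the empty `ls` output on str2-msn4600c-acs-04 above; the classification narrows down what is absent but does not by itself explain why.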


echuawu commented Jan 6, 2023

Sorry, I am not familiar with the generation.

@ZhaohuiS

The issue is fixed now in #7183.

The missing psu1_temp_max file was caused by an unhealthy PDU; after replacing the PDU, it works.

@ZhaohuiS

The issue is fixed in #7183
