Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ssd] Allow individual vendor parsers to handle errors #252

Merged
merged 1 commit into from
Jan 19, 2022

Conversation

alexrallen
Copy link
Contributor

Description

Removed a return None in the command execution helper function _execute_shell in the SSD class so that command output is still parsed in the event of a non-zero exit code.

Motivation and Context

If there is an issue found with a SSD SMART utilities may return a non-zero error code and we would still want to parse the output so we should not bail.

In addition, with the return None the parsing functions throw stack traces during execution. When removed, the parsing functions are able to elegantly handle missing information and the fields are left as N/A so there is no need for handling here.

How Has This Been Tested?

Tested manually on a SN2010 with a malfunctioning SSD which was initially throwing stack traces and now handles the error elegantly reporting the health as N/A a much more informative error than a python stack trace.

@sujinmkang
Copy link
Contributor

@alexrallen What is the type of the disk? Can you please share how to test and what is the output? do we have any subsequent changes following this change?

@alexrallen
Copy link
Contributor Author

The code I am changing is related to the "generic" disk code.

To test this you will need to have a switch where SMART is throwing a non-zero return code. Then just attempt to get the ssd health through the "show" command.

No related changes, this code I am removing was redundant, It does not need to be replaced as the error handling is elsewhere.

@sujinmkang
Copy link
Contributor

@alexrallen The previous change actually make the underlying problem come out to the surface. And we noticed the ssdutil throw exceptions when the disk type is hdd. Instead of hiding the problem, it could be better to find the root cause of the exceptions.
Please refer to sonic-net/sonic-utilities#1957. I am still working on it.

@alexrallen
Copy link
Contributor Author

@sujinmkang I am not sure what you are asking here, can you clarify?

The change I am proposing here is so that a non-zero return code from smartctl does not keep the system from parsing the output which could still be valid and usable.

The precise case this is an issue with is when the SSD fails a smartctl check, the user should still know about it.

I am not seeing how your change is related, can you elaborate a bit further?

@sujinmkang
Copy link
Contributor

@alexrallen What is the exit code from the malfunctioning SSD? What is the output of smatctl? My change is to fix the python stack trace when the disk type is not ssd and smartctl cannot retrieve the health data and returns none.

@alexrallen
Copy link
Contributor Author

The exit code is non-zero and smartctl throws an error that it cannot read specific health metrics from the SSD (I don't have the specific message available)

Otherwise smartctl gives all of the normal output which can be parsed.

@alexrallen
Copy link
Contributor Author

alexrallen commented Jan 6, 2022

@sujinmkang I looked over your PR closely. Your PR will only cover instances when the device is not of type SSD. My change will cover cases in which it IS an SSD but there is a SMART error with it.

Both of these changes should be able to be implemented I believe.

@sujinmkang
Copy link
Contributor

@alexrallen I think this change will also cover the non SSD disk failure case.

@sujinmkang sujinmkang merged commit a8a83e9 into sonic-net:master Jan 19, 2022
judyjoseph pushed a commit that referenced this pull request Jan 23, 2022
Description
Removed a return None in the command execution helper function _execute_shell in the SSD class so that command output is still parsed in the event of a non-zero exit code.

Motivation and Context
If there is an issue found with a SSD SMART utilities may return a non-zero error code and we would still want to parse the output so we should not bail.

In addition, with the return None the parsing functions throw stack traces during execution. When removed, the parsing functions are able to elegantly handle missing information and the fields are left as N/A so there is no need for handling here.

How Has This Been Tested?
Tested manually on a SN2010 with a malfunctioning SSD which was initially throwing stack traces and now handles the error elegantly reporting the health as N/A a much more informative error than a python stack trace.
qiluo-msft pushed a commit that referenced this pull request Mar 31, 2022
Description
Removed a return None in the command execution helper function _execute_shell in the SSD class so that command output is still parsed in the event of a non-zero exit code.

Motivation and Context
If there is an issue found with a SSD SMART utilities may return a non-zero error code and we would still want to parse the output so we should not bail.

In addition, with the return None the parsing functions throw stack traces during execution. When removed, the parsing functions are able to elegantly handle missing information and the fields are left as N/A so there is no need for handling here.

How Has This Been Tested?
Tested manually on a SN2010 with a malfunctioning SSD which was initially throwing stack traces and now handles the error elegantly reporting the health as N/A a much more informative error than a python stack trace.
liat-grozovik pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Apr 7, 2022
Update sonic-platform-common submodule to pickup new commits:
01512ec [SSD]Enhance ssd_generic with more error handling to avoid python crash sonic-net/sonic-platform-common#271
ac3e7f1 [y_cable][Broadcom] update the BRCM y_cable driver to release 2.0 sonic-net/sonic-platform-common#263
573717a [Credo][Ycable] Fix Credo firmware download API download_firmware flag sonic-net/sonic-platform-common#269
a844f18 [xcvr] Add get_module_fw_info method to XcvrApi class. sonic-net/sonic-platform-common#267
35bad16 [sfputil]Refactoring read_porttab_mappings sonic-net/sonic-platform-common#264
83c4345 [SSD Generic] Add support for parsing nvme ssd model, health and temperature sonic-net/sonic-platform-common#265
5da31e1 [ycable][credo] Fix the is_link_active API for Credo Ycable sonic-net/sonic-platform-common#260
931c6ea [Y-Cable][Credo] add theading locker to support thread-safe calling, add SKU check for download_firmware API. sonic-net/sonic-platform-common#222
ff3aa75 Fix SFF8472 Enhanced Options sonic-net/sonic-platform-common#259
a8a83e9 [ssd] Allow individual vendor parsers to handle errors sonic-net/sonic-platform-common#252

Signed-off-by: Kebo Liu <kebol@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants