Metrics for SMART logs are no longer collected #92

Open
lahwaacz opened this issue Oct 30, 2022 · 9 comments

Comments

@lahwaacz
Contributor

The --xall parameter was removed in e884420, but no --log parameter was added to replace it. As a result, the smartctl_device_num_err_log_entries, smartctl_device_error_log_count, smartctl_device_self_test_log_count, and smartctl_device_self_test_log_error_count metrics stay empty, because smartctl does not report the relevant data to the exporter.

@tekert
Contributor

tekert commented Jul 23, 2023

Adding --log=error --log=selftest seems to fix this (tested on master), but I don't know whether it wakes up drives. According to the documentation, it shouldn't.

If someone wants to test it, run this on a device that is in a sleep state:
smartctl --info --health --attributes --tolerance=verypermissive --nocheck=standby --format=brief --log=error --log=selftest <device>

--format=brief is quite redundant with --json

tekert added a commit to tekert/smartctl_exporter that referenced this issue Jul 23, 2023
@koebbe

koebbe commented Oct 2, 2023

Yeah, when we have a drive that fails a self test, it seems that without --log=selftest, there's no way for us to know that an otherwise fine drive has had any problem.

At the very least, in cases like this, it would be nice to get a smartctl.exit_status of something other than zero.

With --log=selftest included, we get an exit_status of 128.
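
For readers decoding that value: the smartctl exit status is a bitmask, and 128 corresponds to bit 7, which the smartctl(8) man page describes as the self-test log containing records of errors. Below is a minimal sketch of a decoder, assuming the bit meanings listed in the man page's exit status section; the function name and the message wording are illustrative and are not part of the exporter.

```python
# Bit meanings paraphrased from the exit status section of smartctl(8).
# The dictionary and function names are hypothetical helpers for illustration.
SMARTCTL_EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed, or device is in a low-power mode",
    2: "a SMART or ATA command failed, or a checksum error was found",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes at or below threshold",
    5: "attributes were at or below threshold at some time in the past",
    6: "device error log contains records of errors",
    7: "device self-test log contains records of errors",
}


def decode_exit_status(status: int) -> list[str]:
    """Return a description for every bit set in a smartctl exit status."""
    return [
        message
        for bit, message in SMARTCTL_EXIT_BITS.items()
        if status & (1 << bit)
    ]


if __name__ == "__main__":
    for line in decode_exit_status(128):
        print(line)
```

Because bit 7 depends on the self-test log being read, this is consistent with the observation above that the exit status stays at zero when --log=selftest is omitted.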

@robbat2
Contributor

robbat2 commented Oct 16, 2023

Even with --nocheck=never, on a sample drive that's loaded to 100% I/O, smartctl returns different output with and without the --xall option.

We need to bring back --xall to get the correct information and populate the fields.

Here's the JSON with and without --xall, using --nocheck=never. (Changing nocheck has no effect in this case; this drive is never idle due to its workload.) Diff included for ease of review.

smartctl-_dev_sdb-info.health.attributes.tolerance_verypermissive.nocheck_never.format_brief.log_error.xall.json
smartctl-_dev_sdb-info.health.attributes.tolerance_verypermissive.nocheck_never.format_brief.log_error.json

@robbat2
Contributor

robbat2 commented Oct 16, 2023

smartctl-xall-json.patch.gz
Sorry, GitHub would not let me attach the patch unless I compressed it.

@robbat2
Contributor

robbat2 commented Dec 7, 2023

@SuperQ @NiceGuyIT should we consider this a smartctl bug or a tradeoff we have to ask users to make?

If users want these metrics, they have to consider that the metrics might wake a drive and prevent idle.

@NiceGuyIT
Member

@robbat2 I hope to do a deep dive into this and a few other issues before or around the holidays. This issue might be related to #152, which was caused by PR #131 that introduced --log=error. Since you added the Python script to save a redacted version of smartctl, I was going to modify that to compare the difference between the smartctl switches so that we can make a logical step forward. I'd rather not play whack-a-mole with the smartctl switches. If there's a tradeoff, it can be documented and left up to the user, while at the same time reported upstream to see what Smartmontools thinks.
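
One way the comparison described above could be approached is to list which JSON keys appear in one smartctl output but not in another. The following is only a sketch under that assumption: it is not the existing redaction script, the function names are hypothetical, and the two sample documents are made up for illustration rather than taken from a real drive.

```python
import json


def key_paths(node, prefix=""):
    """Return the set of dotted key paths found in a nested JSON value."""
    paths = set()
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= key_paths(value, path)
    elif isinstance(node, list):
        # List items share one path so that list length does not matter.
        for item in node:
            paths |= key_paths(item, prefix + "[]")
    return paths


def missing_paths(full, reduced):
    """Return key paths present in `full` but absent from `reduced`."""
    return sorted(key_paths(full) - key_paths(reduced))


if __name__ == "__main__":
    # Made-up stand-ins for output captured with and without --xall.
    with_xall = json.loads(
        '{"smart_status": {"passed": true},'
        ' "ata_smart_self_test_log": {"standard": {"count": 1}}}'
    )
    without_xall = json.loads('{"smart_status": {"passed": true}}')
    for path in missing_paths(with_xall, without_xall):
        print(path)
```

Running the same comparison across each candidate set of switches would show which fields depend on which switch, rather than relying on trial and error.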

@intelfx

intelfx commented Apr 27, 2024

OK, so I was directed here from #190. Seeing as this issue is 1.5 years old, what is the verdict here? As it stands, smartctl_exporter as a project is more or less useless because it fails to collect most of the actually interesting metrics.

@robbat2
Contributor

robbat2 commented Apr 28, 2024

@NiceGuyIT did you make any progress on it? On all of the drives I tried, it seems there's less data without --xall.
I think we should introduce an exporter option like --wake-drives-for-more-data that passes --xall to smartctl, and then the output will be fine. Just document it as potentially waking drives (most of the fleet I care about is never idle anyway).
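
To make the proposal concrete, here is a rough sketch of how such an opt-in could change the smartctl command line. It is written in Python for brevity and does not reflect the exporter's actual code. The base switches are taken from the command quoted earlier in this thread plus --json; the wake_drives_for_more_data parameter mirrors the option name suggested above, which does not exist yet; and falling back to --log=error --log=selftest when the option is off is an assumption.

```python
# Switches from the command quoted earlier in this thread, plus --json.
BASE_ARGS = [
    "--json",
    "--info",
    "--health",
    "--attributes",
    "--tolerance=verypermissive",
    "--nocheck=standby",
    "--format=brief",
]


def build_smartctl_args(device, wake_drives_for_more_data=False):
    """Build a smartctl argument list for one device.

    `wake_drives_for_more_data` is a hypothetical opt-in named after the
    option proposed in this thread; it is not an existing exporter flag.
    """
    args = ["smartctl", *BASE_ARGS]
    if wake_drives_for_more_data:
        # Full output, accepting that this may wake idle drives.
        args.append("--xall")
    else:
        # Assumed fallback: request only the error and self-test logs.
        args += ["--log=error", "--log=selftest"]
    args.append(device)
    return args


if __name__ == "__main__":
    print(" ".join(build_smartctl_args("/dev/sda")))
    print(" ".join(build_smartctl_args("/dev/sda", wake_drives_for_more_data=True)))
```

Keeping the default conservative leaves the tradeoff with the user, which matches the idea of documenting the behavior rather than picking one set of switches for everyone.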

@kinghrothgar

So do current releases still not report a failed self-test in any way? This seems like a very important feature, and in fact it is the whole reason I am looking for a Prometheus exporter.
