0.6 issues #13

Closed
kfox1111 opened this issue Oct 29, 2020 · 11 comments · Fixed by #18

@kfox1111

If I start it:

/ # smartctl_exporter 
[Warning] S.M.A.R.T. output reading error: exit status 4
[Warning] SMART status check returned 'DISK OK' but we found that some (usage or prefail) Attributes have been <= threshold at some time in the past.
[Info] Starting on localhost:9633/metrics

If I access it:

/ # wget -O - http://127.0.0.1:9633/metrics
Connecting to 127.0.0.1:9633 (127.0.0.1:9633)
wget: error getting response

Back at the logs:

goroutine 21 [running]:
main.(*SMARTctlInfo).mineVersion(0xc000072ec0)
	/go/smartctl_exporter-smartctl_exporter_0.6/smartctlinfo.go:46 +0xb65
main.(*SMARTctlInfo).Collect(...)
	/go/smartctl_exporter-smartctl_exporter_0.6/smartctlinfo.go:35
main.SMARTctlManagerCollector.Collect(0xc000216180)
	/go/smartctl_exporter-smartctl_exporter_0.6/main.go:37 +0x1da
github.com/prometheus/client_golang/prometheus.(*wrappingCollector).Collect.func1(0xc0001b0270, 0xc000216180)
	/go/smartctl_exporter-smartctl_exporter_0.6/vendor/src/github.com/prometheus/client_golang/prometheus/wrap.go:129 +0x3d
created by github.com/prometheus/client_golang/prometheus.(*wrappingCollector).Collect
	/go/smartctl_exporter-smartctl_exporter_0.6/vendor/src/github.com/prometheus/client_golang/prometheus/wrap.go:128 +0x6b

Manually running smartctl:

/ # smartctl -a /dev/sda
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-4.18.0-147.3.1.el8_1.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron MX500 SSDs
Device Model:     CT500MX500SSD1
Serial Number:    xxxxxxxxxxxx
LU WWN Device Id: 5 xxxxxx xxxxxx
Firmware Version: M3CR020
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Oct 29 22:16:30 2020 UTC

==> WARNING: This firmware returns bogus raw values in attribute 197

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  30) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x0031)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1744
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       231
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   079   079   000    Old_age   Always       -       316
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       11
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       34
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   069   044   000    Old_age   Always       -       31 (Min/Max 0/56)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Bogus_Current_Pend_Sect 0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   079   079   001    Old_age   Offline      -       21
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       36355139748
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       1528886862
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       3367684259

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Completed [00% left] (0-65535)
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

@kfox1111
Author

0.5.1 prints:

/ # smartctl_exporter 
[Warning] S.M.A.R.T. output reading error: exit status 4
[Info] Starting on localhost:9633/metrics

But then it works.
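
For reading the "exit status" warnings in this thread: smartctl's exit status is a bitmask, so a value such as 4 or 64 names one or more set bits rather than a single error code. The sketch below decodes it using bit meanings paraphrased from the smartctl(8) man page. `exitBits` and `decodeExitStatus` are hypothetical names, not part of the exporter. If this table is right, the message text the exporter prints next to each status in the logs here does not line up with the man page's bit for that value, so those messages may deserve some caution.

```go
package main

import "fmt"

// exitBits lists the meaning of each bit in smartctl's exit status,
// paraphrased from the smartctl(8) man page. Index = bit number.
var exitBits = [8]string{
	"command line did not parse",
	"device open failed, or device is in a low-power mode",
	"a SMART or other ATA command failed, or a checksum error was found",
	"SMART status check returned DISK FAILING",
	"prefail attributes are at or below threshold",
	"attributes were at or below threshold at some time in the past",
	"device error log contains records of errors",
	"device self-test log contains records of errors",
}

// decodeExitStatus returns the descriptions of all bits set in status.
func decodeExitStatus(status int) []string {
	var out []string
	for bit, msg := range exitBits {
		if status&(1<<bit) != 0 {
			out = append(out, msg)
		}
	}
	return out
}

func main() {
	// 4 and 64 are the two statuses that appear in this thread.
	for _, s := range []int{4, 64} {
		fmt.Println(s, decodeExitStatus(s))
	}
}
```

A status of 0 decodes to an empty list, and combined values such as 68 decode to both entries.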

@Sheridan self-assigned this Oct 30, 2020
@Sheridan added the bug (Something isn't working) label Oct 30, 2020
@vagifzeynalov

Hi @Sheridan!
First of all - thank you for your work! 👍

I'm hitting the same error as well:

# bin/smartctl_exporter_git -debug -verbose -config /usr/local/etc/smartctl_exporter/smartctl_exporter.yaml
[Verbose] Read options from /usr/local/etc/smartctl_exporter/smartctl_exporter.yaml

[Debug] Parsed options: {{[192.168.xxx.xxx]:9633 /metrics %!s(bool=false) /usr/local/sbin/smartctl 60s 1m0s [/dev/da0 /dev/da1 /dev/da2 /dev/da3 /dev/da4 /dev/da5 /dev/da6 /dev/da7 /dev/da8 /dev/da9 /dev/da10 /dev/da11 /dev/da12 /dev/da13 /dev/da14 /dev/da15]}}
[Debug] Collecting S.M.A.R.T. counters, device: /dev/da0
[Verbose] Collecting metrics from /dev/da0: Seagate IronWolf, ST3000VN007-2E4166
[Debug] Collecting S.M.A.R.T. counters, device: /dev/da1
[Verbose] Collecting metrics from /dev/da1: Seagate IronWolf, ST3000VN007-2E4166
[Debug] Collecting S.M.A.R.T. counters, device: /dev/da2
[Warning] S.M.A.R.T. output reading error: exit status 64
[Error] Device open failed, device did not return an IDENTIFY DEVICE structure, or device is in a low-power mode
[Error] smartctl returned bad data for device /dev/da2
[Debug] Collecting S.M.A.R.T. counters, device: /dev/da3
[Verbose] Collecting metrics from /dev/da3: Seagate IronWolf, ST3000VN007-2E4166
[Debug] Collecting S.M.A.R.T. counters, device: /dev/da4
[Verbose] Collecting metrics from /dev/da4: Seagate IronWolf, ST3000VN007-2E4166
[Debug] Collecting S.M.A.R.T. counters, device: /dev/da5
[Warning] S.M.A.R.T. output reading error: exit status 64
[Error] Device open failed, device did not return an IDENTIFY DEVICE structure, or device is in a low-power mode
[Error] smartctl returned bad data for device /dev/da5
[Debug] Collecting S.M.A.R.T. counters, device: /dev/da6
[Verbose] Collecting metrics from /dev/da6: Seagate IronWolf, ST3000VN007-2E4166
[Debug] Collecting S.M.A.R.T. counters, device: /dev/da7
[Verbose] Collecting metrics from /dev/da7: Seagate IronWolf, ST3000VN007-2E4166
[Debug] Collecting S.M.A.R.T. counters, device: /dev/da8
[Warning] S.M.A.R.T. output reading error: exit status 4
[Warning] SMART status check returned 'DISK OK' but we found that some (usage or prefail) Attributes have been <= threshold at some time in the past.
[Verbose] Collecting metrics from /dev/da8: Seagate Barracuda 7200.14 (AF), ST3000DM001-1ER166
[Debug] Collecting S.M.A.R.T. counters, device: /dev/da9
[Warning] S.M.A.R.T. output reading error: exit status 4
[Warning] SMART status check returned 'DISK OK' but we found that some (usage or prefail) Attributes have been <= threshold at some time in the past.
[Verbose] Collecting metrics from /dev/da9: Seagate Barracuda 7200.14 (AF), ST3000DM001-1ER166
[Debug] Collecting S.M.A.R.T. counters, device: /dev/da10
[Warning] S.M.A.R.T. output reading error: exit status 4
[Warning] SMART status check returned 'DISK OK' but we found that some (usage or prefail) Attributes have been <= threshold at some time in the past.
[Verbose] Collecting metrics from /dev/da10: Seagate Barracuda 3.5, ST3000DM008-2DM166
[Debug] Collecting S.M.A.R.T. counters, device: /dev/da11
[Verbose] Collecting metrics from /dev/da11: Seagate IronWolf Pro, ST10000NE0004-2GT11L
[Debug] Collecting S.M.A.R.T. counters, device: /dev/da12
[Verbose] Collecting metrics from /dev/da12: Samsung based SSDs, Samsung SSD 850 PRO 256GB
[Debug] Collecting S.M.A.R.T. counters, device: /dev/da13
[Verbose] Collecting metrics from /dev/da13: Samsung based SSDs, Samsung SSD 860 PRO 256GB
[Debug] Collecting S.M.A.R.T. counters, device: /dev/da14
[Verbose] Collecting metrics from /dev/da14: Samsung based SSDs, Samsung SSD 850 PRO 256GB
[Debug] Collecting S.M.A.R.T. counters, device: /dev/da15
[Verbose] Collecting metrics from /dev/da15: Seagate IronWolf Pro, ST10000NE0004-2GT11L
[Info] Starting on [192.168.xxx.xxx]:9633/metrics
[Error] Too early collect called for device /dev/da0
[Error] Too early collect called for device /dev/da1
[Debug] Collecting S.M.A.R.T. counters, device: /dev/da2
[Warning] S.M.A.R.T. output reading error: exit status 64
[Error] Device open failed, device did not return an IDENTIFY DEVICE structure, or device is in a low-power mode
[Error] smartctl returned bad data for device /dev/da2
[Error] Too early collect called for device /dev/da3
[Error] Too early collect called for device /dev/da4
[Debug] Collecting S.M.A.R.T. counters, device: /dev/da5
[Warning] S.M.A.R.T. output reading error: exit status 64
[Error] Device open failed, device did not return an IDENTIFY DEVICE structure, or device is in a low-power mode
[Error] smartctl returned bad data for device /dev/da5
[Error] Too early collect called for device /dev/da6
[Error] Too early collect called for device /dev/da7
[Error] Too early collect called for device /dev/da8
[Error] Too early collect called for device /dev/da9
[Error] Too early collect called for device /dev/da10
[Error] Too early collect called for device /dev/da11
[Error] Too early collect called for device /dev/da12
[Error] Too early collect called for device /dev/da13
[Error] Too early collect called for device /dev/da14
[Error] Too early collect called for device /dev/da15
panic: runtime error: index out of range [0] with length 0

goroutine 29 [running]:
main.(*SMARTctlInfo).mineVersion(0xc000157ec0)
	/usr/local/etc/smartctl_exporter/smartctl_exporter_git/smartctlinfo.go:46 +0xb65
main.(*SMARTctlInfo).Collect(...)
	/usr/local/etc/smartctl_exporter/smartctl_exporter_git/smartctlinfo.go:35
main.SMARTctlManagerCollector.Collect(0xc00035c000)
	/usr/local/etc/smartctl_exporter/smartctl_exporter_git/main.go:37 +0x1da
github.com/prometheus/client_golang/prometheus.(*wrappingCollector).Collect.func1(0xc0001ba210, 0xc00035c000)
	/vendor/src/github.com/prometheus/client_golang/prometheus/wrap.go:127 +0x3d
created by github.com/prometheus/client_golang/prometheus.(*wrappingCollector).Collect
	/vendor/src/github.com/prometheus/client_golang/prometheus/wrap.go:126 +0x6b

Please let me know if you need more details.

Regards,
Vagif
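
The panic in the trace above (`index out of range [0] with length 0`, raised in `mineVersion`) is what Go reports when code reads element 0 of an empty slice. The thread does not show that function's source, so the following is only a minimal sketch of the general defensive pattern, assuming the code indexes into a slice parsed from smartctl's output. `formatVersion` is a hypothetical helper, not the exporter's code.

```go
package main

import (
	"errors"
	"fmt"
)

// formatVersion is a hypothetical helper, not the exporter's actual code.
// It joins a two-element version such as [7 1] into "7.1" and returns an
// error instead of panicking when the slice is shorter than expected.
func formatVersion(parts []int64) (string, error) {
	if len(parts) < 2 {
		return "", errors.New("version has fewer than two elements")
	}
	return fmt.Sprintf("%d.%d", parts[0], parts[1]), nil
}

func main() {
	// A nil slice stands in for data that was never collected.
	if _, err := formatVersion(nil); err != nil {
		fmt.Println("skipping version metric:", err)
	}
	v, _ := formatVersion([]int64{7, 1})
	fmt.Println(v)
}
```

Returning an error lets the caller skip one metric and keep serving the rest, rather than taking down the whole scrape.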

@kfox1111
Author

Any update on this?

@antifuchs
Contributor

I've run into this too. It appears that if you wait out the configured minimum collection interval, smartctl_exporter does serve metrics on the HTTP endpoint.

But as soon as you read before that interval (effectively a rate limit) has passed, it panics, presumably because this line https://github.com/Sheridan/smartctl_exporter/blob/e27581d56ad80340fb076d3ce22cef337ed76679/readjson.go#L94 or the logic in https://github.com/Sheridan/smartctl_exporter/blob/e27581d56ad80340fb076d3ce22cef337ed76679/readjson.go#L79-L86 is wrong. Clearly, the first time you poll metrics, the tool should collect data, but it doesn't.
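
A rough sketch of the behavior described here: collect on the first poll, then serve cached data until the minimum interval has passed. This is not the code in readjson.go and not the change made in #18. `cachedReader`, `newDemoReader`, `at`, and the fake `fetch` function are all hypothetical, and the 60 s interval is only the default mentioned in this thread.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// cachedReader is a hypothetical stand-in for a per-device reader.
type cachedReader struct {
	mu       sync.Mutex
	interval time.Duration            // minimum time between real reads
	last     time.Time                // zero value means "never collected"
	data     []string                 // last successful result
	fetch    func() ([]string, error) // assumed: runs smartctl and parses it
}

// Get serves cached data while it is fresh and fetches otherwise.
// The first call always fetches, because last is still the zero time.
func (r *cachedReader) Get(now time.Time) ([]string, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.last.IsZero() && now.Sub(r.last) < r.interval {
		return r.data, nil // too early for a new read: serve the cache
	}
	data, err := r.fetch()
	if err != nil {
		return nil, err
	}
	if len(data) == 0 {
		return nil, errors.New("collector returned no data")
	}
	r.data, r.last = data, now
	return r.data, nil
}

// newDemoReader builds a reader with a 60 s interval and a fake fetch
// function that counts how often it is called.
func newDemoReader() (*cachedReader, *int) {
	calls := 0
	r := &cachedReader{
		interval: 60 * time.Second,
		fetch: func() ([]string, error) {
			calls++
			return []string{"smartctl 7.1"}, nil
		},
	}
	return r, &calls
}

// at returns the instant sec seconds after the Unix epoch (demo only).
func at(sec int64) time.Time { return time.Unix(sec, 0) }

func main() {
	r, calls := newDemoReader()
	// Poll at 0 s, 30 s (inside the interval), and 61 s (after it).
	for _, sec := range []int64{0, 30, 61} {
		data, err := r.Get(at(sec))
		fmt.Println(sec, data, err, *calls)
	}
}
```

Passing `now` in as a parameter keeps the timing logic testable without sleeping. The key point is that an early poll never sees an empty result: it gets either the cache or an error.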

@kfox1111
Author

What's the default minimum collection interval? Would keeping the pod out of the ready state in Kubernetes until then work as a workaround? I'm not sure Prometheus honors a pod's readiness, though.

@antifuchs
Contributor

The default is 60s. I do think blocking reads would be a workaround, but a pretty brittle one. I'll see if I can fix this issue in a PR.

@kfox1111
Author

Cool. Thanks.

@antifuchs
Contributor

Done: #18 - hope this works for you too (:

@jangrewe

jangrewe commented Nov 4, 2021

@Sheridan: Any chance we can get a new release with this fix included? I'd also like to have the additional NVMe metrics ;-)

@mweinelt

Maybe development could continue over at https://github.com/azrdev/smartctl_exporter, which just forked and merged a few of the outstanding patches.

@k0ste
Contributor

k0ste commented Feb 20, 2022

@mweinelt, I think it's better to make some noise here, to get more "people with merge button" for this repo.
