Intel ssd wearout not reported when almost dead #86

Open
pschonmann opened this issue Nov 16, 2022 · 9 comments

@pschonmann

Similar to #73: the disk is failing now but is not reported as critical.
The SMART output is:

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1668
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2
170 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
174 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       2
175 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       2617 (2 65535)
183 SATA_Downshift_Count    0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error_Count  0x0033   100   100   090    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Drive_Temperature       0x0022   071   063   000    Old_age   Always       -       29 (Min/Max 19/38)
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       2
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       29
197 Pending_Sector_Count    0x0012   100   100   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       7005511
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       8396
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       1
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       100130
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   092   092   000    Old_age   Always       -       0
234 Thermal_Throttle_Status 0x0032   100   100   000    Old_age   Always       -       0/0
235 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       2617 (2 65535)
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       7005511
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       71050
243 NAND_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       8255662

The disk info

=== START OF INFORMATION SECTION ===
Model Family:     Intel S4510/S4610/S4500/S4600 Series SSDs
Device Model:     INTEL SSDSC2KB240G8
Serial Number:    :)
LU WWN Device Id: 5 5cd2e4 151dfac3f
Firmware Version: XCV10110
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Nov 16 14:54:42 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Output of the plugin

./check_smart.pl -l -i auto -g '/dev/sd*[a-z]'
OK: [/dev/sda] - Device is clean --- [/dev/sdb] - Device is clean|
./check_smart.pl -v
check_smart.pl v6.13.0

@Napsty
Owner

Napsty commented Nov 16, 2022

How do you see that the drive is failing now? Any indicators, failures, logs, etc?

As you correctly mentioned, this is the same problem as the linked issue #73. check_smart currently can only read and interpret the "raw values". In this case, the plugin would need to read the "normalized values" which can either be an increasing or decreasing counter (this makes it even more tricky):

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
[...]
233 Media_Wearout_Indicator 0x0032   092   092   000    Old_age   Always       -       0
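
To illustrate the difference between the two columns, here is a minimal Python sketch that splits one `smartctl -A` attribute line into its normalized `VALUE` and its `RAW_VALUE`. It is not taken from check_smart.pl (which is written in Perl); the `SmartAttribute` type and `parse_attribute_line` function are hypothetical names used only for illustration, and the sketch assumes the ten-column layout shown in the table above.

```python
from typing import NamedTuple


class SmartAttribute(NamedTuple):
    """One row of the smartctl attribute table (hypothetical helper type)."""
    attr_id: int
    name: str
    value: int   # normalized VALUE column
    worst: int   # normalized WORST column
    thresh: int  # THRESH column
    raw: str     # RAW_VALUE column, kept as text because it can hold extra fields


def parse_attribute_line(line: str) -> SmartAttribute:
    # Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    # Split at most 9 times so a raw value such as "29 (Min/Max 19/38)" stays whole.
    fields = line.split(None, 9)
    if len(fields) < 10:
        raise ValueError("not a SMART attribute line: %r" % line)
    return SmartAttribute(
        attr_id=int(fields[0]),
        name=fields[1],
        value=int(fields[3]),
        worst=int(fields[4]),
        thresh=int(fields[5]),
        raw=fields[9].strip(),
    )


line = "233 Media_Wearout_Indicator 0x0032   092   092   000    Old_age   Always       -       0"
attr = parse_attribute_line(line)
print(attr.name, "normalized:", attr.value, "raw:", attr.raw)
```

For this attribute the raw value is 0 while the normalized value is 92, so a check that only looks at the raw column has nothing to alert on.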

@pschonmann
Author

Both disks are the same model in RAID 1, both are at 1% lifetime, and the system is very slow. Writes are at about 40M and the load average is about 80 on a 6-core machine (waiting for IOPS).
After the disks were replaced, everything works fine.

@Napsty
Owner

Napsty commented Nov 16, 2022

Where do you see 1% lifetime in the SMART table?

@pschonmann
Author

Sorry, I posted the wrong SMART output.
These are the bad values:
SDA - 233 Media_Wearout_Indicator 0x0032 001 001 000 Old_age Always - 0
SDB - 233 Media_Wearout_Indicator 0x0032 001 001 000 Old_age Always - 0

@Napsty
Owner

Napsty commented Nov 16, 2022

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
233 Media_Wearout_Indicator 0x0032 001 001 000 Old_age Always - 0

So value 001 means 1% remaining? Is this one the replacement drive and has 92% remaining?

233 Media_Wearout_Indicator 0x0032   092   092   000    Old_age   Always       -       0

@pschonmann
Author

pschonmann commented Nov 16, 2022

Yes, the attribute
233 Media_Wearout_Indicator 0x0032 001 001 000 Old_age Always - 0
is from the failing disk, and this one is from the replacement disk (same model):
233 Media_Wearout_Indicator 0x0032 092 092 000 Old_age Always - 0

The number decreases from 100; it is the percentage remaining.
More info:
https://serverfault.com/questions/641558/media-wearout-indicator-at-043-reason-to-be-worried

@Napsty
Owner

Napsty commented Nov 16, 2022

As the raw value remains 0, this is kinda tricky and cannot be easily integrated into the existing (raw) checks. We would have to add a new check with its own option (e.g. --ssd-wearout) which looks up the normalized value.
I don't see myself having time in the next weeks though. Code contributions are welcome :D
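
A rough sketch of what such a check could look like, written in Python for illustration only (the plugin itself is Perl). The `--ssd-wearout` option does not exist yet; the function name `check_wearout` and the default thresholds of 20 and 10 are assumptions, not values from the plugin. The sketch assumes the normalized value counts down from 100 and uses the usual Nagios plugin exit codes.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_wearout(device, normalized, warn=20, crit=10):
    """Map a normalized wear-out value (percent remaining) to a status.

    `warn` and `crit` are hypothetical defaults chosen for this example.
    """
    if normalized <= crit:
        return CRITICAL, "CRITICAL: [%s] - wear-out at %d%% remaining" % (device, normalized)
    if normalized <= warn:
        return WARNING, "WARNING: [%s] - wear-out at %d%% remaining" % (device, normalized)
    return OK, "OK: [%s] - wear-out at %d%% remaining" % (device, normalized)


# Normalized values reported in this thread: 001 (worn disk) and 092 (replacement).
for dev, value in (("/dev/sda", 1), ("/dev/sdb", 92)):
    code, message = check_wearout(dev, value)
    print(code, message)
```

Keeping this as a separate check avoids touching the existing raw-value logic, since the raw value of attribute 233 stays at 0 on these drives.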

@pschonmann
Author

I'm absolutely fine with that. When it happens, it happens.

@pschonmann
Author

I scanned all our servers; these are the attributes that can report the wear level as a percentage:

177 Wear_Leveling_Count
233 Media_Wearout_Indicator
231 SSD_Life_Left
202 Percent_Lifetime_Remain
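
A minimal Python sketch of how the four attribute IDs above could be picked out of `smartctl -A` output. The function name `find_wear_levels` is hypothetical. It returns the normalized `VALUE` column for each matching ID; whether that value means "percent remaining" on every vendor's drives is an assumption, since this thread only confirms it for attribute 233 on the Intel model above.

```python
from typing import Dict

# Attribute IDs listed above as candidate wear-level indicators.
WEAR_ATTRIBUTE_IDS = {177, 202, 231, 233}


def find_wear_levels(smartctl_output: str) -> Dict[int, int]:
    """Return {attribute ID: normalized VALUE} for the wear-related attributes."""
    levels = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        if attr_id in WEAR_ATTRIBUTE_IDS:
            levels[attr_id] = int(fields[3])  # VALUE column
    return levels


sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1668
233 Media_Wearout_Indicator 0x0032   001   001   000    Old_age   Always       -       0
"""
print(find_wear_levels(sample))
```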
