Hey everyone,
I recently built my first NAS. It was bought used, with SAS hardware. I’ve finally got past all the roadblocks and problems that were in my way (I basically bricked a whole SAS drive; a hero of a Lemmy user helped me fix it).
Now, after filling the 15 TB RAIDZ2 pool with around 100 GB of data, one of the drives started waving its white flag and wants to die on me.
I am a complete beginner with no experience with these things.
Is my drive dying and does it need to be replaced, or can it be fixed?
This is the output showing the 507 errors TrueNAS received from the drive; it labelled the vDev as degraded and the drive as faulted:
Output of `zpool status` and `sudo smartctl -a /dev/sdd`
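(For anyone who wants to pull the same info on their own box, these were the commands; the device path will differ on your system:)

```
# Pool health plus per-device read/write/checksum error counters
zpool status

# Full SMART/health report for the suspect drive
sudo smartctl -a /dev/sdd
```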
To my beginner eyes this drive looks cooked. Please let me know if it needs replacing so I can order a new one and swap it in right away.
Thank you sooo much!
Edit: SAS not SATA drives
I would replace it. Sometimes I push my luck, and for minor or unexpected errors I just clear the error and re-add the drive, but this many errors is a pretty solid sign the drive is failing.
Not necessarily. I would shut the system down completely and check the drive connectors. If it’s on a backplane, try swapping slots, or if it’s on a breakout cable, swap the connector with another drive (and clear the zpool errors). If the errors start happening on the other drive, it’s a cable problem. If they continue on the same drive, it’s a drive problem. If they stop happening altogether, it was a bad connection and it ought to be fine now.
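Something like this, assuming a pool named `tank` (use your actual pool name):

```
# Clear the logged errors so you start from a clean slate
sudo zpool clear tank

# ...power down, reseat/swap, boot back up, then watch whether the counters climb again
sudo zpool status -v tank

# A scrub touches every disk and will surface new errors quickly
sudo zpool scrub tank
```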
That’s kind of a short output from `smartctl -a`, though. Shouldn’t it include the attribute data? I’d run a SMART test (after doing the swap above) and see what it says.
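E.g., assuming the drive still shows up as `/dev/sdd`:

```
# Start a long (extended) self-test in the background
sudo smartctl -t long /dev/sdd

# Check progress and the result once it finishes (this can take several hours)
sudo smartctl -a /dev/sdd
```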
On a raidz2, I wouldn’t be too concerned about losing a drive, but you should always be prepared to order a replacement if you value your data.
I second this. SATA cables are cheaply made and can present issues that seem to indicate drive failure.
Had this issue once: two drives kept failing to initialize during boot. Rebooting a few times got them to register, but they showed drive errors, so I thought either the drives or my SAS card was dying. Fully reseating the connectors fixed it, and I haven’t had an issue since.
OP is using SAS, but it’s not too far from SATA.
Good catch. I don’t usually see SAS as `/dev/sd*` so I assumed. Almost the same cables, though *usually* better made.
ZFS has very reasonable thresholds for how much has to go wrong before it kicks a disk from the pool. I’ve seen pool members have a batch of bad blocks and ZFS still chugged along for a few years, just avoiding those blocks, before the disk finally failed.
Heed TrueNAS here: replace the disk if you can.
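If it does come to replacing it, the rough shape of the operation looks like this (pool and device names are placeholders, and TrueNAS can also do this from the web UI):

```
# Old device first, new device second; ZFS starts resilvering automatically
sudo zpool replace tank /dev/sdd /dev/sde

# Watch the resilver progress
sudo zpool status -v tank
```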
Does this drive not have SMART values? Could be a loose cable for all we know (UDMA CRC errors).
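On a SATA disk you’d check the UDMA CRC counter in the SMART attributes; these are SAS drives, so the closest equivalents are the error counter log and the grown defect list. Assuming the drive is `/dev/sdd`:

```
# -x dumps everything smartctl knows, including the SAS error counters
sudo smartctl -x /dev/sdd

# Rough reading of the SAS output:
#   "Elements in grown defect list"          -> remapped sectors, points at the drive
#   "total uncorrected errors"               -> also points at the drive
#   phy error counters (invalid DWORDs, loss of sync) -> more likely cabling/backplane
```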
My rule for older hardware: before trusting the ZFS fault reporting, I follow these steps.
(Note these are homelabber steps and not what I would do in the enterprise, where risk and time are a lot more expensive than replacing hardware.)
- Check the SMART data of the drive. If it reports the drive as faulty, replace it.
- `zpool clear` the errors and see if they come back. Sometimes drive errors are not caused by the drive itself.
- Reseat the drive and the cables between the motherboard and the drive, and clear the errors again afterwards. Especially with older hardware that has travelled from its previous owner to you, something might not be seated properly.
- Move the drive to another drive bay, or swap it with another drive (see the sketch after this list for confirming which physical disk is which before pulling anything). If the errors move with the drive, the drive is faulty. If the errors stay with the bay, you probably have a good drive but a faulty drive bay/cable.
- In addition to the other advice: with old hardware, keep 1 cold spare drive for every 2 years of remaining life you expect from it. It gets harder to find similar-spec drives the older they get, so if you want to use the drives in this NAS for another 4 years, get 2 spares. Beyond that you start getting into the territory of replacing the drives anyway.
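A quick sketch for the bay-swap step: confirm which physical disk `/dev/sdd` actually is before pulling anything (device names here are just examples):

```
# Map kernel names to model/serial/WWN so you can match the label on the drive
lsblk -o NAME,MODEL,SERIAL,WWN

# Persistent names; by-path also hints at which HBA port/slot the disk hangs off
ls -l /dev/disk/by-id/ | grep sdd
ls -l /dev/disk/by-path/ | grep sdd

# Serial number straight from the drive
sudo smartctl -i /dev/sdd | grep -i serial
```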
Just RMA it now. If it has SMART failures, you can provide the codes and they’ll replace it no problem.
Buy two replacements