Emphasis on "inexpensive" (geekery)
Posted at 8:22 AM
So, last night one of the hard drives in my RAID died. That wouldn't be so bad except that shortly after, another one died too. They were ones that I'd installed at the same time, from the same production batch. Western Digital WD20EARS. So of course this invalidates my entire RAID.
I'm making a mental inventory of what was on there that I care about. Fortunately I have my offsite backup, and also most of the stuff on there wasn't irreplaceable (everything that was I had multiple redundant backups of), and even if I did lose it all, it's nothing I have any major sentimental attachment to.
Right now my office is so quiet my ears are ringing.
Comments
I'm up to 4 drive failures so far on my stack-o-WD20EARS, but the failures have all been non-catastrophic. Usually just a streak of sectors goes bad (and is corrected automatically by md), and then smartctl -t long reveals enough evidence to let me justifiably RMA the drive. The worst failure I had was a drive that went narcoleptic and dropped out of the array, but later woke up (with out-of-date stripe data; again, md handled it). Two of my failures happened back-to-back in March, and one of the March ones had a very close serial number to a failure I had in August, so there could be a bad batch going around.
Are your failed drives image-able, or are they totally hosed to the point where they won't spin up or identify? Are you sure it's not a controller issue? Are you sure it's not just md refusing to mount the array because it's dirty/degraded?
I've heard so many problems with the WD20EARS that the replacements I ordered are Seagate Barracudas, which cost the same amount. (I will of course RMA the WD20EARS and use the replacements as occasional-use external drives.)
Wow, that's specific.
Can you ssh in and see what the deal is? mdadm --detail /dev/md0, dmesg | grep md, smartctl -a /dev/sdwhatever
fluffy:
The phrase "out of the frying pan and into the fire" comes to mind. Barracudas are notorious for being specced at one unrecoverable read error per 1e14 bits and, on some recent models, for known firmware bugs.
Basically every drive sucks nowadays and you should just expect them to fail and prepare accordingly. Though, 4 drives on a RAID5 should have been fine as long as you minimize the window of opportunity for double failures. My gut feeling tells me this is the Synology being stupid rather than the drives actually dying.
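To put a rough number on that 1e14 figure, here is a back-of-envelope sketch. It assumes the spec-sheet rate of one unrecoverable read error per 1e14 bits, independent errors, and a rebuild that has to read all three surviving 2 TB drives end to end. Those are simplifying assumptions (real drives often beat the spec), and `p_read_error` is just an illustrative helper name.

```python
import math

def p_read_error(bytes_read: float, ure_per_bit: float = 1e-14) -> float:
    """Chance of at least one unrecoverable read error while reading bytes_read."""
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, written with log1p/expm1 so the tiny p doesn't round away
    return -math.expm1(bits * math.log1p(-ure_per_bit))

surviving_drives = 3   # assumption: 4-drive RAID5 with one member gone
drive_bytes = 2e12     # assumption: 2 TB per drive, decimal units
p = p_read_error(surviving_drives * drive_bytes)
print(f"{p:.0%}")      # roughly 38% under these assumptions
```

That is the pessimistic spec-sheet view, but it is why minimizing the rebuild window matters.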
/dev/md2:
Version : 1.01
Creation Time : Tue Jun 29 19:00:25 2010
Raid Level : raid5
Array Size : 5846342016 (5575.51 GiB 5986.65 GB)
Used Dev Size : 1948780672 (1858.50 GiB 1995.55 GB)
Raid Devices : 4
Total Devices : 3
Persistence : Superblock is persistent
Update Time : Wed Apr 20 09:44:02 2011
State : clean, degraded
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
Name : ds410j:2 (local to host ds410j)
UUID : 0c915330:873669c9:d4bd0c66:62a08972
Events : 5778
Number   Major   Minor   RaidDevice   State
   0       8       5         0        active sync   /dev/sda5
   1       8      21         1        active sync   /dev/sdb5
   2       0       0         2        removed
   4       8      53         3        active sync   /dev/sdd5
smartctl is saying the drives "don't support SMART," which is weird because I know I've done SMART tests on them before.
Right now it looks like when drive 3 crashed it restriped the disk to be a RAID5 with only 3 volumes, and then when drive 4 crashed a few hours later it mounted the filesystem read-only. So all my files are accessible, at least. The DS410j assures me that if I just replace disk 4 ASAP everything will be fine, so for now I'm just going to shut it down until I get a replacement disk.
"Clean, degraded" means that it should be mounting okay, albeit in a critical state. You might need to issue an extra command to make it start the array if it's being extra-paranoid. I forget the specifics there.
As for smartctl, make sure you're doing it on the actual drives (/dev/sda) and not the partitions (/dev/sda5).
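The fields that matter in the mdadm output above can also be checked mechanically. This is a minimal sketch, assuming the text keeps the `Key : Value` shape shown in the thread; `parse_detail` is a made-up helper, not part of mdadm, and the sample is a trimmed copy of the posted output.

```python
def parse_detail(text: str) -> dict:
    """Collect 'Key : Value' header lines from mdadm --detail style text."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:  # device-table rows have no ' : ' and are skipped
            fields[key.strip()] = value.strip()
    return fields

sample = """\
Raid Level : raid5
Array Size : 5846342016 (5575.51 GiB 5986.65 GB)
Used Dev Size : 1948780672 (1858.50 GiB 1995.55 GB)
Raid Devices : 4
Active Devices : 3
State : clean, degraded
"""

f = parse_detail(sample)
missing = int(f["Raid Devices"]) - int(f["Active Devices"])
degraded = "degraded" in f["State"]
# RAID5 usable size is (members - 1) * per-member size
expected_size = (int(f["Raid Devices"]) - 1) * int(f["Used Dev Size"].split()[0])
size_matches = expected_size == int(f["Array Size"].split()[0])
print(missing, degraded, size_matches)
```

One missing member is the most a RAID5 can tolerate, which fits the array still being readable. The size check also suggests the array is still laid out as a four-member RAID5 rather than a three-member one.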
The Synology is probably doing something crazy in the background. It's set to "hybrid RAID" mode which quite specifically restripes things as drive availability changes (i.e. you can start out with three 1TB drives and end with four 2TB drives and if you only upgrade one drive at a time it will theoretically work out in the end).
I actually don't know when drive 3 failed now. What started me down this path was logging in to reboot the thing, since I was having transitory Time Machine failures again (which usually get fixed by a reboot). That's when I noticed that drive 3 had disappeared from the volume, and discovered that I'd forgotten to update the device monitoring alarm's SMTP settings when they changed a few months ago. So who knows what happened, or when.
Most of that 700GB is crap I definitely don't care about, though. As long as my user directory finishes copying overnight I'll be happy, and that's only on the order of 300GB, so it should take... well, still around 12-16 hours. But still.
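For scale, 300GB in 12-16 hours implies a fairly slow average copy rate. A quick sketch of the arithmetic, assuming decimal units (1 GB = 1000 MB); `implied_mb_per_s` is only an illustrative helper.

```python
def implied_mb_per_s(gigabytes: float, hours: float) -> float:
    """Average throughput in MB/s needed to move `gigabytes` in `hours`."""
    return gigabytes * 1000 / (hours * 3600)

for hours in (12, 16):
    # 12 h works out to about 6.9 MB/s, 16 h to about 5.2 MB/s
    print(hours, round(implied_mb_per_s(300, hours), 1))
```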
When I've gotten the RMA replacement drives, at least one of them is going onto my Mac mini, which I'm going to use as a local CrashPlan store. More redundancy and more recovery options are always a good thing.
Try:
Not that it really matters now, since I have the RAID rebuilt. All I lost in the end was my TM archive, which is of course trivial to recover, sans version control anyway.
#1 The freebie USB code that they use was known not to have the best performance for external drives. Maybe it's been improved, but I haven't followed that.
#2 When I buy drives for my Synology, I buy 5 or 6. 4 for the RAID. 1 for local external USB backup. 1 if I am flush. That way - if or when one dies in the RAID, I have another from that production batch to throw in right away.
So the moral is: for a 4-drive RAID, buy 5. For a 5-drive RAID, buy 6 or more.
tc