
April 19, 2011

Emphasis on "inexpensive"

by fluffy at 8:22 AM

So, last night one of the hard drives in my RAID died. That wouldn't be so bad except that shortly after, another one died too. They were ones that I'd installed at the same time, from the same production batch. Western Digital WD20EARS. So of course this invalidates my entire RAID.

I'm making a mental inventory of what was on there that I care about. Fortunately I have my offsite backup, and also most of the stuff on there wasn't irreplaceable (everything that was I had multiple redundant backups of), and even if I did lose it all, it's nothing I have any major sentimental attachment to.

Right now my office is so quiet my ears are ringing.

Comments

#13891 04/20/2011 01:07 am
Holy crap. x_x Sorry to hear about this.

I'm up to 4 drive failures so far on my stack-o-WD20EARS, but the failures have all been non-catastrophic. Usually just a streak of sectors goes bad (and is corrected automatically by md), and then smartctl -t long reveals enough evidence to let me justifiably RMA the drive. The worst failure I had was a drive that went narcoleptic and dropped out of the array, but later woke up (with out-of-date stripe data; again, md handled it). Two of my failures happened back-to-back in March, and one of the March ones had a very close serial number to a failure I had in August, so there could be a bad batch going around.

Are your failed drives image-able, or are they totally hosed to the point where they won't spin up or identify? Are you sure it's not a controller issue? Are you sure it's not just md refusing to mount the array because it's dirty/degraded?
#13892 04/20/2011 07:17 am
I don't know the exact nature of the failures. The Synology just says the drives "crashed" but who knows what that actually means. I just have it turned off for now in the hopes that when a replacement drive arrives I can boot the array and transfer as much as I can to a replacement.

I've heard about so many problems with the WD20EARS that the replacements I ordered are Seagate Barracudas, which cost the same amount. (I will of course RMA the WD20EARS and use the replacements as occasional-use external drives.)
#13893 04/20/2011 08:36 am
Oh, and from now on I think I'll configure my RAID for RAID1+0 and make sure that the drives in each of the RAID1 groups are as different as possible. I was only using 2TB of my 6TB total and most of that was just my Time Machine backups anyway.
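
To put rough numbers on that trade-off, here is a minimal sketch of usable capacity for the two layouts. It assumes four equal 2TB drives (matching the 6TB RAID5 total mentioned above) and ignores filesystem and Synology overhead; the function name `usable_tb` is made up for illustration.

```python
def usable_tb(level: str, drives: int, drive_tb: float) -> float:
    """Usable capacity in TB for equal-sized drives (overhead ignored)."""
    if level == "raid5":
        if drives < 3:
            raise ValueError("RAID5 needs at least 3 drives")
        # One drive's worth of space goes to parity.
        return (drives - 1) * drive_tb
    if level == "raid10":
        if drives < 4 or drives % 2:
            raise ValueError("RAID1+0 needs an even number of drives, 4 or more")
        # Every drive is mirrored, so half the raw space is usable.
        return drives / 2 * drive_tb
    raise ValueError(f"unknown level: {level}")


for level in ("raid5", "raid10"):
    print(level, usable_tb(level, 4, 2.0), "TB usable")
```

With these assumptions RAID1+0 gives up 2TB of space compared to RAID5, which still leaves room for the 2TB actually in use. In exchange, the array survives a second failure as long as it lands in the other mirror pair.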
#13894 04/20/2011 09:07 am
fluffy:
The Synology just says the drives "crashed"

Wow, that's specific.

Can you ssh in and see what the deal is? mdadm --detail /dev/md0, dmesg | grep md, smartctl -a /dev/sdwhatever

fluffy:
I've heard so many problems with the WD20EARS that the replacements I ordered are Seagate Barracudas

The phrase "out of the frying pan and into the fire" comes to mind. Barracudas are notorious for unrecoverable read error rates of 1 per 1e14 bits and, on some recent models, known firmware bugs.

Basically every drive sucks nowadays and you should just expect them to fail and prepare accordingly. Though, 4 drives on a RAID5 should have been fine as long as you minimize the window of opportunity for double failures. My gut feeling tells me this is the Synology being stupid rather than the drives actually dying.
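
For a sense of scale on the double-failure window, here is a back-of-the-envelope sketch. It assumes the 1e14 figure means one unrecoverable read error per 1e14 bits read, that errors are independent, and that a rebuild reads all three surviving 2TB drives end to end. Real drives may do better or worse than the spec sheet, so treat the result as a rough illustration only.

```python
import math


def p_read_error(bytes_read: float, bits_per_error: float = 1e14) -> float:
    """Probability of at least one unrecoverable read error.

    Computes 1 - (1 - 1/bits_per_error) ** bits in a numerically
    stable way using log1p/expm1.
    """
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-1.0 / bits_per_error))


surviving_drives = 3   # 4-drive RAID5 with one drive gone
drive_bytes = 2e12     # 2TB drives (decimal terabytes)

p = p_read_error(surviving_drives * drive_bytes)
print(f"Chance of hitting a read error during a full rebuild: {p:.0%}")
```

Under those assumptions the chance comes out at roughly 38%, which is why a RAID5 of large drives is usually paired with a separate backup.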
#13895 04/20/2011 09:52 am
I'm not sure what to look for, but on my actual volume I get the following md details:

ds410j> mdadm --detail /dev/md2
/dev/md2:
        Version : 1.01
  Creation Time : Tue Jun 29 19:00:25 2010
     Raid Level : raid5
     Array Size : 5846342016 (5575.51 GiB 5986.65 GB)
  Used Dev Size : 1948780672 (1858.50 GiB 1995.55 GB)
   Raid Devices : 4
  Total Devices : 3
    Persistence : Superblock is persistent

    Update Time : Wed Apr 20 09:44:02 2011
          State : clean, degraded
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           Name : ds410j:2  (local to host ds410j)
           UUID : 0c915330:873669c9:d4bd0c66:62a08972
         Events : 5778

    Number   Major   Minor   RaidDevice State
       0       8        5        0      active sync   /dev/sda5
       1       8       21        1      active sync   /dev/sdb5
       2       0        0        2      removed
       4       8       53        3      active sync   /dev/sdd5


smartctl is saying the drives "don't support SMART," which is weird because I know I've done SMART tests on them before.

Right now it looks like when drive 3 crashed it restriped the disk to be a RAID5 with only 3 volumes, and then when drive 4 crashed a few hours later it mounted the filesystem read-only. So all my files are accessible, at least. The DS410j assures me that if I just replace disk 4 ASAP everything will be fine, so for now I'm just going to shut it down until I get a replacement disk.
#13896 04/20/2011 10:30 am
That mdadm output is describing a 4-drive RAID5 in which only 3 of the drives are attached. md, by itself, doesn't re-stripe the drives in that case, unless the Synology box is doing something crazy in the background with LVM that I don't know about.

"Clean, degraded" means that it should be mounting okay, albeit in a critical state. You might need to issue an extra command to make it start the array if it's being extra-paranoid. I forget the specifics there.

As for smartctl, make sure you're doing it on the actual drives (/dev/sda) and not the partitions (/dev/sda5).
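
The fields that matter in that mdadm output (State, Raid Devices, Active Devices) can also be pulled out by a script. The following is a minimal sketch that parses the text printed by mdadm --detail; the function names are made up for illustration, and it only handles the "Key : value" lines seen in the output above.

```python
def parse_mdadm_detail(text: str) -> dict:
    """Collect the 'Key : value' lines from mdadm --detail output."""
    fields = {}
    for line in text.splitlines():
        # Split on the first " : " so timestamps like 19:00:25 stay intact.
        key, sep, value = line.partition(" : ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def summarize(fields: dict) -> dict:
    """Report the array states and how many member devices are missing."""
    states = [s.strip() for s in fields.get("State", "").split(",")]
    missing = int(fields["Raid Devices"]) - int(fields["Active Devices"])
    return {"states": states, "missing": missing, "degraded": "degraded" in states}


# Excerpt of the output posted above.
SAMPLE = """\
/dev/md2:
     Raid Level : raid5
   Raid Devices : 4
  Creation Time : Tue Jun 29 19:00:25 2010
          State : clean, degraded
 Active Devices : 3
"""

print(summarize(parse_mdadm_detail(SAMPLE)))
```

For the array above this reports one missing device and a degraded state, which matches the reading that the array is a 4-drive RAID5 running on 3 drives.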
#13897 04/20/2011 10:41 am
I was doing it on the actual drives, yes.

The Synology is probably doing something crazy in the background. It's set to "hybrid RAID" mode which quite specifically restripes things as drive availability changes (i.e. you can start out with three 1TB drives and end with four 2TB drives and if you only upgrade one drive at a time it will theoretically work out in the end).

I actually don't know when drive 3 failed, now, because what started me down this path was logging in to reboot it since I was having transitory Time Machine failures again (which usually get fixed by a reboot) and noticing that drive 3 had disappeared from the volume, and discovering that I'd forgotten to update the device monitoring alarm SMTP settings when they changed a few months ago. So who knows when and what happened.
#13898 04/21/2011 11:51 pm
Recovery is underway, after one abortive attempt when I realized I was doing things the long way around as always. Basically I've attached one of the Barracudas as an external drive, and am formatting it and will copy all the data from the degraded RAID over. Then I will put the other Barracuda and the new WD20EARS inside, create a whole new RAID, copy the data from the Barracuda, and then RMA the two dead WD20EARSes. At some point after that I'll probably replace one of the older good-batch WD20EARSes with an RMA replacement, just to try to randomize the failure times as much as possible, since having two drives of the same production run that have been running the same amount of time makes me nervous that this whole stupid thing can happen all over again.
#13899 04/22/2011 12:41 am
So the data transfer over USB is going at about 5.5MB/sec, and at this rate it'll take about 35 hours to copy everything. Whee. (But I don't think it's USB's fault. My first attempt at a recovery copy from RAID to a new volume on disk3 was going that slow too. I suspect the slowness is due to the degraded RAID itself.)

Most of that 700GB is crap I definitely don't care about though. As long as my user directory finishes copying overnight I'll be happy, and that's only on the order of 300GB so should take... well, still around 12-16 hours, but still.
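
The arithmetic behind those estimates is simple enough to sketch. This assumes decimal units (1GB = 1000MB) and a steady 5.5MB/sec, which a degraded array may not actually sustain; the function name is made up for illustration.

```python
def hours_to_copy(gigabytes: float, mb_per_sec: float) -> float:
    """Hours needed to copy `gigabytes` at a constant `mb_per_sec`."""
    seconds = gigabytes * 1000 / mb_per_sec
    return seconds / 3600


for label, size_gb in (("everything", 700), ("user directory", 300)):
    print(f"{label}: {hours_to_copy(size_gb, 5.5):.1f} hours")
```

At that rate the full 700GB works out to a little over 35 hours and the 300GB user directory to about 15 hours, in line with the figures above.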

When I've gotten the RMA replacement drives, at least one of them is going onto my Mac mini, which I'm going to use as a local Crashplan store. More redundancy and more recovery options is always a good thing.
#13900 04/22/2011 09:36 am
Home directory finished backing up overnight, video directory underway. Cancelled the backup of the music directory since I realized I already have a much more up-to-date backup elsewhere, so that saves a lot of time too.
#13903 04/25/2011 07:58 am
fluffy:
smartctl is saying the drives "don't support SMART," which is weird because I know I've done SMART tests on them before.

Try:
smartctl -a -d ata /dev/sdx
#13905 04/25/2011 08:28 am
Ah, -d ata did the trick.

Not that it really matters now, since I have the RAID rebuilt. All I lost in the end was my TM archive, which is of course trivial to recover, sans version control anyway.
#14769 chichow (unregistered) 04/10/2012 08:02 am yes yes more synology
So from my experience with Synology:

#1 The freebie USB code that they use was known not to have the best performance for external drives. Maybe it's been improved, but I haven't followed that.

#2 When I buy drives for my Synology, I buy 5 or 6. 4 for the RAID. 1 for local external USB backup. 1 if I am flush. That way - if or when one dies in the RAID, I have another from that production batch to throw in right away.

So the moral is: for a 4-drive RAID, buy 5. For a 5-drive RAID, buy 6 or more.

tc