Storage Advisors

Surviving component failure …

Tuesday 14th April 2009 - 08:51

While it’s good to have good backups, most system admins would also like to know exactly what a component failure means for their uptime, downtime and workload. In other words, how big a job will it be to get the server back online if something falls over?

So let’s look at how the storage subsystem should behave in the event of catastrophic component failure.

The easiest item on the list is the hard drives. If one fails you should hear a screaming noise from your RAID card, you should get an email from your management software (Adaptec Storage Manager) and, in a hot-swap chassis, you should see a red light on the failed drive. Pretty simple really. Just replace the drive. Depending on settings in the BIOS, the RAID card will either rebuild the array automatically or sit there until you make the new drive a hot spare, at which point the card will rebuild the array.

We’ll come back to the subject of drive failures towards the end of this article, because different RAID types cope with drive failures differently.

Backplane failure? Pretty rare, but it should just be a case of replacing the backplane and reconnecting everything, and all will be good. The real problem here is that the RAID card is now more than a little annoyed at having had all its drives removed and will have to work out what was on the drives. This is simple if the drives all dropped off the card at once. However, if they went down in sequence then the card will, at some point, have marked the array as failed and you’ll need to talk to tech support about your options (too long to list here, but you are not totally without hope).

Card failure? Just replace the card. Adaptec store their RAID data on all the drives plus on the card itself. The first thing the new card will do is read the metadata from the drives, load the array information into the NVRAM on the card and you’re away. Note that you may be prompted to accept the finding of a new array, which confuses people. It’s not new to you, but it is new to the new card.

Motherboard or total system failure? Just replace the components and your storage will be fine. You can even take the RAID card and array (disks) to another system, plug them in and the card will still know the array and present the data … it’s up to you to work out any OS issues that this causes, but it’s generally not the end of the world.

Going back to drive failures … of course there are different RAID levels which have different redundancy capabilities. If you are nuts enough to use RAID 0 you have no insurance - one drive failure will kill everything - don’t bother ringing us, we can’t do anything for you.

RAID 1 can survive one drive failure (after all, there are only two drives in there). RAID 5 can also survive one drive failure - the real problem here is that you can sometimes be caught by a second drive failure during the rebuild after the first. If the rebuild is not complete then this counts as a two-drive failure, and you’ve had it. (Note: I’ll do a separate article later about the dangers of building too large a RAID 5 array.)

RAID 6 can survive two simultaneous drive failures, so it’s safer than RAID 5: if one drive fails, you replace it and start the rebuild, and then another drive fails before the rebuild is complete, the system will survive. You’ll be very annoyed, but you’ll still have your data. RAID 10, 50 and 60 can survive varying numbers of drive failures, but you have to be lucky with which drives fail. In general you don’t want to count on being able to survive multiple drive failures in these configs, but most of the time you can.
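
To make the "lucky which drives fail" point concrete, here is a rough Python sketch of the failure rules described above. It is a simplified model, not how any RAID card actually works: the function name `survives`, the drive numbering and the `group_size` parameter are all made up for illustration, and it assumes RAID 10 is built from mirrored pairs and RAID 50/60 from equal-sized RAID 5/6 spans of consecutive drives.

```python
from typing import Iterable

# Failures tolerated inside one group (assumed model, see lead-in):
# a RAID 10 group is a mirror pair, a RAID 50/60 group is a RAID 5/6 span.
TOLERANCE_PER_GROUP = {"0": 0, "1": 1, "5": 1, "6": 2, "10": 1, "50": 1, "60": 2}
NESTED = {"10", "50", "60"}


def survives(level: str, n_drives: int, failed: Iterable[int], group_size: int = 0) -> bool:
    """Return True if the array still holds its data after `failed` drives die.

    Drives are numbered 0..n_drives-1. For nested levels, each run of
    `group_size` consecutive drives forms one group.
    """
    failed_set = set(failed)
    if level not in NESTED:
        group_size = n_drives  # a simple level is one group covering every drive
    if group_size <= 0 or n_drives % group_size != 0:
        raise ValueError("n_drives must be a positive multiple of group_size")
    limit = TOLERANCE_PER_GROUP[level]
    for start in range(0, n_drives, group_size):
        lost = sum(1 for d in range(start, start + group_size) if d in failed_set)
        if lost > limit:
            return False  # one group lost more drives than it can tolerate
    return True


# RAID 10 on four drives, pairs (0,1) and (2,3): same number of failures,
# different outcome depending on which drives they are.
lucky = survives("10", 4, [0, 2], group_size=2)
unlucky = survives("10", 4, [0, 1], group_size=2)
```

In this model `lucky` comes out true and `unlucky` false, which is exactly why you shouldn’t count on surviving a second failure in the nested levels even though you often will.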

Of course we recommend you have a hot spare in your system. This is just a drive sitting there watching all the other drives (but doing nothing itself). When one drive dies, the card will initiate a rebuild onto the hot spare, minimising the time your array spends without redundancy. While the system is rebuilding you can trot off to the shop to get the failed drive replaced. When you get the new drive, you swap it in for the failed drive and then make the new drive the hot spare. Don’t forget to move the nice neat printed label you put on the front of the system indicating which drive is the hot spare. Do not, under any circumstances, try to re-arrange the drives so that the hot spare is back where you originally had it, either physically or per drive ID. We can handle a bit of randomness - it’s humans that just have to be neat.
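
If it helps to see why the hot spare "moves", here is a small Python sketch of the sequence above. It is purely an illustration of the bookkeeping, with a made-up `Array` class and slot numbers; it does not represent the card’s firmware or the Adaptec Storage Manager interface.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Array:
    """Toy model: which slots hold array data and which slot is the hot spare."""
    members: Set[int]
    hot_spare: Optional[int] = None
    degraded: bool = False

    def fail(self, slot: int) -> None:
        """A member drive dies; rebuild onto the hot spare if one exists."""
        if slot not in self.members:
            raise ValueError(f"slot {slot} is not an array member")
        self.members.remove(slot)
        if self.hot_spare is not None:
            self.members.add(self.hot_spare)  # the spare becomes a full member
            self.hot_spare = None
        else:
            self.degraded = True  # nothing to rebuild onto yet

    def replace(self, slot: int) -> None:
        """A new drive goes into the failed slot."""
        if self.degraded:
            self.members.add(slot)  # rebuild straight onto the new drive
            self.degraded = False
        else:
            self.hot_spare = slot  # array is already whole: new drive is the spare


array = Array(members={0, 1, 2}, hot_spare=3)
array.fail(1)      # slot 1 dies, the old spare in slot 3 joins the array
array.replace(1)   # the new drive in slot 1 becomes the hot spare
```

After this sequence the array lives on slots 0, 2 and 3 and the hot spare is in slot 1, not slot 3 where it started. That is the state the card is happy with, so move the label rather than the drives.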

So as you can see, you can survive a fair amount of damage happening on your system without your world falling to pieces, but that does not, ever, mean you don’t need good backups. My mate Murphy was an optimist … if you have good backups then nothing much ever goes wrong. It seems to me that if you don’t have backups then fate kicks you at the worst possible time.

Ciao
Neil
