I recently read a draft document aimed at outlining RAID maintenance “best practices”. This has been developed to counter some of the insane practices of our customers such as recycling used drives in RAID configs because the drive “seems OK”.
While the document jumps pretty much straight into proprietary and unique Adaptec features, it made me think of some fairly generic steps a customer can do to protect their data that fit almost every scenario or vendor.
Management software - install it. It’s amazing to me the number of servers out there that don’t have RAID management software installed. Installing the software is the first step … looking at it on a regular basis is the next step. Yes, you can set up all sorts of alarms and notifications in all manner of software, but I have certainly been caught out by the network gurus blocking ports on systems I set up ages ago. I set the system up correctly, but it’s not working now because of other factors. Simply running the software and actually looking at it is half the battle.
Physical monitoring … wow, I’ve never seen that red light on the front of the box before! Looks pretty doesn’t it? Again, all the software in the world doesn’t replace a set of eyes and some grey matter between the ears.
Email and system notification. Set it up. Test it. Check it regularly. Will your system actually send you a message when something fails? Is the email address it is sending information to still relevant or has that person left the company and that Exchange mailbox been closed? One of the major causes of total RAID failure is customers ignoring an initial drive failure. In a RAID 5 environment they always find out about the second drive failure because the system falls on its face.
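One way to "test it" is to push a clearly labelled test message through the same mail path your alerts use. The following is only a rough sketch using Python's standard library. The addresses, server name and SMTP host are placeholders I have made up for illustration, and your RAID management software will have its own test-notification feature that you should use as well.

```python
import smtplib
from email.message import EmailMessage


def build_test_alert(sender, recipient, server_name):
    """Build a clearly labelled test message for the alert mailbox."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = f"[TEST] RAID alert path check for {server_name}"
    message.set_content(
        "This is a test of the RAID notification path. "
        "If you are reading this, the mailbox is still alive."
    )
    return message


def send_test_alert(message, smtp_host, smtp_port=25, timeout=10):
    """Send the message through the given SMTP relay (host is an assumption)."""
    with smtplib.SMTP(smtp_host, smtp_port, timeout=timeout) as connection:
        connection.send_message(message)


# Placeholder values, not real infrastructure.
alert = build_test_alert("raid@example.com", "ops@example.com", "fileserver01")
```

Calling send_test_alert(alert, "mail.example.com") on a schedule, and then confirming that a human actually received it, covers both the blocked-port case and the closed-mailbox case.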
Log checking … it’s always an interesting exercise to look in Windows event viewer and find 30 pages of red icons … basically the server screaming about a problem that no-one is taking any notice of. This comes back to running your eyes over your server … physically and via the management software (RAID management software and system management software).
RAID integrity checking … all vendors will have a way of checking the integrity of their RAID array on a regular or scheduled basis (or at least they should have). This level of checking is concerned with the consistency and accuracy of the parity data spread amongst the disks. It should be checked on at least a weekly basis (background and automated is the way to go).
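To make "consistency of the parity data" a little more concrete, here is a toy sketch of what a single-parity (RAID 5 style) check does for one stripe: XOR the data blocks together and compare the result with the stored parity block. This is an illustration only. Real controllers do this in firmware across the whole array, and the function names and sample bytes here are my own.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized data blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(data_blocks, stored_parity):
    """True when the stored parity matches parity recomputed from the data."""
    return xor_parity(data_blocks) == stored_parity


# Three tiny "data blocks" standing in for one stripe across three drives.
data = [b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"]
parity = xor_parity(data)

print(stripe_is_consistent(data, parity))      # True

# Flip one bit in the second block to simulate silent corruption.
damaged = [data[0], b"\x33\x32", data[2]]
print(stripe_is_consistent(damaged, parity))   # False
```

A scheduled check is this comparison repeated for every stripe, which is why it is best left running in the background.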
Correct drive choice … this one is controversial. Many, many users are running “desktop” drives in servers these days because of price and capacity issues. It’s the drive vendors who specify the difference between desktop and enterprise drives, and you can always get an argument from any tech by stating that one type is better than the other, but make sure you are using drives that your disk drive vendor supports in a RAID environment.
Documentation … what parameters were used when building the array in the first place? What firmware revision was on the card when the array was originally (or subsequently) created? What stripe size was used? Almost everyone tells me “I just used the defaults”, but almost no-one can tell me what they are. What size arrays did you make (down to granular size details)? All this information and more may seem trivial and easy to record when you are setting up a system, but when the pressure is on in a failure situation it can make life a lot easier for the tech to sort out your issues.
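A build record does not need to be fancy. As a minimal sketch, something like the following written out at build time and kept off the server would answer most of those questions. The field names and every value shown are hypothetical examples of mine, not defaults from any particular controller.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ArrayRecord:
    """Parameters worth writing down when an array is created."""
    controller_model: str
    firmware_version: str
    raid_level: int
    stripe_size_kb: int
    drive_count: int
    array_size_gb: float
    created_on: str


def to_json(record):
    """Serialise the record so it can be saved alongside other server docs."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Hypothetical values for illustration only.
record = ArrayRecord(
    controller_model="example-controller",
    firmware_version="0.0.0-example",
    raid_level=6,
    stripe_size_kb=256,
    drive_count=8,
    array_size_gb=5580.0,
    created_on="2010-01-01",
)
print(to_json(record))
```

Print it, email it to yourself, put it in the server folder … anywhere except only on the array it describes.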
Correct RAID choice … this is often seen as a performance issue (which it is), but it’s also a redundancy/safety issue. If you are using large SATA drives use RAID 6 instead of RAID 5 … the performance difference is almost negligible these days, and RAID 6 survives a second drive failure during a rebuild where RAID 5 does not.
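The trade-off is easy to put in numbers: RAID 5 gives up one drive's worth of capacity to parity and tolerates one failure, RAID 6 gives up two and tolerates two. A small sketch of that arithmetic, with a function name and example drive sizes of my own choosing:

```python
def raid_summary(level, drive_count, drive_tb):
    """Usable capacity and failure tolerance for RAID 5 or RAID 6."""
    parity_drives = {5: 1, 6: 2}[level]   # one parity drive's worth for 5, two for 6
    minimum_drives = parity_drives + 2    # 3 drives for RAID 5, 4 for RAID 6
    if drive_count < minimum_drives:
        raise ValueError(f"RAID {level} needs at least {minimum_drives} drives")
    return {
        "usable_tb": (drive_count - parity_drives) * drive_tb,
        "failures_tolerated": parity_drives,
    }


# Eight 2 TB drives, as an example.
print(raid_summary(5, 8, 2))  # {'usable_tb': 14, 'failures_tolerated': 1}
print(raid_summary(6, 8, 2))  # {'usable_tb': 12, 'failures_tolerated': 2}
```

In this example the cost of the extra protection is one drive's capacity out of eight.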
Up to date software … keep your firmware, drivers and management software up to date. Most customers are always updating their OS, and it makes just as much sense to do the same for your controller firmware, drivers and RAID management software.
Backups … did I mention that you still need backups? I choke when I hear the statement “we don’t trust our backups” or “we can’t be sure of the integrity of our backups”. For goodness sake, this is one you definitely, positively and absolutely must get right.
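The only way to trust a backup is to restore from it and check what comes back. As a minimal sketch, assuming you have restored a file to a scratch location, you can compare checksums of the original and the restored copy. The function names are mine, and a real restore test would cover whole directory trees and application-level checks too.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large backup files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original_path, restored_path):
    """True when the restored file is byte-for-byte identical to the original."""
    return sha256_of(original_path) == sha256_of(restored_path)
```

Run something like restore_matches on a sample of files after a scheduled test restore, and treat any False as seriously as a failed drive.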
If you are doing the majority of the above then you are halfway towards having a stable long-term server that will look after your data for you. So how good are you at “best practice”?
Ciao
Neil