Friday, January 28, 2011

When is RAID worth the trouble?

In our shop we're faithfully using RAID in all our workstations, probably just because that seems to be the way it ought to be done. I'm talking about workstations for scientific simulations, using the onboard RAID chips.

But I've heard a lot of RAID horror stories. Stack Overflow itself has had an outage caused indirectly by a RAID controller.

RAID protects you against a very narrow type of failure - physical disk failure - but at the same time it also introduces extra points of failure. There can be problems with the RAID controller, and there often are. In our shop at least, it seems that RAID controllers fail at least as often as disks themselves. You can also easily mess something up with the process of swapping a faulty drive.

When is RAID worth the trouble? Don't you get a better return on investment by adding more redundancy to your backup solutions? Which type of RAID is better or worse in this regard?

Edit: I've changed the title from the original "Is RAID worth the trouble?", so it sounds less negative

  • Don't worry, RAID isn't used throughout the business world because of groupthink! The chance of decent RAID controllers failing is far, far lower than the chance of a disk failure. I don't recall ever seeing a RAID controller fail in real life, while I've seen many a disk die, both in the office and datacenter.

    PS: I see your tags. RAID is not backup! :)

    amarillion : Right, it's not backup. So then it's redundancy? So it's really all about high up-times? Unless you need five nines, you don't really need RAID?
    Matt Simmons : No, it's about availability. Taking down the machine when you want to is fine. Having a single hard drive decide to take down your machine isn't. Using RAID properly prevents that from happening.
    Wedge : @amarillion. Wow, that's a dangerous sentiment. How much experience with hard-drives do you have? RAID is pretty much required for even *2* nines of reliability (more so the more hard-drives are in the mix), and RAID alone definitely will not get you to 5 nines, you'll need redundant datacenters for that, at least. Even then it's a crapshoot, 5 nines is management fantasy land BS, that's less than an hour of downtime per decade (~5 min/year). Not even IP backbones have that.
    amarillion : Alright, I was just exaggerating with the 5 nines. My point is, In our case we're probably using RAID for the wrong reason.
    duffbeer703 : @amarillion: Some of my customers have developers on site billing $200/hr. Or workers responding to life or death situations. Disrupting those workers for want of an $80 hard disk seems kinda dumb to me, YMMV.
    Brock Woolf : What are you talking about. I have a RAID linux machine that IS my BACKUP for my Mac. I think you're getting a little too 'clever' with your technicalities Alex.
    Alex Jurkiewicz : No. RAID protects you from hard drive failure. It does not protect you from 'rm -rf /'. THAT is what backups are for!
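For anyone who wants to check the "nines" figures quoted in the comments above, here is a minimal Python sketch of the arithmetic. It assumes an average year of 365.25 days, and the function name is made up for illustration.

```python
# Downtime budget implied by an availability target of N "nines".
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap days


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of 1 - 10**-nines."""
    unavailability = 10.0 ** (-nines)
    return MINUTES_PER_YEAR * unavailability


for n in (2, 3, 4, 5):
    per_year = downtime_minutes_per_year(n)
    per_decade_hours = per_year * 10 / 60
    print(f"{n} nines: {per_year:10.2f} min/year, {per_decade_hours:8.2f} h/decade")
```

Five nines comes out at a bit over 5 minutes a year, which is indeed under an hour per decade, while two nines still allows more than three and a half days of downtime a year.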
  • Why bother on a workstation? Surely you have all your home directories and data stored centrally. That is where you want to use raid.

    From Geoff
  • ZFS by Sun (also part of OpenSolaris; Apple's OS X - currently read-only) not only does RAID at various levels but also always checks that the data written to disk is actually there. Consistency is key! RAID is useless if you can't rely on its integrity. Pick a decent RAID controller (I prefer HP's) and scrub your RAID periodically to find errors.

    Software RAID (such as ZFS), on the other hand, makes you more hardware independent if the RAID controller dies and you can't get an exact replacement.

    From lepole
  • Hard drive failures are much more likely to happen in a server than a desktop workstation...

    You can't just say "adding more points of failure" without taking into account the likelihood of that failure. Especially since these less likely points of failure are specifically in place to guard against the more likely hard drive crash. As you've put it, you've basically created a Pascal's wager-like fallacy.

    Most RAIDs on desktop motherboards are cheapo software/hardware hybrids with most of the work done in their software driver. IMHO they are pieces of crap used to sell to power users.

    On the other hand, a good actual hardware RAID is quite reliable, and it has the hardware to do its thing without (despite?) the operating system. But those get expensive, because real hardware usually has battery backups, and a complete XOR'ing array to calculate checksums etc. Even more expensive if it's done using SCSI.

    TLDR: If you are running the mobo based raid systems, then no, it isn't worth the trouble.
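The XOR work mentioned above is, as far as I understand it, parity calculation of the kind used in RAID 5. Here is a rough Python sketch of the idea, not of any particular controller's implementation. The block contents and the function name are invented for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, one per disk, plus a parity block on a fourth disk.
data = [b"disk0 block", b"disk1 block", b"disk2 block"]
parity = xor_blocks(data)

# Pretend disk 1 died: XOR the surviving data blocks with the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])
```

A real controller does this per stripe in dedicated hardware, and a fakeraid driver does the same sums on the host CPU, which is the difference being complained about here.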

    duffbeer703 : A colleague runs a large school IT environment with 180,000 workstations with a top-notch helpdesk. 7% of their desktops require a hardware replacement within their 5 year lifecycle, and 85% of those replacements are hard disks.
    Ape-Inago : Yeah, but if a workstation goes down, you just have the user log into another machine while you are fixing the broken one. With that many workstations, there ought to be a central file repo. I wonder what the statistic would look like with 180,000 servers.
    duffbeer703 : You're right for many circumstances -- but not for everyone. In my friend's scenario, many of those PCs are in the back of classrooms, and if they are broken, that class doesn't have a computer and it's a big deal. At my job, we have spare workstations and don't really care.
    From Ape-Inago
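To put the school figures from the comments above into absolute numbers, here is a quick Python sketch. The inputs are the ones quoted (180,000 workstations, 7% replacements over 5 years, 85% of those being disks). Spreading the failures evenly over the lifecycle is my own simplification.

```python
def disk_replacements(workstations, replacement_rate, disk_share):
    """Expected number of disk replacements over the whole lifecycle."""
    return workstations * replacement_rate * disk_share


LIFECYCLE_YEARS = 5

total = disk_replacements(180_000, 0.07, 0.85)
per_year = total / LIFECYCLE_YEARS       # assumes failures are spread evenly
share_of_fleet = total / 180_000         # fraction of machines losing a disk

print(f"disk replacements over {LIFECYCLE_YEARS} years: {total:.0f}")
print(f"per year: {per_year:.0f}")
print(f"share of workstations affected: {share_of_fleet:.2%}")
```

That works out to roughly 10,700 dead disks over the lifecycle, or about 6% of the fleet.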
  • What is your failure rate on hard disks and raid controllers? Failure on the raid controller should be far lower than the disks. If you have a high failure rate you may want to look at your environment such as static discharges that could be causing issues.

    For workstations you may want to use software raid as suggested by Alakdae because you won't have to worry about maintaining stocks of the precise hardware controller. However you should have all vital information stored on your servers which do have hardware raid and are backed up to different media.

    Server hardware manufacturers do maintain raid controllers so even if it's an older controller you can usually still get it from them if you need to (it'll cost you a pretty penny though).

    From David Yu
  • I just had the RAID controllers in two (identical) servers fail. Since we got those two machines, we haven't had a single hard disk failure in the entire company.

    I think RAID on desktops is a bad idea; the cheap RAID controllers you're going to put in those machines will fail long before the actual hard drive.

    On servers, maybe. I'm not going to trust RAID controllers again; make sure you have a spare machine and good backups.

    From Nir
  • There are two types of RAID

    • One is the cheap integrated kind. This is NOT real RAID; the real work is done in software (a special driver that does the RAID computations). You should avoid this one.
    • The other one is expensive, but what you get is real RAID. If you can afford it, it is worth the money.

    Some operating systems have a good software RAID solution (this has nothing to do with the crappy cards mentioned above). Linux software RAID is especially good; its performance is really good.

    RAID can only improve reliability; it is not a backup solution. Files can be deleted accidentally, and a faulty disk can return bad data that then gets duplicated to the other disks in the array, so a real backup solution is still needed.

    From cstamas
  • I am a developer and all our workstations use RAID for the internal drives. RAID 0. This is definitely worth it. You never want to go back to compiling from a single 7200RPM drive once you have tried a pair of 15000s.
    I have been challenged on whether it is the RAID or the 15k drive that is making compile times shorter. I don't know; for compiling, a single fast drive may give exactly the same performance. However, a single SAS drive is not particularly large for a modern PC, so inexpensive on-board RAID still has a place. That, and I doubt RAID is ever going to hurt the performance of the system.
    I think this sort of RAID is certainly appropriate for a workstation and is probably best done using the inexpensive on-board controllers. From the server side, most of our servers have some form of RAID array for the OS disk and data is then on a separate array of some appropriate form. I don't know about our production servers, but our dev servers (of which we have a fair amount) have never had a controller fail; we have had drives fail, though. In one case we had half of the OS array fail on a SQL box, and while it was rebuilding, the other disc failed! Sometimes RAID1 just ain't enough!

    niXar : I have to call BS on this one. RAID 0 is useless for a developer workstation. RAID 0 at best doubles transfer rates; it does nothing for random access. Guess what developers do ... read and write lots of tiny files, and the occasional large-ish one. The only workstation it would be useful would be that of a graphic designer doing video editing, where you need all the GB/s you can get.
    pipTheGeek : This may be true, I haven't compared the performance of a single 15k sas drive to that of the dual drive raid 0. I have updated my answer.
    duffbeer703 : It depends on what your developers do. We have guys that work with big datasets who notice a significant performance improvement, especially during compiles. GIS guys notice an improvement with RAID 0 too.
    Loren Pechtel : Going from a 7.2k to a 15k drive would mean a substantial speedup. There's not a lot more to be gained from Raid 0.
    From pipTheGeek
  • Always. Disks are cheap, your information is not. But use software RAID, so you have the flexibility to move forward or change hardware later on (trust me, you will need it). And also use a checksumming filesystem like ZFS, to protect against silent data corruption (which is very likely with large disks nowadays).

    From Rudd-O
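To illustrate why a checksum helps where plain mirroring does not, here is a toy Python sketch. It is not how ZFS is implemented; it only shows the principle that a checksum stored separately from the data lets you tell which copy of a mirrored block is the good one. SHA-256 and all names here are my own choices for the example.

```python
import hashlib


def checksum(block: bytes) -> str:
    """Checksum stored separately from the data block itself."""
    return hashlib.sha256(block).hexdigest()


def read_verified(copies, expected):
    """Return the first mirrored copy whose checksum matches, else raise."""
    for copy in copies:
        if checksum(copy) == expected:
            return copy
    raise IOError("all copies failed checksum verification")


original = b"simulation results, run 42"
stored_sum = checksum(original)

# One mirror half silently returns corrupted data, the other is intact.
corrupt_copy = b"simulation results, run 43"
good_copy = original

print(read_verified([corrupt_copy, good_copy], stored_sum) == original)
```

A plain RAID 1 without checksums sees two copies that disagree and has no way to know which one to believe.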
  • Although backups and RAID are solutions to different problems, most "RAID problems" are very similar to the most common backup problem (ie. nobody tests a restore) -- nobody tests system recovery. Other RAID problems are often a direct result of people not understanding what it does and doesn't do. For example, many people think that RAID guarantees the integrity of their data -- it does not.

    For workstations, if you're using RAID-0 to improve performance of IO-bound applications, or RAID-1/5/6 to keep the $100/hour scientist working when her $80 hard disk fails, you're using RAID appropriately. Just don't confuse disk redundancy with backup, and have tested procedures in place to ensure that your IT guys handle recovery.

    osij2is : Good note for workstations. Workstation needs are completely different than server needs. And an *emphatic* yes on "..don't confuse disk redundancy with backup".
  • For those of you saying you won't use hardware RAID because if the controller fails and you can't get an identical replacement you're screwed, you're going about it the wrong way.

    1. If uptime is that critical to you, you should NOT be buying cheap hardware. As was said before, use a good raid controller, HP, LSI, Dell etc.

    2. If the controller was purchased from the computer manufacturer, i.e. a Dell server with a Dell RAID controller, Dell will tell you how long they will be stocking those parts; usually this is in the range of 4+ years from the EOL of that server.

    3. If having someone running again quickly means you cannot wait for the delivery then you should be purchasing a second spare controller for yourself, regardless of who made it.

    4. If you set up a RAID 1, you can sometimes take one of those drives and drop it on a normal controller to recover the data. If that is important to you, confirm/test this with your controller before you are in a critical situation.

    Hardware RAID saved my butt twice. Once, in an email server, one of the drives failed. I got the email alert from the RAID monitoring software on that machine, called up Dell and had a new drive the next day, popped it in and it rebuilt all on its own. ZERO downtime on that one.

    The second time, a drive failed in an old file server that was scheduled for replacement in 6 months. The controller kept it running and we moved the replacement of the server up to that week. Saved buying a new drive (since it was out of warranty) and again ZERO downtime.

    I've used software RAIDs before and they just don't recover as nicely as hardware-based ones. You have to test your setup, software or hardware, to be sure it works and to know what to do when the brown stuff hits the fan.

    osij2is : People tend to look at RAID as a type of insurance. If they don't get an "accident", then the benefits of RAID (insurance) don't ever seem apparent. Thanks for sharing your story as many people (I think) take RAID lightly because if they never have a bad experience, why invest in something that may not happen? This should be a lesson for everyone who's reading: a solid, hardware RAID controller will save your ass in that one in a million/billion chances. Don't leave it to chance; always use a good hardware RAID controller especially for servers.
    From LEAT
  • Linux software Raid is excellent, and it actually beats low-end hardware raid hands down. It also has a few optimizations that can be useful for a workstation. For example, it can read different things on each disk at the same time, effectively doubling random access read times, which is a common use case unlike transfer rate-bound operations optimized by RAID0.

    As for reliability, it's a very well maintained part of the Linux kernel, used by millions, it handles hardware failure very well, so it's clearly a win as far as availability is concerned. I have used it on my personal workstations as well as a few dozen low-end servers for years, some pretty loaded, and never could attribute it any fault. I've experienced a good dozen broken disks in the mean time, however.

    (Higher end hardware raid cards have other features though, such as battery-backed write cache. It basically multiplies random synchronized disk write speed by ten. Absolutely necessary for databases, probably pretty useless for workstations.)

    Bill Weiss : I hope it doubles the random access read /speed/ , not read /time/ :)
    From niXar
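Software RAID only helps availability if somebody notices when a disk drops out. On Linux the array state is exposed in /proc/mdstat, and a small check is easy to write. Below is a rough Python sketch that works on a sample string. The sample is invented but follows the usual mdstat layout, and the parsing is deliberately simplistic, so treat it as an illustration rather than a monitoring tool.

```python
import re

# Invented sample in the usual /proc/mdstat layout: md0 healthy, md1 degraded.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid1 sdb2[1](F) sda2[0]
      488279488 blocks [2/1] [U_]

unused devices: <none>
"""

ARRAY_RE = re.compile(r"^(md\w*)\s*:")
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return names of arrays with fewer active members than expected."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        array = ARRAY_RE.match(line)
        if array:
            current = array.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected or "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded


print(degraded_arrays(SAMPLE_MDSTAT))
```

On a real machine you would read the text from /proc/mdstat instead of the sample string. mdadm also has its own monitor mode, which is probably the better choice for anything serious.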
  • For your scientific workstations it may well be worth it IF those systems work better with their data stored locally, as opposed to a share on a file server. For the general populace however I'd say no. It's not worth the hassle and headache when all you really need is to restore data that should be kept on shares.

  • RAID is only useful when you absolutely positively can't have the server go down unexpectedly. We use RAID on all our servers in our datacentre where there isn't some other form of redundancy. For example we don't use RAID on our webservers, because there's another 10 still working.

    The litmus test is "if a disk breaks in the middle of the night and it can't wait until 9am, it needs RAID"

  • If you worry about a drive controller failing, then you also need to consider the server failing - fans, motherboard, RAM, network.. and then you also need to consider the router failing, and the cabling, and the power... and you also need to consider the datacentre failing (flood, fire, human error), and then you need to consider the external network failing (cables cut - all the time in some places!).

    In short, you can worry about site downtime so much you'd never bother putting anything online at all! Or you could factor the risk of failure against the cost of redundancy and get a much more realistic approach. And of all the things I listed, the hard drive is the single most likely point of failure.

    Next to human error, that is. Who type "shutdown -h now" when they wanted to reboot.... :(

    From gbjbaanb
  • RAID is worth the trouble when you have a battery-backed controller.

    For server applications which frequently fdatasync() log files (which is not uncommon in databases) for durability, you'll end up writing the same blocks over and over again. This will kill IO performance if you don't have a battery backed controller.

    If you DO have a battery backed controller, many of the writes won't even reach the discs, instead just staying in memory until they're replaced by another write. This is a Good Thing.

    The redundancy is a bonus but not essential, as important things should be redundant at a system level.

    From MarkR
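If you want to see the cost of syncing every log write on your own hardware, something like the following Python sketch can be used. It is only a rough timing harness. The record size, record count and function name are arbitrary choices of mine, and it falls back to os.fsync on platforms where os.fdatasync is not available.

```python
import os
import tempfile
import time

# os.fdatasync is not available on every platform; fall back to os.fsync.
_sync = getattr(os, "fdatasync", os.fsync)


def timed_log_writes(path, records, sync_each):
    """Append records to path, optionally syncing after each one.

    Returns the elapsed wall-clock time in seconds.
    """
    start = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for record in records:
            os.write(fd, record)
            if sync_each:
                _sync(fd)  # force the data to stable storage, like a database would
    finally:
        os.close(fd)
    return time.perf_counter() - start


if __name__ == "__main__":
    records = [b"x" * 128 + b"\n"] * 200
    with tempfile.TemporaryDirectory() as tmp:
        unsynced = timed_log_writes(os.path.join(tmp, "a.log"), records, False)
        synced = timed_log_writes(os.path.join(tmp, "b.log"), records, True)
    print(f"without sync: {unsynced:.4f}s, with sync per record: {synced:.4f}s")
```

How big the gap is depends entirely on the storage underneath, which is the point being made above: with a battery-backed write cache the synced case should get much closer to the unsynced one.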
  • Cheap RAID implementations are terrible.

    Your choices are, in order of reliability:

    1) HP DL servers with their hardware RAID.
    2) 3Ware RAID cards.
    3) ZFS
    4) Linux Software Raid
    

    Anything else is asking for trouble, and indeed may result in lower overall reliability than a non-RAID solution.

    Consider what to do if your controller fails and the manufacturer is out of business.

    Consider whether you can recover from an apparent double-disk failure caused by power/cabling issues.

    Those are two examples among hundreds.

    From carlito
  • RAID is great for uptime, but it's not a substitute for backup. As a colleague once commented, "You know that 'Oh, sh!t' moment when you deleted something accidentally? RAID just means you get to 'Oh, sh!t' more than one drive at the same time."

    That said, that day when you pop your head into your boss's office and tell her, "By the way, the database server had a hard drive crash last night-- we never went down, it finished rebuilding onto the spare at 5 AM and I've sent the bad drive off under warranty" -- that's when RAID is priceless.

    From
  • For workstations RAID is probably not worth it compared to having a new system on which data can be restored...

    Many were talking about RAID 0...that's not there to help availability. You're doubling the chances of the volume failing, since once one drive dies you lose the whole thing. RAID 0 is just about playing with speed of access to reads/writes on a volume and giving more storage. The only way this could help in a business environment is to take two RAID 0's and mirror them as RAID 1.
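    The "doubling" can be checked with a little probability. The Python sketch below assumes each disk fails independently with the same probability over some period. The 3% figure is purely an assumed number for illustration, and rebuild windows and correlated failures are ignored.

```python
def raid0_failure(p, disks=2):
    """Striped volume is lost if ANY member disk fails."""
    return 1 - (1 - p) ** disks


def raid1_failure(p, disks=2):
    """Mirrored volume is lost only if ALL member disks fail."""
    return p ** disks


def raid01_failure(p, disks_per_stripe=2):
    """Two RAID 0 stripes mirrored: lost only if both stripes are lost."""
    return raid0_failure(p, disks_per_stripe) ** 2


p = 0.03  # assumed per-disk failure probability over the period of interest

print(f"single disk:      {p:.4%}")
print(f"RAID 0, 2 disks:  {raid0_failure(p):.4%}")
print(f"RAID 1, 2 disks:  {raid1_failure(p):.4%}")
print(f"RAID 0+1, 4 disks: {raid01_failure(p):.4%}")
```

    For small p the two-disk RAID 0 figure is just under twice the single-disk figure. Real disks bought in the same batch and sitting in the same box do not fail independently, so the mirrored numbers are optimistic.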

    RAID is not a backup solution, as has been pointed out.

    RAID is also not perfect. I think this post from this guy's blog kind of sums up how I feel about RAID and when it's worth it: Thinking of RAID?

    On a workstation you should be able to get one person to use another system while a replacement is rolled out. Why use RAID? His or her data should be stored on the server where management, data integrity and backups are centralized. The workstation should be configured so that it can be periodically upgraded or altered as finances allow and the RAID is just another layer of cost and headache to manage (plus power use and heating issues with added drives and airflow imposition). In the majority of cases for businesses it's probably far more cost effective to put the money from a RAID card into a bigger drive, and if you're using onboard RAID then you're still going to have issues since it tends to tie the RAID format to the motherboard (and it's not true RAID anyway...it's found in Google searches as "fake raid".) Unless you get a very similar motherboard to replace one when it goes bad you may not be able to get back into your RAID volume!
