Thread: RAID 5 danger with URE's

  1. #1 RAID 5 danger with URE's 
    Senior Member
    Join Date
    Mar 2011
    Location
    Southern California
    Posts
    169
    In thinking about what kind of storage to buy to start working as a DIT, I've come across many articles such as this one:
    http://www.zdnet.com/blog/storage/wh...ng-in-2009/162
    which explain why RAID 5 is extremely dangerous with modern large-capacity drives. Basically, the bit error rate is so high on common drives (1 in 10^15) that in the event of a drive failure, you are very likely to hit an error during the rebuild. I'd like to hear from working DITs on here: what do you think about this? Does everyone use RAID 10? Or, for the same money, use RAID 5 with SAS drives that have a better error rate (1 in 10^17)?

  2. #2  
    Senior Member Jarek Zabczynski's Avatar
    Join Date
    Apr 2007
    Location
    Poughkeepsie, NY
    Posts
    1,116
    I remember reading that a few years back. Would love to hear Jeff's take on this.
    Shoot for the Impossible...Then do it.

    Jarek Zabczynski
    Director / Editor / Cinematographer


    Scarlet X - #525 | Epic X - #??? | www.jarek.com | WE'LL BE ALRIGHT (Music Video) | INCREDIBLE (Scarlet Music Video)

  3. #3  
    Senior Member Bob Gundu's Avatar
    Join Date
    Mar 2011
    Location
    Toronto, ON Canada
    Posts
    3,708
    ___________________________

    VFX, Cinematographer, Photographer
    10 frame handles
    Vimeo
    Scarlet #329 "HAL"

  4. #4  
    I mostly disagree with that assessment linked to on ZDNet. Oh, wait, that article is 5 years old and has been pretty much proven to be a bunch of rubbish. It's true that bit errors are more common with higher capacity drives, but they're also pushing around and storing a lot more data than drives were 10 years ago, let alone 15 years ago when the hottest shit on the market was a 4GB drive. In contrast to the actual number of bits stored on a hard drive these days, the error rate is much lower in a proportional sense than it ever has been before. Error monitoring and correction onboard both drives and host controllers is more robust and effective than it ever has been in the past.

    Of course, data corruption is always a threat, and errors can and do happen. Any primary copies of mission-critical data should be verified and maintained through a reliable verification process, such as MD5 checksums or other comparative means. Of all the times I see data corruption, it is almost always, always, always! due to a failing component on a drive, a failing controller, bad cabling, or a power issue.
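
    As an illustration of the checksum verification mentioned above, here is a minimal Python sketch that compares a source file against its backup copy by MD5. The function names and the 1 MiB chunk size are assumptions made for this example, not part of any particular offload tool.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that large camera files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(source_path, backup_path):
    """True when both files hash to the same MD5 digest."""
    return md5_of_file(source_path) == md5_of_file(backup_path)
```

    Calling copies_match() on an original and its copy returns False if even one byte differs, which is the kind of silent corruption a plain file copy would not report.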

    Error potential in a system increases more with the number of drives than with the capacity per drive. That's been my experience and what most in the IT industry seem to accept. As RAID systems grow with more drives, we begin to see more elaborate configurations to improve redundancy and reduce downtime when a failure occurs and recovery is needed. RAID-6+1 with multiple hot spares is a popular configuration in the datacenter I consult for. Typically that involves multiple RAID-6 arrays that mirror one another, each with a hot spare drive available for every so many active drives in the array. Redundant? Yes. Expensive? Yes. Reliable? Yes.

    For those unfamiliar with RAID-6, it is like RAID-5 but with an additional block of parity data. It eats up more capacity for sure, and it also increases the fault-tolerance of the system quite significantly.

    Going back to the error potential discussed above, all good RAID hosts create and validate this parity data non-stop. In secure RAID systems, parity data is read and compared on the fly when retrieving data; if one drive exhibits parity errors, this can be logged and the errors corrected. If the problem is persistent, then the drive is usually flagged or immediately locked out and a spare is engaged to rebuild the array. It is also wise to maintain large arrays by occasionally "refreshing" them, essentially a rebuild process where all parity data is verified across all data on the volume.
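
    To make the parity idea concrete, here is a toy Python sketch of single (RAID-5 style) XOR parity over one stripe. It is a simplified model, not how any particular controller is implemented; the block contents and the names xor_blocks and stripe_is_consistent are invented for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte column at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(blocks, parity_block):
    """A parity check: recompute parity from the data and compare."""
    return xor_blocks(blocks) == parity_block


# Hypothetical stripe: one 8-byte block per data drive.
data_blocks = [b"REDCODE1", b"REDCODE2", b"REDCODE3"]
parity = xor_blocks(data_blocks)

# Simulate losing the second drive, then rebuild its block
# from the surviving data blocks plus the parity block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
```

    Here rebuilt equals the lost block because XOR-ing the survivors with the parity cancels everything except the missing data. The sketch also shows why a rebuild needs every surviving block of a stripe to be readable: with single parity, one more unreadable block in the same stripe leaves nothing to reconstruct from. How a real controller reacts in that case varies by product.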

    For error control on individual drives or simple arrays such as a RAID-0 volume, we're pretty much left with basic functionality of our controllers and the error control systems on the hard drives themselves. Don't discount these systems, they do a rather good job. Just always be mindful that drives can and do fail as all eventually will after enough regular use or even the simple passage of time. It's always prudent to maintain reliable backups and to never put all your eggs into one basket as they say.
    - Jeff Kilgroe
    - Applied Visual Technologies, LLC | RojoMojo
    - EPIC-M Package Available! Over 1TB SSD media, RPP's & more.


    List of all current RED software tools.

  5. #5  
    Jeff, thanks for the first-hand experience from the real world. Posts like yours make Reduser worth checking.

  6. #6  
    Junior Member
    Join Date
    Jan 2012
    Posts
    10
    agreed! wow jeff...

    question. early on in your supermicro mobo experience i asked you to run a maxmem test to see the bandwidth on the ddr. you provided results. would you mind running that very same test on your HP z? i am quite certain i have pinned this slow memory bandwidth to the sandy bridge processors with the intel c7x chipset when used in dual processor mode. i believe the HP z is on a different chipset and might yield higher scores...

    thx!

    -Jason Enzer

  7. #7  
    Senior Member
    Join Date
    Mar 2011
    Location
    Southern California
    Posts
    169
    Thanks for the detailed response, Jeff. But I'm not understanding one point: You say
    the error rate is much lower in a proportional sense than it ever has been before
    But according to Western Digital's specs, the RE4 drives are rated at <1 in 10^15 non-recoverable read errors per bits read. The RE SAS drives are <10 in 10^16 (which is a fancier way of saying the same thing, I think). So that rate is exactly the same as it was several years ago. And they are talking about non-recoverable errors, so I would think that already accounts for error correction.

    In a hypothetical array of eight 2TB drives in RAID-5, if one drive fails, the rebuild process will be reading 14TB of data. At 1 in 10^15 bits, that's roughly an 11% chance of a URE. What I'm fuzzy on is what happens after the URE. Does the entire rebuild fail and have to be retried? Will the array even notice the URE, or will it continue rebuilding with corrupted data?

    http://www.wdc.com/wdproducts/librar...879-771386.pdf
    http://www.wdc.com/wdproducts/librar...879-701338.pdf
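
    The 11% figure can be reproduced with a short Python sketch. It assumes that errors are independent, that they occur at exactly the spec-sheet worst-case rate of 1 in 10^15 bits, and that drive sizes are decimal terabytes; the function name ure_probability is made up for this example.

```python
import math


def ure_probability(bytes_read, bits_per_error=1e15):
    """Chance of at least one unrecoverable read error while reading
    bytes_read bytes, assuming independent errors at the rated rate."""
    bits_read = bytes_read * 8
    # P(at least one) = 1 - (1 - p)^n, computed via log1p/expm1
    # so the tiny per-bit probability does not get rounded away.
    return -math.expm1(bits_read * math.log1p(-1.0 / bits_per_error))


# Seven surviving 2TB drives must be read in full to rebuild the eighth.
rebuild_bytes = 7 * 2 * 10**12
print(f"{ure_probability(rebuild_bytes):.1%}")
```

    Under these assumptions the result comes out just under 11%. Since the spec is a "less than" bound, this is closer to a worst case than to an expected value.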

  8. #8  
    Senior Member
    Join Date
    Dec 2006
    Posts
    3,287
    Can't beat Jeff's knowledge, but my guess is that manufacturers' error rates are probably conservative. And remember, they are also statistical. You might get an error straight away, or not at all during the life of the drive.
    Director/Digital Camera Operator/2nd AC/DIT/Data Manager
    London, UK.

    Life moves pretty fast. If you don't stop and look around once in a while, you could miss it.

  9. #9  
    All statistical and estimated. In proportion to capacity, drives these days produce far fewer errors that actually translate to corrupted data. A non-recoverable error does not necessarily mean an error where that erroneous bit is lost to oblivion. It means an error in which a specific bit that is requested from the drive, or is to be written to the drive, fails. Sometimes this is a problem, sometimes not; it depends on the system and the circumstances surrounding that data transaction.

    1 in 10^15 and 10 in 10^16 are indeed the exact same thing. It just depends on who is writing the marketing paper that day. The latter is an odd way to state it, but I'm sure someone probably thought 10^16 looked more impressive.

    So how many errors is that? Well, 10^15 bits is 1000000000000000 bits. That's a shitload of bits... In fact, in binary quantity speak, it's 113.687TB. So, what they're trying to say is that for every 113.687TB of data written to or read from the drive, they may have an erroneous bit... Or said erroneous bit should not occur more than once in every 113TB of data transacted. OK, so what does it really mean? Nothing. When a drive fails, it fails. You'll probably see a head crash or motor crash long before an anomalously erroneous bit pops up. And if those more common crashes occur, you'll have so many garbled bits, you won't know what to do with them.
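
    The 113.687 figure above can be checked with a few lines of Python. The function name bits_to_capacity is invented for this example; it returns the same quantity in decimal terabytes (10^12 bytes) and in binary units (2^40 bytes, usually written TiB).

```python
def bits_to_capacity(bits):
    """Convert a bit count to (decimal TB, binary TiB)."""
    total_bytes = bits / 8
    decimal_tb = total_bytes / 10**12  # units used on drive labels
    binary_tib = total_bytes / 2**40   # "binary quantity speak"
    return decimal_tb, binary_tib


decimal_tb, binary_tib = bits_to_capacity(10**15)
print(f"{decimal_tb:.0f} TB decimal, {binary_tib:.3f} TiB binary")
```

    So 10^15 bits is 125TB in the decimal units drive makers use, and about 113.687 in binary units, which matches the number quoted above.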
    - Jeff Kilgroe
    - Applied Visual Technologies, LLC | RojoMojo

  10. #10  
    Senior Member
    Join Date
    Nov 2009
    Posts
    787
    I'll add some info on the Western Digital RE4 drives.

    They are actually designed to fail in a RAID array, which sounds strange, but ... it's exactly what you want to have happen.

    Instead of heroically attempting to recover a bad sector, they give up, as they know they are in a RAID array, and map out the bad or marginal sector quickly.

    This means 2 things:

    1. You don't get slowed down by normal drives that could take 30 seconds or more to correct a marginal sector (rather disturbing for a real-time app dealing with audio or video).

    2. The marginal sectors get mapped out very quickly (I ?think? more aggressively than 'normal' drives), and are thus less likely to cause problems in the future.

    I ?think? that #2 would increase the reliability of the drive over time, except for catastrophic failure where the entire drive gives up the ghost.

    Good RAID controllers can be set to scan the drives continuously, and do a good job of policing themselves of potentially bad sectors.

    RAID 6 controllers add an extra level of security.

    The main thing you have to worry about is one or two drives failing during the rebuild (depending on RAID level).

    So, the size of the drives used (which is directly proportional to the rebuild time) is very important.

    A 4TB drive takes twice as long to rebuild as a 2TB drive.

    The density of the drive is also important. Higher density generally means more recoverable errors.

    You'd go crazy if you saw how much error correction is going on right now for all your hd reads.

    Using slightly older, less dense technology, but with more platters for more storage, ?may? give higher reliability?
