Mathimagical
12 years ago
General
Those of us who are computer geeks (and maybe even some of us who aren't) have probably heard of RAID arrays. They're arrays of individual disk drives that provide some amount of redundancy to promote availability. If a drive in the array fails, the array can continue to operate. Most often it continues at a degraded level of performance, but it is still available.
Before someone brings it up, RAID 0 often isn't considered a real RAID level. There's really no redundancy: all the drives carry a single stripe set, and if any drive fails, you pretty much write off the whole array. It's more of an AID, I suppose... But for RAID levels above 0, a drive in your array can fail and you won't lose any data.
There's no big mystery to RAID 1. RAID 1 is also known as mirroring, and that's exactly what it does: two same-sized drives that are exact duplicates of each other. Whatever is written to one is written to the other. If one fails, all your data is still intact on the other.
Things get more mysterious from there. When you have a RAID 5 array that uses all your drives, how can a drive in that array fail, yet the system can run along as if nothing was ever wrong? How does it find the data from the failed drive? Superserious voodoo magics? The answer is math. Or more specifically, the exclusive OR logical operator.
You're probably passingly familiar with logical operators. If you're a programmer and you've written a compound IF statement, you've used them. In fact, I just did it in that last sentence. AND and OR are two well-used logical operators. OR does just what it says on the tin: if one or the other condition is true, it returns true.
Exclusive OR (XOR) is very similar, only the two choices are exclusive. You can have this or that, but not both. So in this case, if one or the other condition is true, exclusive or returns true. But if both conditions are true, it returns false. And it turns out when you start XORing numbers together, strange, spooky things happen!
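If you'd rather poke at this in code than on paper, here's a minimal Python sketch (the numbers are arbitrary, picked just for illustration) showing the property that makes everything below work: XOR undoes itself.

```python
a, b = 0b1011, 0b0110   # two arbitrary 4-bit values (11 and 6)

print(bin(a ^ b))    # 0b1101 -- each bit is 1 where exactly one of the inputs has a 1
print(a ^ b ^ b)     # 11 -- XORing with the same value twice cancels it out
print(a ^ a)         # 0  -- anything XORed with itself is 0
```

That "XOR with the same thing twice and you're back where you started" property is the whole trick behind parity recovery.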
Audience participation time! We're going to make our own theoretical RAID 4 array, so grab a spreadsheet or pencil and paper and let's do this! Set up five columns, each one representing a drive in the array.
When writing to RAID 4 or 5 arrays, data is written to all the drives except one, and those pieces of data are XORed together to get a parity checksum that is written to the last drive. In RAID 4, one drive is dedicated to hold the parity data. In RAID 5, the parity data is distributed across all the drives, rotated as each new stripe is written. I'm using RAID 4 in our theoretical example to keep things simple and have the parity data always on the last drive.
So here we go! Write down numbers on the first row for the first four columns. This is your data. Grab a calculator that will XOR (Windows calculator will do it at the very least) and XOR all your data together. Type in a number and XOR it with the next number, and the next and so on. When you get to the end, press equals and that's your parity checksum. Write it down in the fifth column. Do a few more rows if you like, computing the parity checksum after each row. Hey, you're building a RAID 4 array! Good job!
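If spreadsheets aren't your thing, the same exercise looks roughly like this in Python. The four data values are made up, and in a real array each one would be a whole block of data rather than a single small number.

```python
from functools import reduce
from operator import xor

# One stripe: one value per data drive (drives 1-4).
stripe = [23, 7, 142, 65]

# Drive 5, the parity drive, gets the XOR of everything else.
parity = reduce(xor, stripe)
print(parity)   # 223, i.e. 23 ^ 7 ^ 142 ^ 65
```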
OH NO! DISASTER! One of your drives just failed! Choose a drive (or roll a die) and consider it failed. If by some stroke of luck it's the parity drive that failed, you can see there's really no missing data. You should still replace that drive, because if another drive fails, you'll be missing data for sure.
But what if it's one of the data drives that failed? How do you get that missing data back? It's actually frighteningly easy. XOR together the remaining data and the checksum. What do you get for the answer? I know, right?
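Continuing the sketch from above (same made-up numbers), here's the recovery step: XOR together whatever survived, including the parity, and the missing value falls out.

```python
from functools import reduce
from operator import xor

stripe = [23, 7, 142, 65]
parity = reduce(xor, stripe)              # 223, as before

# Pretend the third data drive (the one holding 142) just died.
surviving = [23, 7, 65]

# XOR the surviving data together with the parity checksum...
recovered = reduce(xor, surviving + [parity])
print(recovered)                          # ...and out pops 142
```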
Obviously the array doesn't run as efficiently with a drive missing, because it has to do extra computation on every read to reconstruct the missing piece. However, the data is still intact and accessible, and once you replace the failed drive with a new one, its contents can be rebuilt from the remaining data and the checksums.
Extra Credit: Try the exercise with more or fewer than five drives. Also note why RAID 4/5 has a three-drive minimum.
More Bonus: Make it a RAID 5 array. Write four pieces of data on drives 1-4 and parity on drive 5. Next row, write four pieces of data on drives 2-5 and parity on drive 1. Next row, write four pieces of data on drives 3-5 and 1 (wrapping around) and parity on drive 2...
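If you'd rather see that rotation than write it out by hand, here's a small sketch that just prints which drive holds the parity for each stripe. It follows the rotation described above; real controllers use a few different rotation schemes (left/right, symmetric/asymmetric), so treat this as one possible layout rather than the canonical one.

```python
NUM_DRIVES = 5   # drives are numbered 1-5 in the text, 0-4 here

# Stripe 0 puts parity on drive 5, stripe 1 on drive 1, stripe 2 on drive 2, and so on.
for stripe in range(7):
    parity_drive = (stripe - 1) % NUM_DRIVES          # 0-based index of the parity drive
    data_drives = [d + 1 for d in range(NUM_DRIVES) if d != parity_drive]
    print(f"stripe {stripe}: parity on drive {parity_drive + 1}, data on drives {data_drives}")
```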
All this extra math adds a performance penalty: RAID 5 takes a hit on writes because the checksum has to be computed. Dedicated hardware-RAID cards have a few methods to ease this penalty, if not overcome it.
The first is dedicated hardware for computing XORs. After all, we know there's gonna be a lot of it going on, so let's make a math accelerator chip that can do XORs really quickly. This eases up on the time it takes to do the math.
The other way is a rather large buffer, often 256MB and sometimes more. When operating in Write-Back mode, data to be written to the array is stuffed into the buffer. The card signals to the OS that the data has been written and then takes the time it needs to write it out to the array, freeing up the OS.
Sound dangerous? Yes, it's very dangerous. What happens if the power goes out between the time the card tells the OS all is okay and the data gets written? You lose the data, that's what. And this is why most cards that have cache also come with a battery backup for that cache. In fact, the HP SmartArray cards will not allow you to use write-back cache if the battery is not up to snuff.
Should the power fail during a write operation, the RAID controller stops what it's doing and the cache remains intact, backed up by the battery. Once power is restored, the controller can finish the write that was pending and then go on with the tasks of booting up the machine. The chances of losing that data or having it corrupted are greatly reduced.
So why did RAID 5 flourish while RAID 4 slipped away into the recesses of computer history? As I recall, the main reason is that whichever drive holds the parity data takes a lot of abuse. Every time you change data, new parity has to be written. When you're writing out new contiguous data, it isn't so apparent: drives 1-4 get written and then drive 5 gets the checksum, so they all get tapped roughly equally.
Think about modifications though. If you modify a small piece of data that's entirely on drive 2, drive 5 still has to get hit to update the parity data. Modify something on drives 3 and 4? Drive 5 takes a hit as well. Anything you change on the array, drive 5 is gonna do a write operation. Because the parity is distributed across all drives in RAID 5, no single drive gets punished when doing write operations.
Then there's another problem: what if a block and the XOR parity don't match? There's no checksum or data error from the drive, so which block is bad? You don't know. ZFS checksums both the parity and the data blocks, so it can tell which one is bad. It can also internally use an SSD as a write-back cache. It's all very interesting.