Awkward moment of the day: you're attending a geek conference of some sort, and you walk up to a group of nerds. The group stops talking as the nearest nerd turns to you and asks, "What do you think about [SOMETHING YOU HAVE NO IDEA ABOUT]?" You freeze... just a little bit too long, and the group collectively snorts its disapproval. You're now "that guy who doesn't know anything."
Of course, you probably know a lot (just not that one topic), but no nerd wants to admit that he doesn't know something. Knowledge is our most important characteristic, so it's time to fill in some gaps. Today's topic: storage and memory (aka RAM and hard drives)! This guide should be a crash course in all the major points you should know about this topic.
=== MEMORY / RAM ===
Sometimes there are so many little details to know about things like RAM that it starts becoming a fuzzy idea. You don't know exactly why MORE RAM is better or how it makes your computer/server run faster, it just does. But you don't need to know all the details as long as you know the following basics. Let's start with this:
If someone asked you the lyrics to your favorite song, what would be faster - repeating them from memory or looking them up?
Most people, barring a disability, would say it's faster to repeat it from memory. Even though you can use Google to look up lyrics pretty quickly, it's still far slower than just knowing them. This is the EXACT same scenario with computers.
RAM is basically just a REALLY REALLY REALLY fast hard drive. The main difference (besides speed) is that it doesn't store any data permanently.
So how is RAM actually used? Well, since it doesn't store anything permanently, data is always being added to and removed from the RAM (whatever the software thinks is the most appropriate use of the RAM). When you're running software (like a server), usually the whole program is loaded into RAM so that all the instructions (do this when the mouse is clicked, do that when a key is pressed) are readily available.
Once a program is loaded into memory, it often will load additional files (anything that it needs to access frequently) into RAM, too. So naturally, if you have more RAM, then you can keep more stuff in it at the same time. If you don't have very much RAM, then whenever another program needs to use some RAM, one of two things happens:
1. Some of the data in the RAM gets removed to make room for the new program/data.
2. Virtual memory / a swap file is used.
More about virtual memory in just a moment. First, a quick note about caching and RAM. CPUs come with different levels of cache, usually L1 (Level 1) and L2 (Level 2) cache. The caches are like small pockets of memory that hold some recently-used data. When the CPU needs to go get some data from the RAM, it first checks the L1 cache to see if it has the data, and tries to get the data from there first. If L1 doesn't have the data, it then checks L2, and when all else fails, it checks the RAM. Each point is faster than the next, so getting data from L1 cache is the fastest, L2 is slightly slower, and RAM is slower still. Keep in mind this is all RELATIVELY speaking - all three of those are extremely fast compared to hard drive data access.
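If it helps to see that lookup order spelled out, here's a rough sketch in Python. The cost numbers are made-up placeholders (not real timings), and the `lookup` function and the three dictionaries are hypothetical stand-ins for the hardware.

```python
# Toy model of the lookup order: check L1, then L2, then RAM.
# The costs are illustrative placeholders, not measurements.
LEVELS = [
    ("L1", 1),
    ("L2", 4),
    ("RAM", 100),
]

def lookup(address, l1, l2, ram):
    """Return (value, level_name, total_cost) for the first level holding address."""
    stores = {"L1": l1, "L2": l2, "RAM": ram}
    cost = 0
    for name, latency in LEVELS:
        cost += latency  # every level we have to check adds its cost
        if address in stores[name]:
            return stores[name][address], name, cost
    raise KeyError(address)

# RAM holds everything; each cache holds a smaller subset.
ram = {0x10: "a", 0x20: "b", 0x30: "c"}
l2 = {0x10: "a", 0x20: "b"}
l1 = {0x10: "a"}

print(lookup(0x10, l1, l2, ram))  # found right away in L1
print(lookup(0x30, l1, l2, ram))  # falls all the way through to RAM
```

The point is only the ordering: the further down the chain the data lives, the more you pay to get it.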
=== VIRTUAL MEMORY / SWAP FILES ===
When a program has been designed in such a way that it REQUIRES itself to be loaded into RAM, but there isn't enough RAM, then the operating system can either complain that there's not enough RAM (and not load the program), or it can use virtual memory. The obvious question here is: what IS virtual memory?
Since RAM and hard drives are both simply places to store data, a computer's operating system can take some free space from your hard drive and PRETEND that it's actually RAM. This "fake" RAM is called virtual memory. To a software program, it LOOKS just like real RAM, and the software will use it, even though it's a lot slower (because it's actually the slow hard drive, not the super-fast RAM).
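Here's a toy sketch of that "pretend it's RAM" idea. It is nowhere near how a real operating system manages memory; `ToyMemory`, its slot count, and the cost numbers are all invented for illustration.

```python
class ToyMemory:
    """Toy model: a fixed number of RAM slots, with overflow going to 'swap'."""

    RAM_COST = 1          # hypothetical cost units, not real timings
    SWAP_COST = 100_000   # the "fake RAM" on disk is far slower

    def __init__(self, ram_slots):
        self.ram_slots = ram_slots
        self.ram = {}
        self.swap = {}

    def store(self, key, value):
        if key in self.swap:
            self.swap[key] = value          # already living on disk
        elif key in self.ram or len(self.ram) < self.ram_slots:
            self.ram[key] = value           # real RAM while there is room
        else:
            self.swap[key] = value          # out of RAM: pretend disk is RAM

    def load(self, key):
        """Return (value, cost). The caller uses both kinds of memory the same way."""
        if key in self.ram:
            return self.ram[key], self.RAM_COST
        return self.swap[key], self.SWAP_COST

mem = ToyMemory(ram_slots=2)
for number, name in enumerate(["a", "b", "c"]):
    mem.store(name, number)

print(mem.load("a"))  # served from RAM
print(mem.load("c"))  # served from swap, at a much higher cost
```

Notice that the program calling `load` never has to know which kind of memory it got. It just gets slower.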
Virtual memory is the CONCEPT, while a swap file is the actual implementation. It's the equivalent of having an IDEA about something that has 4 wheels that you can drive around versus an actual car that you own. Swap files are implemented in different ways, depending on the operating system, and there are always some small differences here and there. For example, Linux has swap partitions, while newer versions of Windows use "pagefiles." It's all the same thing, really. Hard disk space that the operating system pretends is RAM.
Now that you know what virtual memory is, you should know that virtual memory is Bad (note the capital B - that's how bad it is). It's still just storage on a hard disk, so it's SLOW, just like a hard disk. When you're out of RAM and software is forced to use virtual memory, then that program will run SLOW.
This is why it's better to add more REAL memory - so your programs don't have to run in virtual memory!
It's worth mentioning why virtual memory even exists, if it's so slow. While there were a variety of reasons supporting its development, one of the biggest reasons is cost. Virtual memory uses hard disk space, which is far cheaper and more plentiful than real RAM. In the early days of virtual memory, many applications were getting to the point of exceeding the amount of RAM available on the computer, and most computers weren't built to handle large quantities of RAM. Virtual memory solved that problem and was a crutch for many older computers until newer models came along with greater capabilities and capacities.
Even computers today still make use of virtual memory (although they rely on it far less than they used to). Operating systems can consume large quantities of RAM, sometimes leaving very little left for applications. Virtual memory can allow those applications to run and to scale as necessary (e.g. most browsers consume more and more memory as you continue to browse through different pages, or even by staying on the same page sometimes).
=== HARD DISKS ARE SLOW... KIND OF. ===
I've said multiple times already that hard drives are slow. Some of you might be thinking, "But gr8gonzo, I have a 15k RPM hard drive with a ton of cache and [INSERT SUPER SPECIAL FEATURES HERE]." It doesn't really matter. They're always making faster and faster hard drives with more features, but they're never AS fast as RAM.
Regardless of whether the storage device is RAM or a hard disk or whatever, the basic two elements of speed for storage devices are access (or seek) time and transfer rates. Seek/access time is basically how long it takes for the device to find the data you want. Once it finds the data, the transfer rate is how fast it can read the data.
Think about how you read a book. If you're in a classroom and the professor tells you to read chapter 20, then seek time is how long it takes for you to flip over to chapter 20. The transfer rate is how fast you can read the entire chapter.
If a storage device has a fast transfer rate, but a slow seek time (or high latency, which I'll explain in a moment), then trying to read 100,000 small files is going to be a really slow process. That's 100,000 slow "seeks" and since they're small files, a fast transfer rate doesn't really make a big difference.
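To put rough numbers on that, here's a back-of-the-envelope sketch. The 4 KB file size, 4 ms seek time, and 100 MB/s transfer rate are assumptions picked for illustration, and the model (one full seek per file, no caching) is deliberately simplistic.

```python
def read_time_seconds(n_files, file_size_bytes, seek_ms, transfer_mb_per_s):
    """Estimate (seek_total, transfer_total) in seconds: one seek per file."""
    seek_total = n_files * seek_ms / 1000.0
    transfer_total = n_files * file_size_bytes / (transfer_mb_per_s * 1_000_000)
    return seek_total, transfer_total

# Assumed numbers: 100,000 files of 4 KB each, 4 ms per seek, 100 MB/s.
seek_s, transfer_s = read_time_seconds(
    100_000, 4_000, seek_ms=4, transfer_mb_per_s=100
)
print(f"time spent seeking:      {seek_s:.0f} s")
print(f"time spent transferring: {transfer_s:.0f} s")
```

With those assumed numbers, nearly all of the time goes to seeking, so doubling the transfer rate would barely change the total.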
So let's say you have a really fast hard drive. It has a 4ms seek time, and you have 1,000 people trying to access the same file all at the same time (just for illustration purposes). The drive could theoretically take UP TO 4 seconds (1,000 x 4ms = 4000 ms) for all of the 1,000 people to be given that file. (This is ignoring stuff like cache.)
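The same arithmetic, sketched out so you can see what each person in line experiences. It assumes the drive serves one request at a time with a full 4 ms seek for each, and (like the example above) ignores cache.

```python
SEEK_MS = 4
REQUESTS = 1_000

# Request number i (counting from 1) is done after i seeks.
finish_times_ms = [SEEK_MS * (i + 1) for i in range(REQUESTS)]

worst_ms = finish_times_ms[-1]
average_ms = sum(finish_times_ms) / REQUESTS

print(f"last person waits {worst_ms} ms")
print(f"average wait is {average_ms} ms")
```

So the unlucky last person waits the full 4 seconds, and the average person waits about half that.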
Another factor is latency. Going back to the analogy of reading a book, let's say you've arrived at chapter 20 in the book and you've read it. Now the teacher is asking you to read the first page of chapter 20 again. Assuming you're still on the correct page (eliminating seek time), your eyes still need to go from the bottom of the page to the top of the page. That short delay is latency. On a hard drive, latency is the time it takes for the hard drive to get to the point where it can begin re-reading the same data, and again, it's measured in milliseconds (usually).
That may not sound slow to you, but operating systems are accessing the hard disk all the time and reading and writing files. Most servers have multiple users accessing the server simultaneously and doing different tasks that often require the hard disk to read and write multiple files. When you add up all the activity, you can start to see how seek times and latency can really influence performance. And that's not including things like file fragmentation, or file allocation issues, which are subjects for a different article.
Hard drive caches can help by speeding up access to frequently-accessed files, but that's about as good as it gets.
Now contrast a seek time of 4 milliseconds with a seek time of 5 NANOseconds (or faster)! That's how fast memory can be (it varies, but almost all memory has seek times measured in nanoseconds, usually between 4 and 30 nanoseconds). RAM can also transfer data MUCH faster than hard drives, so it is faster than the hard drive in every way.
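Milliseconds versus nanoseconds is hard to picture, so here's the ratio worked out using the 4 ms and 5 ns figures above. The "stretch it to human scale" part is just an illustration I've added.

```python
DISK_SEEK_NS = 4 * 1_000_000   # 4 milliseconds expressed in nanoseconds
RAM_ACCESS_NS = 5              # 5 nanoseconds

ratio = DISK_SEEK_NS // RAM_ACCESS_NS
print(f"the disk seek is {ratio:,} times slower")

# Human scale: if one RAM access took 1 second, one disk seek would take...
days = ratio / 86_400          # seconds in a day
print(f"...about {days:.2f} days")
```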
One last note on hard drive speeds - hard drives are frequently measured in RPMs (revolutions per minute), most commonly 5400, 7200, 10k, or 15k. This is just referring to how fast the disk spins around inside. The faster it spins, the faster data can be read. Without getting into too much detail, RPMs are a little like speeding in your car. Imagine you're trying to drive to a location that is 30 miles away. The faster you go, the quicker you'll get there, but at a certain point, it's not worth going much faster. For example:
1 mile per hour = 30 hours to get there
15 mph = 2 hours to get there (you've saved 28 hours)
30 mph = 1 hour to get there (you've saved 1 hour)
120 mph = 15 minutes to get there (you've saved 45 minutes)
240 mph = 7.5 minutes to get there (assuming you don't crash)
That said, there's a huge difference between 5400 RPM drives and 7200 RPM drives, but not THAT big of a difference between 10k and 15k RPM drives. Still, if you can afford to pay for extra speed, sometimes every ounce of speed can count.
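The driving table above can be reproduced with a few lines of Python, which makes the diminishing returns easy to see. The function names here are just for illustration.

```python
DISTANCE_MILES = 30
SPEEDS_MPH = [1, 15, 30, 120, 240]

def travel_hours(speed_mph):
    """Hours needed to cover the fixed distance at a given speed."""
    return DISTANCE_MILES / speed_mph

def savings_minutes(speeds):
    """Minutes saved by each step up in speed, compared to the previous speed."""
    hours = [travel_hours(s) for s in speeds]
    return [(slow - fast) * 60 for slow, fast in zip(hours, hours[1:])]

for speed in SPEEDS_MPH:
    print(f"{speed:>3} mph = {travel_hours(speed) * 60:g} minutes")

print(savings_minutes(SPEEDS_MPH))
```

Each jump in speed saves less time than the one before it, even though the jumps themselves keep getting bigger.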
=== CACHING ===
In the storage world, "caching" just means that the storage device is storing data in a special location that can be accessed faster than normal. For example, hard drives have caches so they can store frequently-used files in there, and the bigger the cache, the more files can be stored for faster access.
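Here's a minimal sketch of a cache sitting in front of slow storage. It evicts the least-recently-used item when it's full, which is my assumption for the example and not necessarily what any particular hard drive does. `LRUCache` and the fake `disk` dictionary are hypothetical.

```python
from collections import OrderedDict

class LRUCache:
    """Small cache in front of a slow read function (least-recently-used eviction)."""

    def __init__(self, capacity, slow_read):
        self.capacity = capacity
        self.slow_read = slow_read
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)      # mark as recently used
            return self.entries[key]
        self.misses += 1
        value = self.slow_read(key)            # the slow trip to the disk
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # drop the least recently used
        return value

disk = {"a.txt": b"A", "b.txt": b"B", "c.txt": b"C"}  # stand-in for a slow drive
cache = LRUCache(capacity=2, slow_read=disk.__getitem__)

for name in ["a.txt", "a.txt", "b.txt", "c.txt", "a.txt"]:
    cache.read(name)

print(cache.hits, cache.misses)
```

A bigger `capacity` means fewer slow trips to the disk, which is the whole reason a bigger cache helps.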
In recent versions of Windows (starting with Vista), Microsoft has introduced a technique of using flash memory (like a USB thumb drive) as an extra disk cache to help speed up disk activity (flash memory is often faster than normal hard drives). It's called ReadyBoost.
=== SOLID STATE DRIVES ===
The newest popular hard drive technology is called solid state. Basically, solid state drives (or SSD for short) are built in a way that is sort of similar to RAM, so it's like having a hard drive that is almost as fast as RAM, but can store data permanently like a normal hard drive.
The speed of SSD is incredible enough that it can DRAMATICALLY boost the performance of nearly ANYTHING out there today. People have reported performance increases of hundreds (sometimes thousands) of percent. Of course, it's also very expensive technology, and it has a shorter lifespan than most regular hard drives. However, these drives are getting cheaper and getting longer lifespans every day, so it won't be long before "server-class" SSDs are available for use in servers everywhere. It is an exciting prospect for most system admins.
=== IDE, SAS, SCSI, and SATA ===
SCSI (pronounced "scuzzy") stands for Small Computer System Interface. I had to look that up because I never remember (because nobody ever calls it by its full name). However, SCSI -is- an important concept. There are a lot of details to it, but it's a little bit like USB. SCSI itself is just a way of connecting devices, and those devices are USUALLY SCSI hard drives. A SCSI hard drive is usually pretty fast, but it requires a SCSI controller card to be installed in the computer, which can sometimes be a pain to set up (it's gotten easier over time). For the purposes of this article, just think of a SCSI hard drive as an old-but-still-fast hard drive. (I should note that I'm referring to the whole setup of the drive and controller card, not just the drive.)
SAS (or Serial-Attached SCSI) is the newest form of SCSI. It's as fast as or faster than old SCSI drive technology, and still requires a controller card. These newer SAS controller cards have some speed advantages, and (although it doesn't usually make sense to do this) you can connect newer SATA drives to a SAS controller. At this point, SAS is arguably the best storage technology for servers.
SATA (Serial ATA) is what most normal PCs, desktops, laptops, etc... use for storage. Usually they have 1 or 2 SATA drives and the SATA controller is usually built into the computer's motherboard. SATA is actually very similar to SAS in a lot of ways. Sometimes I simply think of SATA as a lower-grade version of SAS. It has similar transfer speeds, but usually SAS drives are built in a way that makes them more suitable for servers (more durable in terms of being able to receive more activity from more people without crashing).
I purposefully did not disclose bandwidth in the above descriptions because it can be misleading. Some people just think, "SATA can transfer at 3Gb/s, and so can SAS, so they have equivalent speeds and are faster than SCSI in every way." That's not quite accurate - controller cards are the brains behind how hard drives work together, and they are largely responsible for squeezing all the performance out of hard drives. That said, if you're looking to build a desktop PC, use the latest SATA drives. If you're looking to build a high-performance server, use SAS and don't skimp on the controller card (you'll end up with a fast system that is very reliable).
IDE (or Integrated Drive Electronics) was one of the first mainstream technologies for connecting a hard drive to your computer. The primary difference between IDE and other technologies mentioned above is that the drive controller (the part of the system that acts like the traffic cop for your hard drive) is actually embedded in the hard drive itself, while the drive controllers for SAS, SATA, and SCSI are all usually on the motherboard or on some add-on card in the system. As a result, trying to create an array with IDE disks was a bit of a pain to do (imagine five or six traffic cops each trying to control the same intersection in their own way instead of just one traffic cop - it just gets messy).
A bit after IDE was created, a new variant called Enhanced IDE (or EIDE) came out that was very similar but could handle more types of devices, offered more data bandwidth, and supported larger data capacities. EIDE and IDE were both dependent on an underlying technology called Parallel ATA (or PATA for short), so they both still had some similar limitations (e.g. cables couldn't be more than a foot and a half in length). It's pretty easy to see why the other technologies have surpassed IDE/EIDE in popularity, but you still see some IDE/EIDE disks every once in a while (usually as external hard disks or disks meant for upgrading older computers).
=== RAID ===
RAID stands for "Redundant Array of Inexpensive Disks." In English, it's a way for you to set up 2 or more hard drives (of any type - SCSI/SAS/SATA or whatever) so that they act a little differently. When you combine the hard drives using RAID technology, you create a "RAID array", which is nothing more than a term to describe that combination of drives. Now, there are different ways of combining the drives, and each way has different advantages and disadvantages. These ways are called RAID "levels" - here's a quick list of common RAID levels and what their advantages/disadvantages are:
RAID 0: "Striping" just means that the storage from all the drives is combined together so that the operating system THINKS it's all one big hard drive (it has no idea that it's really multiple hard drives). You need at least 2 drives for RAID 0.
Advantages: Data is spread out over the different REAL hard drives, so that each hard drive can work at the same time. When you have multiple hard drives working at the same time, data gets written and read faster.
Disadvantages: Know the old saying, "a chain is only as strong as its weakest link"? An array that is set up to use RAID 0 is like a chain. If any of the hard drives fail, you lose a link that keeps the chain together, and so you'll lose all your data (even if the other drives haven't failed).
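Here's a rough sketch of RAID 0 striping: the data is chopped into chunks that are dealt out to the drives like cards. Python lists stand in for the drives, and the chunk size and function names are made up for the example. Real controllers do this at the block level.

```python
def stripe(data, n_drives, chunk_size):
    """Deal fixed-size chunks of data out to the drives, round-robin."""
    drives = [[] for _ in range(n_drives)]
    for offset in range(0, len(data), chunk_size):
        chunk_index = offset // chunk_size
        drives[chunk_index % n_drives].append(data[offset:offset + chunk_size])
    return drives

def unstripe(drives):
    """Reassemble the data by reading one chunk from each drive in turn."""
    pieces = []
    for row in range(max(len(drive) for drive in drives)):
        for drive in drives:
            if row < len(drive):
                pieces.append(drive[row])
    return b"".join(pieces)

drives = stripe(b"ABCDEFGHIJ", n_drives=2, chunk_size=2)
print(drives)            # each drive only holds part of the data
print(unstripe(drives))  # all drives together give the original back
```

Look at what each drive holds and the weak-link problem is obvious: neither drive has a complete copy of anything, so losing one leaves holes all through your data.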
RAID 1: "Mirroring" is really simple. It's just a safety net. Anytime any data is written to one hard drive, the exact same data is written to another hard drive, too. Each drive is an identical "mirror" of another drive. You need at least 2 drives for RAID 1.
Advantages: It's very safe. If you lose one drive, you don't lose any data - everything just keeps working. In the meantime, you simply replace the failed hard drive.
Disadvantages: It's the slowest of all the RAID levels, because the same data has to be written multiple times. (Of course, none of the RAID levels are really slow, but RAID 1 is the least fast one.) Your total space is cut in half, since half of the drives in the array are only used as backups.
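A tiny sketch of RAID 1 mirroring, with dictionaries standing in for drives. The `Mirror` class and its methods are hypothetical; the idea is just that every write goes to every working drive, so any survivor can answer a read.

```python
class Mirror:
    """Toy RAID 1: every write is copied to every working drive."""

    def __init__(self, n_drives=2):
        self.drives = [dict() for _ in range(n_drives)]
        self.failed = set()

    def write(self, block, data):
        for index, drive in enumerate(self.drives):
            if index not in self.failed:
                drive[block] = data        # same data, written once per drive

    def fail(self, index):
        """Simulate a dead drive: its contents are gone."""
        self.failed.add(index)
        self.drives[index].clear()

    def read(self, block):
        for index, drive in enumerate(self.drives):
            if index not in self.failed:
                return drive[block]        # any surviving mirror will do
        raise RuntimeError("all mirrors have failed")

array = Mirror()
array.write(0, b"important data")
array.fail(0)
print(array.read(0))  # still readable from the second drive
```

You can also see both downsides in `write`: the same data is written once per drive, and the second drive never adds any extra space.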
RAID 5: "Striping with Parity" is a little like RAID 0 in that it is like a chain. The main difference is that each drive contains a chunk of information (called "parity") about what is on the other drives. This way, if one drive fails, the other drives know enough to keep working until you replace the failed drive. It's like having a layer of duct tape around your chain. You need at least 3 drives for RAID 5.
Advantages: You get SOME of the speed of RAID 0, since the data is spread out. However, since the data DOES have to be written multiple times (to different drives in order to be safe), it's not AS fast as 0. Another advantage is that you can lose one hard drive without losing all your data (just replace the failed hard drive).
Disadvantages: There's some disagreement on how safe RAID 5 is. If two drives fail at the same time, you lose everything.
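Parity sounds like magic until you see it. RAID 5 parity is based on XOR, and the sketch below shows one stripe: three data blocks plus one parity block. The block contents and the `xor_blocks` helper are made up for illustration, and a real array also rotates which drive holds the parity, which this sketch skips.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks (one per drive) and the parity block for the stripe.
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data_blocks)

# The drive holding b"BBBB" dies. XOR the survivors with the parity block
# and the missing block comes back.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt)
```

This also shows why two failures are fatal: with two blocks missing from the same stripe, one parity block isn't enough information to recover both.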
RAID 6: RAID 6 is very much like RAID 5, but has the additional benefit of allowing you to have up to two drives fail without losing your data. On the downside, it needs one more drive, so you need at least 4 drives for RAID 6.
Advantages: You get the same speed advantages of RAID 5, you still retain most of your hard drive space, and you can lose two hard drives without losing all your data (just replace the two failed hard drives).
Disadvantages: Support for RAID 6 isn't as widespread as RAID 5, so it may not be a choice for some systems. If you can spare the extra drive and have the option of using RAID 6 over RAID 5, do it.
RAID 10: "Striping AND Mirroring" is the best of RAID 1 and RAID 0. RAID 10 (also known as RAID 1+0) is like having several RAID 1 mirrored pairs, with the data striped across those pairs like RAID 0. (The reverse layout, two RAID 0 arrays that mirror each other, is usually called RAID 0+1.) You need at least 4 drives for RAID 10.
Advantages: You get almost all of the speed of RAID 0 (since you have multiple hard drives working together to read/write data), and if a drive fails, the other drive in its mirrored pair keeps working automatically. In the meantime you replace the failed drive, and all is fine again.
Disadvantages: You need at least 4 drives, and you only get to use half of the capacity of all the drives in the system (because of mirroring), so if you want a big RAID 10 array, you'll need lots of hard drives.
There are other RAID levels, but the above are the most common ones.
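To compare the levels side by side, here's a small calculator for usable space. It assumes every drive is the same size and follows the simplified rules from this article (for example, RAID 1 and RAID 10 give you half the total). The function name and the 1000 GB drive size are made up for the example.

```python
MIN_DRIVES = {0: 2, 1: 2, 5: 3, 6: 4, 10: 4}

def usable_capacity(level, n_drives, drive_size_gb):
    """Usable space in GB for identical drives, using the simplified rules above."""
    if level not in MIN_DRIVES:
        raise ValueError(f"unsupported RAID level: {level}")
    if n_drives < MIN_DRIVES[level]:
        raise ValueError(f"RAID {level} needs at least {MIN_DRIVES[level]} drives")
    total = n_drives * drive_size_gb
    if level == 0:
        return total                      # no redundancy, all space usable
    if level in (1, 10):
        return total // 2                 # half the space goes to mirrors
    if level == 5:
        return total - drive_size_gb      # one drive's worth of parity
    return total - 2 * drive_size_gb      # RAID 6: two drives' worth of parity

for level in (0, 1, 5, 6, 10):
    print(f"RAID {level:>2}: {usable_capacity(level, 4, 1000)} GB usable")
```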
RAID controllers are responsible for knowing what levels of RAID you can use, and there are two types of RAID controllers: hardware RAID controllers and software RAID controllers. Put simply, hardware RAID controllers are more expensive, but are FAR FAR FAR better in terms of performance. Software RAID is done by the operating system, and has to be processed like anything else in the system, so that processing slows down the RAID a lot. Plus, the operating system has to take care of all the other things going on in the system, too, so that slows down software RAID even further.
Hardware RAID controllers are dedicated to the job, however, so that's all they're thinking about doing all day. Hardware RAID controllers also have dedicated cache and are optimized to know the best ways of running a RAID array. I don't believe I've ever seen a situation where I'd recommend using software RAID, but it's worth knowing about.
That's it, kids! Remember that this is a crash course and does not cover a lot of the nitty-gritty details. IT people are creative and can figure out ways of making everything perform better, so there are always exceptions, but you should at least have a good grasp on the concepts now. Until next time...
Copyright © 2009 - Jonathan Hilgeman. All Rights Reserved.