by Wayne M. Krakau - Chicago Computer Guide, November 1994 - NewsWare, May 1996

This article is a continuation of the Network Design theme of my previous column. Last month I covered human resources, software, and cable plants. This month I will start with reliability issues.

The ability to keep on running after a failure is called fault tolerance. There are a near-infinite number of shades of fault tolerance that you can use, depending on how risk averse you and your organization are. Risk aversion is a term commonly used in the insurance industry and in stock trading to describe what Clint Eastwood characterized a bit more graphically on film. While pointing a gun at a criminal's head, Clint (portraying "Dirty Harry") said, "Do you feel lucky?" That's what it boils down to. How lucky do you feel and what is the cost of guessing wrong? (That sound you hear is the distinctive click of the hammer coming down on an empty chamber.)

The ultimate in safety is the use of redundancy. For instance, using twin servers and Novell's SFT (that's System Fault Tolerance) Level III allows one server to take over from the other automatically and transparently during a system failure or even during routine maintenance and upgrades. There are several less expensive aftermarket options available that will allow this trick with a somewhat lower level of transparency. With these alternate products, you have to physically intervene to get up and running on the redundant server.

Many of my clients implement a Secondary File Server/Workstation. It runs day-to-day as an unassuming workstation, but, in times of crisis, can be rebooted as - a file server! (You were expecting Superman?) Most of these secondaries have an extra hard disk with the same capacity as the one in the main file server. Some have just enough capacity to run critical applications, though that makes them much more difficult to maintain. Either way, one just needs to apply the previous night's backup to be up and running in case of a main file server crash. For an even quicker response, apply the backup every morning as part of the daily routine. In an emergency, just shut down the malfunctioning main server and reboot the secondary as a file server. You will lose only about a half day of work on average, with minimal downtime. Since the computer is normally used as a workstation, the cost of having this standby capability is quite small. Just add in the cost of an extra hard disk, maybe a little extra memory, possibly an upgrade to the next faster CPU or underlying architecture, plus a little extra labor to set it up. As I said, this option has been used successfully by many of my clients for the past several years.
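For those who like to check the arithmetic, here is a minimal sketch of where the "half day" figure comes from. It assumes an 8-hour workday, a backup restored onto the secondary every morning, and a crash that is equally likely at any moment of the day. Those assumptions and the function name are mine, for illustration only.

```python
def average_hours_lost(workday_hours=8, steps=1000):
    """Average work lost when a crash is equally likely at any time of day.

    Assumes the secondary holds data as of the start of the workday, so a
    crash t hours into the day loses t hours of work.
    """
    # Evenly spaced sample crash times (midpoints) across the workday
    crash_times = [(i + 0.5) * workday_hours / steps for i in range(steps)]
    return sum(crash_times) / steps


print(average_hours_lost())   # roughly half of the 8-hour day
```

A crash first thing in the morning costs almost nothing and a crash at closing time costs the whole day, so the average lands in the middle.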

Another way to increase reliability is to protect the part of the system most likely to fail - the hard disk. Mirroring means using a redundant second hard disk, controlled by the CPU of the file server. If one disk fails, the other takes over.
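As a rough illustration of the idea, and not of how any particular network operating system implements it, here is a toy model of a mirrored pair. The class and method names are hypothetical.

```python
class MirroredPair:
    """Toy model of two mirrored disks; names are illustrative only."""

    def __init__(self):
        self.disks = [{}, {}]            # each dict maps block number -> data
        self.failed = [False, False]

    def write(self, block, data):
        # Every write goes to each disk that is still working.
        for i, disk in enumerate(self.disks):
            if not self.failed[i]:
                disk[block] = data

    def read(self, block):
        # Read from the first working disk; the survivor takes over on failure.
        for i, disk in enumerate(self.disks):
            if not self.failed[i]:
                return disk[block]
        raise IOError("both disks have failed")

    def fail(self, i):
        self.failed[i] = True


pair = MirroredPair()
pair.write(7, b"payroll")
pair.fail(0)                  # simulate a "Disk 0 failed" event
print(pair.read(7))           # the data is still readable from disk 1
```

The users never notice the failure, which is exactly the point, and exactly why somebody needs to be watching the file server console.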

A few years ago, I received a phone call for help with a routine WordPerfect problem from a client. Just as we were both hanging up after resolving the problem, the client suddenly yelled my name. Luckily, I heard him. He explained that "By the way" (like it was no big thing), he had noticed a "FUNNY" message on the file server screen. After my pulse dropped back below 180 and I regained the ability to speak coherently, I calmly (or at least as calmly as possible) noted that, while using the phrases "funny message" and "file server" in the same sentence was linguistically and grammatically correct, it was, perhaps, not an acceptable combination for use in polite society.

I encouraged him to read the message to me. The message was "Disk 0 failed"! I asked him how long the message had been there. He told me it appeared three days prior to this conversation! After another pause, to regain my composure and carefully choose my words, I told him that he owed me a dinner. Why? Because I had used every resource short of physical violence to convince him to buy into a redundant disk system (in this case mirrored) and he had given me an incredible amount of grief on the issue every step of the way! The mirrored disk had saved his company the many thousands of dollars per hour that it would have cost to have the LAN down. I replaced the bad disk on the next Saturday, while encouraging the early reporting of any future "FUNNY" messages. (I never got the dinner, but I did get a reasonably good lunch out of the deal.)

Duplexing is the next step up from mirroring. It means using two intelligent controllers with two hard disks, so that the controllers relieve the file server of the extra burden of tracking the redundant disk operations. Not only does this remove the extra processing overhead - it actually speeds up the system due to the ability to do split reads, in which each read request goes to the disk that is less busy. This performance boost is in addition, of course, to providing redundancy for the controllers as well as hard disks.
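Here is a minimal sketch of the split-read idea, assuming a simple "send it to the less busy disk" rule. The function name and the cost units are hypothetical, and the model never drains the queues, so it only shows how requests get divided, not real timing.

```python
def route_reads(read_costs):
    """Assign each read to whichever duplexed disk has less queued work."""
    busy = [0, 0]             # outstanding work per disk, arbitrary units
    assignment = []
    for cost in read_costs:
        # Ties go to disk 0; otherwise pick the less busy disk.
        disk = 0 if busy[0] <= busy[1] else 1
        busy[disk] += cost
        assignment.append(disk)
    return assignment, busy


assignment, busy = route_reads([5, 1, 1, 1, 4])
print(assignment)   # which disk served each read
print(busy)         # total work each disk ended up with
```

With one big read tying up the first disk, the small reads that follow all go to the second disk instead of waiting in line.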

The best way to protect the disk channel while getting an incredible boost in performance is to use a RAID system instead of a SLED system. SLED means Single Large Expensive Disk while RAID means Redundant Array of Inexpensive Disks. Note that mirroring is really a simple form of RAID.

RAID devices are divided into categories called levels as follows:

Level 0 - Data striping without parity. That means data is spread out over multiple disks for speed.

Level 1 - Mirrored disk array. For every data disk there is a redundant twin. Also includes duplexing, the use of dual intelligent controllers for additional speed and reliability.

Level 2 - Bit-interleaves data across the array, with error-correcting (Hamming) codes stored on dedicated drives. Reads and writes use whole sectors only.

Level 3 - Parallel disk array. Disk striping with dedicated parity drives. Drives are synchronized for efficiency in large parallel data transfers.

Level 4 - Independent disk array. Reads and writes on independent drives in the array with dedicated parity drive using sector-level interleave.

Level 5 - Independent disk array. Reads and writes data and parity across all disks with no dedicated parity drive. Uses parallel transfers. Multiple controllers optionally used for higher speed. Usually gives up only the equivalent of one drive's capacity to redundancy. This system is the most popular these days due to both speed and cost effectiveness.
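The parity used in Levels 3 through 5 is conventionally computed with an exclusive-OR (XOR) across the data blocks in a stripe. The sketch below shows that one idea only, for a single stripe. It does not model how parity is rotated among the drives in Level 5, and the helper name and sample data are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks, each living on a different drive
data = [b"ACCT", b"PAYR", b"MAIL"]
parity = xor_blocks(data)            # stored on a fourth drive

# Simulate losing the drive that held the second block, then rebuild it
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])            # the lost block is recovered
```

XOR-ing the surviving blocks with the parity block regenerates whatever was on the dead drive, which is why the array can keep running with one drive missing.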

Due to recent speed and efficiency improvements in RAID Level 5, I now recommend it to most of my clients. In general, using RAID 5 means that you only lose the capacity of one drive within the array. In an array of seven disks, for example, you would lose only one-seventh of the total array capacity to redundancy (though the actual redundant data is spread across all of the drives - there is no single redundant drive). For an added boost in performance, you can use multiple controllers with a single array, switching from software-based (using the file server's CPU) to hardware-based RAID (using controllers with their own processors) when maximum performance is required.
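The capacity arithmetic can be sketched as follows, assuming every drive in the array is the same size. The function names and the 1 GB drive size are hypothetical, chosen only to make the seven-disk example concrete.

```python
def raid5_usable_gb(num_drives, drive_size_gb):
    """Usable capacity of a RAID 5 array of identical drives."""
    if num_drives < 3:
        raise ValueError("RAID 5 needs at least three drives")
    # One drive's worth of space holds parity, spread across all drives
    return (num_drives - 1) * drive_size_gb


def raid5_overhead(num_drives):
    """Fraction of raw capacity given up to redundancy."""
    return 1 / num_drives


print(raid5_usable_gb(7, 1))   # seven 1 GB drives leave six drives' worth usable
print(raid5_overhead(7))       # one-seventh of raw capacity goes to parity
```

Compare that with mirroring, where half of the raw capacity goes to redundancy no matter how many disks you buy.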

Using external RAID systems in conjunction with the Secondary Server/Workstation concept mentioned previously results in a very efficient disaster recovery plan. When the main server goes down, merely plug the RAID system into the back of the secondary server, reboot, and you are on line immediately with files that are only minutes old!

An important factor that can affect LAN reliability is power protection. Can I assume that everyone by now knows that you must protect a file server with a high-quality uninterruptible power supply (UPS) and connect that UPS to the file server with an intelligent communications link? Well, if you didn't know it before, consider yourself informed. It is required!

Now for a little quiz. Can you name the best electrical conductor? (Jeopardy song.) Silver! How about the second best? (Jeopardy song.) Copper! Now, what's the third best? (Jeopardy song.) Gold! Now, can you name the material, excluding fiberoptics, inside LAN cables? (Jeopardy song.) Copper! The question for the bonus round is based on that last answer. Can you guess what you call a file server protected by a high-quality UPS on a LAN where all of the workstations and printers are properly protected by high-quality surge suppressors, and there is one unprotected phone line going into a modem? (Jeopardy song - long version.) The answer is "toast". Power protection works on the weakest-link theory. Even one weak link can destroy a system.

If lightning strikes close to a phone cable near your building and induces a current in the line going into that modem, the power can easily go through the modem, into the serial port, into the motherboard, into the network interface card, jump onto the network cable, and spread out over the LAN. Lest you think I am exaggerating the danger, this happened a few weeks ago to a new client of mine. After repeated warnings of impending doom, they finally bought a UPS to protect their file server, but they refused to heed my advice about their inadequate surge suppressors. The day after they installed a new, faster file server, they called to tell me that it had died overnight. After I asked for additional clues, they mentioned that a workstation had also bitten the dust. It turned out that the building's power had problems overnight. The resulting overvoltages cooked the workstation. The file server was next in line on the LAN cable, and it got fricasseed, too. Please don't make the same mistake. Good power protection is relatively inexpensive to implement and very expensive to omit.

I will continue with the ongoing theme of LAN Design next month. Note that all contestants receive the home version of the LAN Design game as a parting gift.

© 1994, Wayne M. Krakau