LAN DESIGN - Part 2 |
|
by Wayne M. Krakau - Chicago Computer Guide,
November 1994 - NewsWare, May 1996 |
|
This article is a continuation of the Network Design
theme of my previous column. Last month I covered human resources, software, and cable
plants. This month I will start with reliability issues. |
The ability to keep on running after a failure is
called fault tolerance. There are a near-infinite number of shades of fault tolerance that
you can use, depending on how risk averse you and your organization are. Risk aversion is
a term commonly used in the insurance industry and in stock trading to describe what Clint
Eastwood characterized a bit more graphically on film. While pointing a gun at a
criminal's head, Clint (portraying "Dirty Harry") said, "Do you feel
lucky?" That's what it boils down to. How lucky do you feel, and what is the cost of
guessing wrong? (That sound you hear is the distinctive click of the hammer coming down on
an empty cylinder.) |
The ultimate in safety is the use of redundancy. For
instance, using twin servers and Novell's SFT (that's System Fault Tolerance) Level III
allows one server to take over from the other automatically and transparently during a
system failure or even during routine maintenance and upgrades. There are several less
expensive aftermarket options available that will allow this trick with a somewhat lower
level of transparency. With these alternate products, you have to physically intervene to
get up and running on the redundant server. |
Many of my clients implement a Secondary File
Server/Workstation. It runs day-to-day as an unassuming workstation, but, in times of
crisis, can be rebooted as - a file server! (You were expecting Superman?) Most of these
secondaries have an extra hard disk with the same capacity as the one in the main file
server. Some have just enough capacity to run critical applications, though that makes
them much more difficult to maintain. Either way, one just needs to apply the previous
night's backup to be up and running in case of a main file server crash. For an even
quicker response, apply the backup every morning as part of the daily routine. In an
emergency, just shut down the malfunctioning main server and reboot the secondary as a
file server. You will lose only about half a day of work on average, with minimal downtime.
Since the computer is normally used as a workstation, the cost of having this standby
capability is quite small. Just add in the cost of an extra hard disk, maybe a little
extra memory, possibly an upgrade to the next faster CPU or underlying architecture, plus
a little extra labor to set it up. As I said, this option has been used successfully by
many of my clients for the past several years. |
Another way to increase reliability is to protect
the part of the system most likely to fail - the hard disk. Mirroring means using a
redundant second hard disk, controlled by the CPU of the file server. If one disk fails,
the other takes over. |
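For readers who like to see the idea in code, here is a minimal sketch of mirroring. It is a toy model, not how any particular network operating system implements it: the class name `MirroredVolume`, its methods, and the use of dictionaries as stand-ins for disks are all assumptions made for illustration.

```python
class MirroredVolume:
    """Toy model of two mirrored disks; each disk is a dict of block -> data."""

    def __init__(self):
        self.disks = [{}, {}]          # two copies of the same data
        self.failed = [False, False]   # health flag for each disk

    def write(self, block, data):
        # Every write goes to each healthy disk so the copies stay identical.
        for i, disk in enumerate(self.disks):
            if not self.failed[i]:
                disk[block] = data

    def read(self, block):
        # Read from the first healthy disk; the survivor takes over on failure.
        for i, disk in enumerate(self.disks):
            if not self.failed[i]:
                return disk[block]
        raise IOError("both mirrored disks have failed")

    def fail(self, index):
        # Simulate a hardware failure of one disk.
        self.failed[index] = True


volume = MirroredVolume()
volume.write(7, b"payroll")
volume.fail(0)                 # simulate losing the first disk
print(volume.read(7))          # the data is still served from the second disk
```

The point of the sketch is that users never notice the failure: reads simply continue from the surviving copy until the bad disk is replaced.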
A few years ago, I received a phone call for help
with a routine WordPerfect problem from a client. Just as we were both hanging up after
resolving the problem, the client suddenly yelled my name. Luckily, I heard him. He
explained that "By the way" (like it was no big thing), he had noticed a
"FUNNY" message on the file server screen. After my pulse dropped back below 180
and I regained the ability to speak coherently, I calmly (or at least as calmly as
possible) noted that, while using the phrases "funny message" and "file
server" in the same sentence was linguistically and grammatically correct, it was,
perhaps, not an acceptable combination for use in polite society. |
I encouraged him to read the message to me. The
message was "Disk 0 failed"! I asked him how long the message had been there. He
told me it appeared three days prior to this conversation! After another pause, to regain
my composure and carefully choose my words, I told him that he owed me a dinner. Why?
Because I had used every resource short of physical violence to convince him to buy into a
redundant disk system (in this case mirrored) and he had given me an incredible amount of
grief on the issue every step of the way! The mirrored disk had saved his company the many
thousands of dollars per hour that it would have cost to have the LAN down. I replaced the
bad disk on the next Saturday, while encouraging the early reporting of any future
"FUNNY" messages. (I never got the dinner, but I did get a reasonably good lunch
out of the deal.) |
Duplexing is the next step up from mirroring. It
means using two intelligent controllers with two hard disks, so that the controllers
relieve the file server of the extra burden of tracking the redundant disk operations. Not
only does this remove the extra processing overhead - it actually speeds up the system due
to the ability to do split reads, in which each read request goes to the disk that is less
busy. This performance boost is in addition, of course, to providing redundancy for the
controllers as well as hard disks. |
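The split read idea can be sketched in a few lines. This is only an illustration under simple assumptions: the function names `pick_disk` and `dispatch_reads` are hypothetical, and "busy" is modeled as nothing more than a count of pending requests per disk.

```python
def pick_disk(queue_depths):
    """Return the index of the disk with the fewest pending requests."""
    return min(range(len(queue_depths)), key=lambda i: queue_depths[i])


def dispatch_reads(num_reads, queue_depths):
    """Send each read to the less busy disk and return the chosen indexes."""
    depths = list(queue_depths)      # work on a copy of the pending counts
    choices = []
    for _ in range(num_reads):
        disk = pick_disk(depths)
        depths[disk] += 1            # the chosen disk now has one more request
        choices.append(disk)
    return choices


# Disk 0 already has three requests waiting; disk 1 is idle.
print(dispatch_reads(4, [3, 0]))     # [1, 1, 1, 0]
```

Because both disks hold the same data, either one can satisfy a read, so the work naturally drifts toward whichever disk has the shorter line.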
The best way to protect the disk channel while
getting an incredible boost in performance is to use a RAID system instead of a SLED
system. SLED means Single Large Expensive Disk while RAID means Redundant Array of
Inexpensive Disks. Note that mirroring is really a simple form of RAID. |
RAID devices are divided into categories called
levels as follows: |
|
Level 0 - Data striping without parity. That means
data is spread out over multiple disks for speed. |
Level 1 - Mirrored disk array. For every data disk
there is a redundant twin. Also includes duplexing, the use of dual intelligent
controllers for additional speed and reliability. |
Level 2 - Interleaves data across the array bit by bit, with additional drives
holding error-correcting codes. Reads use only whole sectors. |
Level 3 - Parallel disk array. Disk striping with
dedicated parity drives. Drives are synchronized for efficiency in large parallel data
transfers. |
Level 4 - Independent disk array. Reads and writes
on independent drives in the array with dedicated parity drive using sector-level
interleave. |
Level 5 - Independent disk array. Reads and writes
data and parity across all disks with no dedicated parity drive. Uses parallel transfers.
Multiple controllers optionally used for higher speed. Usually loses only the equivalent
of one drive of array for redundancy. This system is the most popular these days due to
both speed and cost effectiveness. |
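The parity mentioned in Levels 3 through 5 is what lets an array rebuild a lost drive. The sketch below assumes XOR parity, the usual scheme for these levels, although the level descriptions above do not spell it out. The helper name `xor_blocks` and the sample data are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, one per drive, plus a parity block computed from them.
data = [b"LAN1", b"LAN2", b"LAN3"]
parity = xor_blocks(data)

# Suppose the drive holding data[1] dies. XOR the survivors with the parity
# block and the missing block reappears.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])   # True
```

The same arithmetic works no matter which single drive fails, which is why one drive's worth of parity is enough to protect the whole array against one failure.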
Due to recent speed and efficiency improvements in
RAID Level 5, I now recommend it to most of my clients. In general, using RAID 5 means that
you only lose the capacity of one drive within the array. In an array of seven disks, for
example, you would lose only one-seventh of the total array capacity to redundancy (though
the actual redundant data is spread across all of the drives - there is no single
redundant drive). For an added boost in performance, you can use multiple controllers with
a single array, switching from software-based (using the file server's CPU) to
hardware-based RAID (using controllers with their own processors) when maximum performance
is required. |
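The capacity arithmetic for the seven-disk example can be worked out as follows. This is a minimal sketch: the function names and the 1000 MB drive size are assumptions chosen for illustration, and it assumes all drives in the array are the same size.

```python
from fractions import Fraction


def raid5_usable(num_disks, disk_size):
    """Usable capacity of a RAID 5 array of equal-sized disks."""
    if num_disks < 3:
        raise ValueError("RAID 5 needs at least three disks")
    # One disk's worth of space holds parity, spread across all the drives.
    return (num_disks - 1) * disk_size


def redundancy_share(num_disks):
    """Fraction of the raw array capacity given up to redundancy."""
    return Fraction(1, num_disks)


print(raid5_usable(7, 1000))   # 6000 (MB usable out of 7000 MB raw)
print(redundancy_share(7))     # 1/7
print(Fraction(1, 2))          # 1/2, the share given up by a mirrored pair
```

Compared with mirroring, which gives up half of the raw capacity, the cost of redundancy in RAID 5 shrinks as the array grows.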
Using external RAID systems in conjunction with the
Secondary Server/Workstation concept mentioned previously results in a very efficient
disaster recovery plan. When the main server goes down, merely plug the RAID system into
the back of the secondary server, reboot, and you are on line immediately with files that
are only minutes old! |
An important factor that can affect LAN reliability
is power protection. Can I assume that everyone by now knows that you must protect a file
server with a high-quality uninterruptible power supply (UPS) and connect that UPS to the file
server with an intelligent communications link? Well, if you didn't know it before,
consider yourself informed. It is required! |
Now for a little quiz. Can you name the best
electrical conductor? (Jeopardy song.) Silver! How about the second best? (Jeopardy song.)
Copper! Now, what's the third best? (Jeopardy song.) Gold! Now, can you name the
material, excluding fiber optics, inside LAN cables? (Jeopardy song.) Copper! The question
for the bonus round is based on that last answer. Can you guess what you call a file server
protected by a high-quality UPS on a LAN where all of the workstations and printers are
properly protected by high-quality surge suppressors, and there is one unprotected phone
line going into a modem? (Jeopardy song - long version.) The answer is "toast."
Power protection works on the weakest link theory. Even one weak link can destroy a
system. |
If lightning strikes close to a phone cable near
your building and induces a current in the line going into that modem, the power can
easily go through the modem, into the serial port, into the motherboard, into the network
interface card, jump onto the network cable, and spread out over the LAN. Lest you think I
am exaggerating the danger, this happened a few weeks ago to a new client of mine. After
repeated warnings of impending doom, they finally bought a UPS to protect their file
server, but they refused to heed my advice about their inadequate surge suppressors. The
day after they installed a new, faster file server, they called to tell me that it had
died overnight. When I asked for additional clues, they mentioned that a workstation had
also bitten the dust. It turned out that the building's power had problems overnight. The
resulting overvoltages cooked the workstation. The file server was next in line on the LAN
cable, and it got fricasseed, too. Please don't make the same mistake. Good power
protection is relatively inexpensive to implement and very expensive to omit. |
I will continue with the ongoing theme of LAN Design
next month. Note that all contestants receive the home version of the LAN Design game as a
parting gift. |
|
©1994, Wayne M. Krakau |