SFT III-PEAT, Part Two

by Wayne M. Krakau - Chicago Computer Guide, September 1993

This is a continuation of last month's column on Netware V3.11 SFT III (System Fault Tolerant Level Three), Novell's method of using duplicate servers to increase network reliability.

SFT III splits the Netware V3.11 NOS (Network Operating System) into two parts. The first part, the IO (Input/Output) Engine, contains all programs that require direct manipulation of hardware. Even NLMs (Netware Loadable Modules), Novell's technique for adding features directly to Netware, are covered by this rule. NLMs that control hardware, such as NIC (Network Interface Card) and disk drivers, must be loaded on the IO Engine.

Since each of the Mirrored Servers is a separate machine, with its own unique hardware, there are two IO Engines, one for each machine. These are named by the system installer just like any other Netware server. While there are no real default names, the documentation refers to the two servers as LEFT_IO and RIGHT_IO, with the designations arbitrarily assigned.

The installer can give them more descriptive names, based on their locations, if she or he wishes. That would make for less ambiguity when documenting procedures. It is easier to remember that a server called COMP_IO is in the computer room than to remember which way you should be facing to locate a cryptically named LEFT_IO. In spite of this suggestion, it is easier to use the common, documented names of LEFT_IO and RIGHT_IO in a general discussion of SFT III characteristics.

The other part of Netware is called the MS Engine, where MS stands for Mirrored Server. It is a logical, not physical, entity, but it is named like any other server. The rest of the network sees the MS Engine as the lone representative of the SFT III mirrored servers. It runs concurrently on both of the mirrored servers and handles those tasks that are independent of the underlying hardware details. NLMs that monitor or govern the more global aspects of the network must be loaded on the MS Engine. Note that the use of the singular term MS Engine is proper, even though it runs on both physical servers. An NLM that is loaded on the MS Engine automatically runs on both physical machines as part of the SFT III structure.

Any NLM that runs on the IO Engine must be separately loaded on each physical server. In some cases, depending on the vendor's licensing policies, two copies of an NLM may need to be purchased to avoid piracy allegations. It is not always obvious which engine is appropriate for a particular NLM. When in doubt, remember you can always RTFM (Read the F...... Manual)!
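The loading rule can be sketched in a few lines of Python. This is only an illustration of the rule described above, not anything Novell ships. The module names NICDRV and BACKUP and the engine name MS_ENGINE are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nlm:
    name: str
    touches_hardware: bool  # True for NIC drivers, disk drivers and the like


def load_plan(modules, io_engines=("LEFT_IO", "RIGHT_IO"), ms_engine="MS_ENGINE"):
    """Return (engine, module name) pairs showing where each LOAD is issued."""
    plan = []
    for module in modules:
        if module.touches_hardware:
            # Hardware-dependent NLMs are loaded separately on each IO Engine.
            plan.extend((engine, module.name) for engine in io_engines)
        else:
            # A single load on the MS Engine runs on both physical machines.
            plan.append((ms_engine, module.name))
    return plan


if __name__ == "__main__":
    # Hypothetical modules: one driver and one hardware-independent utility.
    for engine, name in load_plan([Nlm("NICDRV", True), Nlm("BACKUP", False)]):
        print(f"{engine}: LOAD {name}")
```

The driver shows up twice in the plan, once per physical server, which is also why the licensing question above arises only for IO Engine NLMs.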

Like NLMs, many SET commands and some other procedures are specific to either the IO Engine or the MS Engine. A few, like the TRACK ON/TRACK OFF diagnostic procedures, can be run in either IO or MS Engines. For instance, you can have three sets of Tracking screens running at once, one each for the LEFT_IO Engine, the RIGHT_IO Engine, and the MS Engine. You can then switch among them using the standard ALT-ESC or CTRL-ESC keystrokes.

Since the two physical servers operate together, you can even switch amongst all available screens from both physical servers and the logical MS Engine while working on just one server's console. All commands, including the vital LOAD and UNLOAD operators, are available from that single console.

An interesting and valuable side-effect of the split of the Netware NOS into two pieces is that you can easily run it on a single dual-processor machine, ignore the warning messages regarding the lack of a "twin", and have the workload of Netware split between the two processors. Only a few machines have the appropriate drivers for this trick, but the performance is said to be awesome. In theory, this feat could be extended to use two mirrored servers, each with two processors, but I haven't seen any performance results using that combination -- yet.

When starting up an SFT III system, the first server to load the MS Engine using the new ACTIVATE SERVER command automatically becomes the Primary Server, while the second becomes the Secondary Server. The choice of which to start first, LEFT_IO or RIGHT_IO, is completely arbitrary, though a "Preferred Primary Server" (my own term, not Novell's) could be assigned to ease documentation and training. It is awkward to teach someone to flip a coin to decide which machine to start first -- better to tell them to try a specific machine first.

The two servers share the same name and identity on the network via the MS Engine. The two copies of the MS Engine communicate via the two IO Engines, since the IO Engines do all the talking to physical devices such as NICs. The main communication is over the MSL (Mirrored Server Link) NICs. This link allows the MS Engine to keep the memory and disk data of the two separate machines completely in sync. Both servers also monitor the regular network traffic to make sure the other is still active.

If the Primary Server fails, either fully or partially, the Secondary Server will detect it via either the MSL or the regular NIC (or NICs). It will then take over as the Primary Server. When the former Primary Server is reactivated, it will discover that a Primary Server already exists and will take up the duties of the Secondary Server. First, it will synchronize its memory with the Primary Server's. Then, the servers will compare dates and times on their volumes and begin a remirroring process from the up-to-date server to the out-of-date server. When the remirroring process is complete, full redundancy is available and the system is ready to incur another fault.
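The way the roles change hands can be modeled as a toy state machine. The Python sketch below covers only role assignment and takeover as described above. It leaves out fault detection, memory synchronization and remirroring, and the class and method names are my own invention.

```python
class MirroredPair:
    """Toy model of Primary/Secondary role assignment; not Novell code."""

    def __init__(self):
        self.roles = {}  # server name -> "Primary" or "Secondary"

    def activate(self, server):
        # The first server to activate becomes Primary. A server that
        # finds a Primary already present becomes Secondary.
        has_primary = "Primary" in self.roles.values()
        self.roles[server] = "Secondary" if has_primary else "Primary"
        return self.roles[server]

    def fail(self, server):
        # Drop the failed server. If it was Primary, promote the survivor.
        role = self.roles.pop(server)
        if role == "Primary":
            for survivor in self.roles:
                self.roles[survivor] = "Primary"


if __name__ == "__main__":
    pair = MirroredPair()
    print(pair.activate("LEFT_IO"))
    print(pair.activate("RIGHT_IO"))
    pair.fail("LEFT_IO")
    print(pair.roles)
    print(pair.activate("LEFT_IO"))
```

Note that the returning server does not reclaim its old role. Whichever machine is Primary at the time stays Primary.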

The SFT III initialization process is governed by three pairs of files that logically correspond to the STARTUP.NCF and AUTOEXEC.NCF in regular Netware V3.11. Each IO Engine has its own IOSTART.NCF and IOAUTO.NCF, and the MS Engine has an MSSTART.NCF and an MSAUTO.NCF. The IOSTART files are executed first. Then the MSSTART and MSAUTO run. Finally, the two IOAUTO files are executed.

A small complication is that the execution of the IOAUTO.NCF files requires an active MS Engine with a mounted SYS Volume (the mandatory name for the first disk volume in a Netware server). When starting an SFT III system from a complete stop (both physical servers shut down), the IOAUTO.NCF files won't execute, since an MS Engine and a SYS Volume aren't already up and running.

The solution for this is to use a little-known but incredibly useful feature of Netware V3.11 - batch files. A raw ASCII text file with valid console commands (LOAD, BIND, SET, etc.) can be executed from the colon prompt on the server's console just like a batch file can be executed at the DOS prompt. The filename extension must be "NCF". The file should be stored in SYS:SYSTEM (the SYSTEM directory of the volume SYS), since that is the default directory for console commands.

If you put a copy of the commands from the IOAUTO.NCF file into a file while following these rules (I called mine START.NCF), you can manually get around the initial start problem. After the MS Engine is alive, and the SYS Volume is mounted, just use either CTRL-ESC or ALT-ESC to get to the IO Engine console screen. Then type the first part of the filename (e.g. START) to execute your batch file.
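The batch file rules are simple enough to check mechanically. The Python sketch below takes a full Netware path and returns the name you would type at the colon prompt. The function name is hypothetical, the backslash separator is an assumption, and treating SYS:SYSTEM as mandatory is stricter than the "should" used above.

```python
def console_command(ncf_path):
    """Return the name to type at the console for a batch file.

    Raises ValueError when the path breaks one of the rules.
    """
    directory, _, filename = ncf_path.upper().rpartition("\\")
    stem, dot, extension = filename.rpartition(".")
    if dot != "." or extension != "NCF" or not stem:
        raise ValueError("the filename extension must be NCF")
    if directory != "SYS:SYSTEM":
        # Assumption: this sketch insists on the default directory.
        raise ValueError("expected the file in SYS:SYSTEM")
    return stem


if __name__ == "__main__":
    print(console_command("SYS:SYSTEM\\START.NCF"))
```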

When one of the two physical file servers goes down without a catastrophic hardware problem, Netware is automatically restarted without falling back to a DOS prompt or fully rebooting the server. As long as the other server stays up, the newly restarted server sees an active MS Engine and an accessible SYS Volume. This allows the IOAUTO.NCF for that server to execute automatically, without human intervention.
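The difference between a cold start and a single-server restart comes down to one precondition, which can be written out as a short Python sketch. The function name is hypothetical, and treating every file other than IOAUTO.NCF as having no precondition is a simplification on my part.

```python
def runs_automatically(ncf_file, ms_engine_active, sys_mounted):
    """Say whether a startup file executes without human intervention."""
    if ncf_file.upper() == "IOAUTO.NCF":
        # IOAUTO.NCF needs an active MS Engine and a mounted SYS Volume.
        return ms_engine_active and sys_mounted
    # Simplification: the other startup files are treated as unconditional.
    return True


if __name__ == "__main__":
    # Cold start: both servers were down, so nothing is active yet.
    print(runs_automatically("IOAUTO.NCF", False, False))
    # Restart of one server while its partner stayed up.
    print(runs_automatically("IOAUTO.NCF", True, True))
```

In the cold start case the check fails, which is exactly when the manual START batch file earns its keep.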

Next month, I'll cover how to make Netware 3.11 SFT III really work.

                                        ©1993, Wayne M. Krakau