Want to set up a pair of matching systems for "near online" redundancy

Posted on 2006-04-18
Last Modified: 2010-04-25
I have a customer who is located in a remote area.  They have frequent power outages and get hit with some pretty tough weather (lightning strikes).  Over the years, I was a proponent of separate systems for each of their industrial functions.  My thinking was we were spreading the risk over more systems.  We had a spare parts kit available.  This worked very well for many years - good old single user DOS days.

This past year they have upgraded their control software to the XP version - and in the process, consolidated 3 functions that were running on 3 individual machines onto a single box.  When this machine goes down - their ability to produce is gone with it.

Here is the situation I would like to create for them.

I want the highest level of redundancy for every component in their desktop system.  I want to be able to direct them (over the phone) to replace (swap) any component out with a spare replacement in order to get themselves back up and working in the shortest possible time.  The solution can include 2 matched desktops, matched in configuration and setup, plus a complete spares kit.  The solution must cover all Windows XP activation issues - in the event of a swap of hard disk from one matching system to the other.  This is the "near online" backup situation.

A fully redundant "on line backup" situation would also be considered.
Question by:Computer_Dan2
    LVL 69

    Accepted Solution

    Just as a base backup, I would recommend a hot-swappable RAID-1 mirror: pull one drive out at the end of every day and store it, then put a new drive in to replace it and let the RAID array rebuild.  Keep an identical system offline and isolated, and don't even plug in the power; plug in the drive from the mirror when switching online.  You can keep spares around for the most redundancy, but you could also order replacement parts for the failed system while the backup is in use, depending on how risky you think it is to run until a backup is available.

    To avoid the activation issue, you could set up two identical systems but mirror just the data.  At worst, you may need to go back one day, but you may be able to use the mirror as it was at failure time.
    LVL 2

    Assisted Solution

    I agree with callandor.
    Your RAID array is going to be the only way of having a live backup or redundancy.
    The only other option is an end-of-day full system backup to a "server" machine or external storage device. That way, when you bring in the new system, you just restore from your backup. Again, though, it is not going to be real time, so if you fail you still have to start from the time of the backup and re-enter the data.

    If your environment is that extreme, I would implement both methods.
    Run your RAID array as a first-level defense, so that if only one hard drive fails you don't have to go through the hassle of replacing a whole system. An extreme power surge could take out both hard drives, though, so also run a regular full system backup to a device that can be plugged in and then unplugged when the backup is not being run. That way, if you have to pull the entire system and replace it with a spare, you can still recover. Just as callandor said, make sure the hardware on the spare system and your replacement hard drives are the same as your originals.
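
The end-of-day backup to a removable device could be scripted along these lines. This is only a minimal sketch, not a tested procedure for this plant: it assumes the control software's data lives in a single folder and that Python is available on the machine. The function names `end_of_day_backup` and `sha256_of` are made up for illustration. It copies the data folder into a dated folder on the backup device and then re-reads every file to confirm the copy matches.

```python
import hashlib
import shutil
from datetime import date
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of one file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def end_of_day_backup(data_dir: Path, backup_root: Path, day: date) -> Path:
    """Copy data_dir into backup_root/<YYYY-MM-DD> and verify every file."""
    target = backup_root / day.isoformat()
    # copytree raises FileExistsError if this day's folder already exists,
    # so an earlier backup is never silently overwritten.
    shutil.copytree(data_dir, target)
    # Re-read both sides and compare hashes to catch a bad copy.
    for source_file in data_dir.rglob("*"):
        if source_file.is_file():
            copied = target / source_file.relative_to(data_dir)
            if sha256_of(source_file) != sha256_of(copied):
                raise IOError(f"verification failed for {copied}")
    return target
```

A call such as `end_of_day_backup(Path(r"D:\ControlData"), Path(r"E:\Backups"), date.today())` would be run just before the device is unplugged; both paths are hypothetical. Keeping one folder per day means a corrupted backup from today does not destroy yesterday's copy.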

    LVL 3

    Assisted Solution

    Given that this sounds like it's at the top of Mt Washington, my first recommendation is that they do everything possible to prevent damage in the first place. The ideas the guys have come up with are excellent, but you would be far better off avoiding the damage altogether.

    1) Install filters on EVERYTHING! It sounds like the hardware most susceptible to damage will be the interface hardware in line between the controller and the process unit. I suspect that this will be some form of RS232 link (although 1553 is also likely). Make sure this is filtered and has the best possible screening.

    2) Make sure both your primary and backup PCs are running on a filtered and protected supply (a UPS sounds like a good move here).

    3) Build a Faraday cage around the PCs (i.e. put them in a heavily earthed metal box).

    Just a couple of ideas based on the "prevention is better than cure" premise.


    LVL 87

    Expert Comment

    Use a PC based on server hardware (HP or Dell). These often already come with RAID, multiple processors, and redundant power supplies and RAM included. Servers normally also use better components, which makes them more likely to survive situations like the one you described. Servers can usually also be ordered with a fast-response service agreement from the manufacturer in case things go wrong.

    Author Comment

    I appreciate all the input.  

    You have helped me put more focus on my question.

    You are all right in your comments.  In this particular situation we have done quite a bit to isolate and protect the machine(s) involved in these processes.  Lots of isolation devices on the lines leading to the PC and on the plant functions, lots of surge protection and battery backup capacity.  The Faraday Box idea was a good one - will investigate it.

    I had considered an IDE boot drive with a SATA RAID-1 drive setup behind it.  All data would be on the RAID setup, all software installed on the IDE.  The backup machine would be set up the same way.  In the past (with multiple vendors) I have had problems promoting the second drive in a RAID-1 setup to a primary for boot purposes.

    This "disaster recovery" stuff is always tough because the simplest forgotten or overlooked item can bite you big time.

    I am looking for a more detailed, step-by-step configuration.

    LVL 70

    Assisted Solution

    A few comments (some have already been made above):

    First, it is crystal clear that the system(s) need to be exceptionally well grounded, and should be protected by a high-end UPS system ==> I'd use a true sine wave UPS unit here.   The concept of incorporating a shielded enclosure is a good one as well.

    Second, depending on just how "near real-time" the backup needs to be online, you may want to use a set of Windows 2003 systems operating as a cluster -- assuming the "XP version" of the software you need to run works okay on 2003.   This is a more expensive and complicated setup; but will provide continuous operation in the event of a single system failing.  In addition, most servers will support redundant power supplies, so the likelihood of failure, particularly with a good UPS, is quite small -- and the likelihood of BOTH systems in the cluster failing is indeed very tiny.   The only issue I can think of for this is the switching of the control elements -- most industrial control devices are designed for this, but you weren't real clear exactly what the interface is to the devices being controlled.

    If you don't want to use clustering (or want to stay with desktop solutions), then I agree with the concept of using two distinct systems rather than trying to swap parts, as this completely avoids potential activation issues.     Rather than trying to implement some RAID-swapping system (as suggested above) to keep the data current, I would keep the data on a protected network device like a Buffalo TeraStation that could be accessed by both systems.   The TeraStation has built-in RAID, and could be configured for full RAID-1 redundancy.   In the event of a system failure, they need only turn on the 2nd system, and it will have access to the most current data on the TeraStation with no intervention on the user's part.
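
When the 2nd system is switched on, it may be worth confirming that it really can see current data on the shared device before production restarts. The following is a rough sketch under the assumption that the share is mapped as a drive letter or mounted path on both machines. The function name `newest_data_timestamp` is hypothetical and nothing here is specific to the TeraStation. It reports the modification time of the newest file on the share, or `None` if the share cannot be reached or holds no files.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def newest_data_timestamp(share: Path) -> Optional[datetime]:
    """Return the modification time of the newest file under share.

    Returns None if the share is not reachable as a directory or is empty.
    """
    if not share.is_dir():
        return None
    # Collect modification times of all regular files on the share.
    mtimes = [p.stat().st_mtime for p in share.rglob("*") if p.is_file()]
    if not mtimes:
        return None
    return datetime.fromtimestamp(max(mtimes))
```

Someone on the phone could then ask the operator to read back the reported time, for example from `newest_data_timestamp(Path("Z:/"))` (the drive letter is an assumption), and compare it with when the primary machine failed.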

    LVL 26

    Expert Comment

    Just a little point. ~ One of those "little things" that can bite you...

    I think everyone who's posted in this thread understands the difference between shutting down a computer and unplugging it. ~~ But your client and the employees there may not.
    For that unplugged back-up machine...... Some training may be required.
    Make sure they know unplugged means *UNPLUGGED*.
