Want to set up a pair of matching systems for "near online" redundancy

I have a customer who is located in a remote area. They have frequent power outages and get hit with some pretty tough weather (lightning strikes). Over the years, I was a proponent of separate systems for each of their industrial functions. My thinking was that we were spreading the risk over more systems. We had a spare parts kit available. This worked very well for many years - good old single user DOS days.

This past year they have upgraded their control software to the XP version - and in the process, consolidated 3 functions that were running on 3 individual machines onto a single box.  When this machine goes down - their ability to produce is gone with it.

Here is the situation I would like to create for them.

I want the highest level of redundancy for every component in their desktop system.  I want to be able to direct them (over the phone) to replace (swap) any component out with a spare replacement in order to get themselves back up and working in the shortest possible time.  The solution can include 2 matched desktops, matched in configuration and setup, plus a complete spares kit.  The solution must cover all Windows XP activation issues - in the event of a swap of hard disk from one matching system to the other.  This is the "near online" backup situation.

A fully redundant "on line backup" situation would also be considered.

As a base backup, I would recommend a hot-swappable RAID-1 mirror: at the end of every day, pull one drive from the mirror and store it, then put a fresh drive in its place and let the array rebuild. Keep an identical system offline and isolated, without even the power plugged in; when you need to switch over, plug the stored mirror drive into it. You can keep spares around for maximum redundancy, but you could also order replacement parts for the failed system while the backup is in use, depending on how risky you think it is to run with no further backup until the parts arrive.

To avoid the activation issue, you could set up two identical systems but mirror just the data. At worst, you may need to go back one day, but you may be able to use the mirror as it was at failure time.
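One way the "mirror just the data" idea could be scripted is sketched here. This is only an illustration under assumptions not stated in the thread: that a Python interpreter is available on the machine, that the control software keeps its data in a single folder, and that the second system (or a drive destined for it) is reachable as an ordinary path. The names `mirror_data`, `src_dir`, and `dst_dir` are hypothetical.

```python
import os
import shutil


def _needs_copy(src, dst):
    """True if dst is missing, differs in size, or is clearly older than src."""
    if not os.path.exists(dst):
        return True
    s, d = os.stat(src), os.stat(dst)
    # Allow 2 seconds of slack because some filesystems store coarse timestamps.
    return s.st_size != d.st_size or (s.st_mtime - d.st_mtime) > 2


def mirror_data(src_dir, dst_dir):
    """Copy new or changed files from src_dir into dst_dir.

    Returns the sorted relative paths of the files that were copied.
    Files deleted from src_dir are NOT removed from dst_dir.
    """
    copied = []
    for root, _dirs, files in os.walk(src_dir):
        rel_root = os.path.relpath(root, src_dir)
        target_root = os.path.normpath(os.path.join(dst_dir, rel_root))
        os.makedirs(target_root, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(target_root, name)
            if _needs_copy(src, dst):
                shutil.copy2(src, dst)  # copy2 also preserves the modification time
                copied.append(os.path.normpath(os.path.join(rel_root, name)))
    return sorted(copied)
```

Because only the data folder moves between machines, each system keeps its own Windows installation, which is the point of this approach with respect to activation.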

I agree with callandor.
Your RAID array is going to be the only way of having a live backup or redundancy.
The only other option is an end-of-day full system backup to a "server" machine or external storage device. That way, when you bring in the new system, you just restore from your backup. It is not going to be real time, though: if you fail, you still have to start from the time of the backup and re-enter the data.

If your environment is that extreme, I would implement both methods.
Run your RAID array as a first line of defense, so that if only one hard drive fails you don't have to go through the hassle of replacing a whole system. An extreme power surge could take out both hard drives, though, so also run a regular full system backup to a device that can be plugged in for the backup and unplugged when the backup is not being run. That way, if you have to pull the entire system and replace it with a spare, you can still recover. Just as callandor said, make sure the hardware in the spare system and your replacement hard drives are the same as your originals.
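A rough sketch of the end-of-day backup step follows, assuming Python is available and that the removable device shows up as a normal folder while it is plugged in. The function names and paths are hypothetical and not part of any product mentioned in this thread. The script copies the data into a folder named after the date and then re-reads both copies to confirm they match before the device is unplugged.

```python
import hashlib
import os
import shutil
from datetime import date


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def end_of_day_backup(data_dir, backup_root, day=None):
    """Copy data_dir to backup_root/<YYYY-MM-DD> and verify every file.

    Returns (target_folder, list_of_relative_paths_that_failed_verification).
    Raises FileExistsError if a backup for that day already exists.
    """
    day = day or date.today()
    target = os.path.join(backup_root, day.isoformat())
    shutil.copytree(data_dir, target)
    mismatches = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            src = os.path.join(root, name)
            rel = os.path.relpath(src, data_dir)
            if sha256_of(src) != sha256_of(os.path.join(target, rel)):
                mismatches.append(rel)
    return target, mismatches
```

An empty mismatch list is the signal that the device can be unplugged and stored. Keeping one folder per day also means an older backup is still there if the latest one turns out to be bad.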

Given that this sounds like it's at the top of Mt Washington, my first recommendation is that they do everything possible to prevent damage in the first place. The ideas the guys have come up with are excellent, but you would be far better off avoiding the damage than recovering from it.

1) Install filters on EVERYTHING! The hardware most susceptible to damage will probably be the interface hardware in line between the controller and the process unit. I suspect that this will be some form of RS232 link (although 1553 is also likely). Make sure this link is filtered and has the best possible screening.

2) Make sure both your primary and backup PCs are running on a filtered and protected supply (a UPS sounds like a good move here).

3) Build a Faraday cage around the PCs (i.e. put them in a heavily earthed metal box).

Just a couple of ideas based on the prevention is better than cure premise.



Use a PC based on server hardware (HP or Dell). These often come with RAID, multiple processors, redundant power supplies, and RAM already included. Servers also normally use better components, which makes them more likely to survive situations like the one you described. Servers can usually also be ordered with a short-notice response agreement from the manufacturer in case things go wrong.
Computer_Dan2 (Author) Commented:
I appreciate all the input.  

You have helped me put more focus on my question.

You are all right in your comments.  In this particular situation we have done quite a bit to isolate and protect the machine(s) involved in these processes.  Lots of isolation devices on the lines leading to the PC and on the plant functions, lots of surge protection and battery backup capacity.  The Faraday Box idea was a good one - will investigate it.

I had considered an IDE boot drive with a SATA RAID-1 drive setup behind it. All data would be on the RAID setup, all software installed on the IDE drive. The backup machine would be set up the same way. In the past (with multiple vendors) I have had problems promoting the second drive in a RAID-1 setup to a primary for boot purposes.

This "disaster recovery" stuff is always tough because the simplest forgotten or overlooked item can bite you big time.

I am looking for a more detailed, step-by-step configuration.

Gary Case (Retired) Commented:
A few comments (some have already been made above):

First, it is crystal clear that the system(s) need to be exceptionally well grounded, and should be protected by a high-end UPS system ==> I'd use a true sine wave UPS unit here.   The concept of incorporating a shielded enclosure is a good one as well.

Second, depending on just how "near real-time" the backup needs to be online, you may want to use a set of Windows 2003 systems operating as a cluster -- assuming the "XP version" of the software you need to run works okay on 2003.   This is a more expensive and complicated setup; but will provide continuous operation in the event of a single system failing.  In addition, most servers will support redundant power supplies, so the likelihood of failure, particularly with a good UPS, is quite small -- and the likelihood of BOTH systems in the cluster failing is indeed very tiny.   The only issue I can think of for this is the switching of the control elements -- most industrial control devices are designed for this, but you weren't real clear exactly what the interface is to the devices being controlled.

If you don't want to use clustering (or want to stay with desktop solutions), then I agree with the concept of using two distinct systems rather than trying to swap parts, as this completely avoids potential activation issues. Rather than trying to implement some RAID-swapping system (as suggested above) to keep the data current, I would keep the data on a protected network device like a Buffalo TeraStation that could be accessed by both systems. The TeraStation has built-in RAID, and could be configured for full RAID-1 redundancy. In the event of a system failure, they need only turn on the 2nd system, and it will have access to the most current data on the TeraStation with no intervention on the user's part.

Just a little point. ~ One of those "little things" that can bite you...

I think everyone who's posted in this thread understands the difference between shutting down a computer and unplugging it. ~~ But your client and the employees there may not.
For that unplugged back-up machine... some training may be required.
Make sure they know unplugged means *UNPLUGGED*.