michelandre

Legato StandbyServer: how to reset the Primary when switching

NW 4.11, Compaq 3000
Vinca StandbyServer for NW 4.11, dedicated link,
No UPS, but plugged into an emergency generator

Problem:
To make sure that, when the standby switches over, the Primary server is really off and won't come back.

Cause:
If the Primary server is too busy (or otherwise unable) to respond to the pings of the standby server, the latter will time out and then switch to primary mode.
If the original Primary then stops being busy and comes back, I will have 2 Primaries with the same IP address and server name (a nightmare...).

Goal:
To reset the original Primary just before the standby server switches. In that case the original Primary will reboot but will stop in DOS mode, because there is no command in autoexec.bat to execute server.exe.

Just before switching, you can give it an NCF file to execute.
I can use or plug something into the serial or parallel port, etc.,
or buy some kind of hardware...
Zombite

Seems strange that the standby server starts up when the primary is still there but busy... Are you sure it is set up OK? Maybe you should increase your timeout...
michelandre

ASKER

to Zombite:

This is a very critical and important server.
Somebody explained to me that last year they had a server that did not respond for almost 1 hour and then responded fine after that.
The explanation was that the server was busy compressing a very big file. This is a low-priority thread and it won't release the CPU to the OS until finished... (possible???). Some of the files are a few GB.
Auto restart after abend is set so that if the Primary server abends, it will not try to recover.

I was thinking of making a small .exe to execute after the standby is down and before switching from C:\STANDBY to C:\NWSERVER, but I noticed that the startup of server.exe uses a temporary startup.ncf as a parameter while going up, so I don't want to mess around with that.

We are using a kind of electronic telereset, which is a box with an optodiode input. It just needs a +Vdc and a ground. When the +Vdc is present, it activates a relay which cuts the power to the AC outlet. I can make a decoder to plug into the parallel port and wait for a certain code to come through. Then the output of the decoder will activate the reset box. The problem with this scenario is that I have to build the electronics and then write an NLM to print something just before the Standby server goes down, so that it activates the reset. I don't know how to write such an NLM, and there might be some glitches going to the parallel port on power-up while it is initializing.
I thought of Winchute with remote down, but there is no UPS on that server because it is plugged into an emergency generator. Even with that, the remote server has to be live and responding.
I think the best is to buy something that will do the job.
For sure I am not the first one with such a problem.
The secondary server LAN card: is it connected to the LAN and running all the time, or does it only load and link when the secondary server starts up?

Just thought you could connect a cutoff to the LAN card connection of the primary server and connect the in/out of that to the link LED on the secondary server. When the link LED comes on on the secondary server, it disconnects the LAN from the primary server.
Either that, or connect your power-off device to the link LED of the secondary server LAN card.

Quite nice. I will investigate that.

The secondary NIC is used for synch of the mirrored disk only.
It is not on the main LAN. IPXRTR routing=none. So there is no routing at all between them.
The standby server has 2 NICs: one for the synch and one for the main LAN. The software pings the PRIMARY through the NIC attached to the main LAN to see if it is alive.
I could not find the documentation online, but isn't there a script (NCF file) that runs as it switches each way? Couldn't you just add a line to the script that copies between two versions of AUTOEXEC.BAT?

One that brings the server up and one that does not, using something like Toolbox.

To Ipenrod:
Interesting. As far as I know, when the standby server switches over, it goes down and lands in the C:\STANDBY directory, because I see the DOS prompt. Then I see a command line starting server with a -s parameter. I guess that indicates the startup.ncf file. After that I see a startup.tmp or some other extension, but the meaning is the same, i.e. "temporary". And there is something else after it that I can't catch. All that to say that those parameters are built up temporarily, for the switch-over only. I don't want to mess with that yet. I prefer another way if possible.
It is possible to copy a new config.sys and autoexec.bat (while in DOS), but a self-modifying object is never recommended, so I am still searching for the time being and hoping someone here will have a brilliant way.

For the documentation:
- as a start:
http://www.legato.com/support/documentation/
- more specific:
http://web1.legato.com/cgi-bin/catalog?sf=Releases&level=1-8
- exactly:
http://web1.legato.com/infodev/publications/standbyserver/standbyserver1.0/sbs.pdf

By the way, it is really the best tool for changing a server.


If you were to connect the power-off device (or even the reset switch) of the primary server to the inactive network card link LED, you could prevent the primary server from coming up again by renaming the autoexec.ncf on the primary server using Toolbox.

On normal boot, at the end of autoexec.ncf, use delaycmd to run Toolbox to rename the autoexec.ncf.

Last lines of autoexec.ncf:
load toolbox
load delaycmd 1200 "move sys:system/autoexec.ncf sys:system/autoexec.ded"

The server will not be able to load autoexec.ncf and will stop at the prompt asking for the server name.

To restart, you would have to manually provide the server name and internal net, mount SYS, rename autoexec.ncf back again, and then restart.

Don't know if this will suffice, but it is getting interesting...
This sounds like a NIC-related fault to me. What should happen is that any failure to ping the primary server should be confirmed across the data link. If your primary is busy or there is a problem with your network, then the standby machine should not take over unless it can't get a good response from the secondary NIC.
Even if your network NICs aren't 100Mbps, your secondary NICs should be. If large files are being processed, both the network and secondary NICs will be busy.
Have there been any recent changes or upgrades? Are you running with the latest drivers, etc.?
Is it possible you have an intermittent NIC failure? Can you swap out the NICs for new or known-good spares?
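The confirmation rule described in this comment can be sketched in a few lines of Python. This is only an illustration of the logic, not how StandbyServer is implemented; the function name, the missed-ping counters, and the threshold of 3 are all hypothetical.

```python
def should_take_over(lan_missed: int, link_missed: int, threshold: int = 3) -> bool:
    """Decide whether the standby should promote itself.

    lan_missed  -- consecutive unanswered pings on the main LAN (hypothetical counter)
    link_missed -- consecutive unanswered pings on the dedicated link (hypothetical counter)
    threshold   -- misses needed on EACH path before failing over (assumed value)
    """
    # Silence on one path alone is not enough: the primary may just be
    # busy on the LAN side while still answering over the dedicated link.
    return lan_missed >= threshold and link_missed >= threshold


# Primary silent on the LAN but still answering on the dedicated link: stay put.
print(should_take_over(lan_missed=10, link_missed=0))
# Primary silent on both paths: take over.
print(should_take_over(lan_missed=10, link_missed=10))
```

With a rule like this, a primary that is merely overloaded on the main LAN would not trigger a switch-over as long as the dedicated link still gets answers.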
Now I am investigating the batch files to find an easiest way. I will keep you informed.
Yeah, I would have to agree with Martin, as I started out. It does seem strange that the Vinca server kicks in when the other server is busy. Even while compressing a large file, it should not stop responding to ping.
I don't exactly know what the problem was that kept the server from responding, but I suspected file compression or something like that taking all the power of the server and not releasing the CPU for a long time.
The person who reported the problem said the server was not responding for a long time and then suddenly it was responding.
I don't really care what the cause was. I just want to make sure that it doesn't happen again. I could increase the delay, but the switch-over already takes about 3 minutes, which is considered very, very long. As I wrote before, this server is extremely important.
Also, we have more than 300 servers. If this server comes back after the switch-over, I don't know what it will do to the tree, but one thing is sure: I don't want to see it happen.
There is a reproduction, on a closed LAN, of the 5 top servers of the tree in a lab environment. If we have time to try, we will put both servers in the tree with the same name, internal IPX number, and TCP/IP address. We are waiting to try it because it takes quite some time to mount that lab.
I am now checking the batch file that brings up the server in standby mode. While it is going up the first time, it passes as parameters the directory path of the original server.exe and the name of the text file to use when switching. Also, the temp file I wrote about before: I think it is made up of the "load retyes.nlm" line and the startup.ncf file.
The retyes module is used to answer yes to the question "Do you want to mount the volume anyway?" when the server is booting NetWare after the failure of the Primary server, when the mirror is broken because the Primary is no longer there.
What I will try is to send the switching program to another directory instead of c:\nwserver (when it changes to that directory, the server is in DOS mode). After it has changed directory, it wants to execute server.exe with certain parameters (the "execute server.exe" step is built in and cannot be changed). I will make a server.bat file that will call another program, which will activate the CTS line of the serial port, which will turn off the power bar that feeds the Primary server. After that, the batch file will switch to the c:\nwserver directory and execute server.exe with the passed parameters.
I think that this scenario should work.
I will keep you informed of developments.
joopv
If your prim. server is so busy that it won't respond to the Vinca pings over a dedicated link, you have another problem that should be solved *first*.
Because in that case the server doesn't respond to LAN requests either.

Try going into the debugger during this 'hang' situation (shift/shift/alt/esc) and see what NLMs are running.

There are some nice documents on the Novell Knowledgebase concerning high utilization and non-responsiveness and ways to address these issues.

to joopv:

This problem happened last year.
For myself I never saw it.
The problem is that it can happen again.
I just want to make sure that if it does happen again (or some kind of problem like it) I will have a solution.
As Murphy said: "It is not if but when..."
Just a stupid thought, but have you set "Upgrade Low Priority Threads" to ON? If so, turn it off.
I found a solution.

SBS goes up by a batch file.
The first parameter is the directory NW boots from, i.e. NWSERVER.
The 2nd one is the path of the warning text file.

When it switches, it goes down, then changes directory to the NW dir (the parameter passed on the way up) and looks for server.exe. Then it assembles a temporary startup file, which is a combination of startup.ncf + load c:\standby\retyes.dsk (to answer yes to: "The mirror is broken, do you still want to mount sys...").
Then it executes server.exe -s startup.tmp
When up, it deletes startup.tmp

The trick is to replace server.exe with a server.bat that will do exactly that (assemble the startup.tmp, then execute the server.exe program).

At the beginning of server.bat, I include "Sentry 2 OFF", which is a program that sends an "OFF" command to the IPM (Intelligent Power Module) attached to the COM2 port: a commercial device which turns off power to the PRIMARY server's power bar.

Now I am sure the PRIMARY is really off.
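
The order of operations matters here: power off the PRIMARY first, then assemble startup.tmp, then start the server. A rough sketch of that sequence follows, written in Python purely for illustration (the real thing is a DOS batch file). The function names and the two callbacks are hypothetical, and appending the retyes line after the startup.ncf contents is an assumption based on the description above.

```python
from typing import Callable, List

# Line taken from the description above; everything else is illustrative.
RETYES_LINE = r"load c:\standby\retyes.dsk"


def build_startup_tmp(startup_ncf: str) -> str:
    """Return the contents of startup.tmp: startup.ncf plus the retyes line."""
    lines = startup_ncf.splitlines()
    lines.append(RETYES_LINE)
    return "\r\n".join(lines) + "\r\n"


def switch_over(power_off: Callable[[], None],
                run: Callable[[List[str]], None],
                startup_ncf: str) -> str:
    """Fence the primary, assemble startup.tmp, then launch the server."""
    power_off()                                # 1. make sure the PRIMARY is off
    tmp = build_startup_tmp(startup_ncf)       # 2. startup.ncf + retyes line
    run(["server.exe", "-s", "startup.tmp"])   # 3. boot NetWare with the temp file
    return tmp


# Dry run with stand-in callbacks that only record what would happen.
log: List[object] = []
switch_over(lambda: log.append("power off"), log.append, "line one\r\nline two\r\n")
print(log)
```

Doing the power-off as the very first step means the PRIMARY cannot wake up while the standby is still booting.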

It took me some time to find that, but I was helped by an old-timer who knew the tricks.
To come back to the original problem: if the primary server comes back after a period of 'meditation' and discovers that another server is already up and running with its license and IP address, it should isolate itself from the network.

I have actually seen this happen at a customer's site 2 weeks ago. The primary server came back to life after a period of 100% utilization (which still has to be addressed), discovered that another server was online (the secondary, which had restarted as primary) with its license, and isolated itself from the network, not creating any damage that way.
NW4.11sp6 + Vinca SBS4.
In a lab environment, we tried to start both machines in nwserver.
The last one to boot did not bind IP, but it did bind IPX. We were no longer able to log in, and communication was disastrous.

to joopv:

What happened to NDS?
Nothing special happened with NDS since the failover was complete and the other server (after waking up) had isolated itself from the network.

I guess there is some option in the IPX stack that takes care of the isolation when an internal IPX network address conflict is detected? Or the license violation takes care of it; I am not sure.
The September 27 solution was a good one. It worked perfectly.
I put a SENTRY box on port COM2 of the StandbyServer.
When the server goes down, it sends a "PORT  1  OFF" through COM2 and the SENTRY activates the relay, which cuts the power to the PRIMARY.
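For readers who want to picture what goes over the wire, here is a minimal sketch in Python (the actual setup uses a DOS program on the standby machine). The command text is copied from the post, spacing included; the function name, the carriage-return terminator, and the use of a generic byte stream standing in for COM2 are assumptions, and the SENTRY's real protocol should be checked against its manual.

```python
import io
from typing import BinaryIO


def send_port_off(port: BinaryIO, outlet: int = 1, terminator: bytes = b"\r") -> bytes:
    """Write a 'PORT  <n>  OFF' command to an already-open serial stream.

    port       -- any writable binary stream (a stand-in for COM2 here)
    outlet     -- outlet number on the power module
    terminator -- end-of-command byte(s); carriage return is an assumption
    """
    command = f"PORT  {outlet}  OFF".encode("ascii") + terminator
    port.write(command)
    port.flush()  # make sure the bytes leave the buffer before the server boots
    return command


# Demo against an in-memory buffer instead of a real serial port.
fake_com2 = io.BytesIO()
send_port_off(fake_com2, outlet=1)
print(fake_com2.getvalue())
```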
This question has a deletion request Pending
If you delete this question, it is lost to the knowledge base. I know you solved the problem yourself. But why throw the question into the bit bucket?

What else can I do, and how?
ASKER CERTIFIED SOLUTION
joopv
Well, for the knowledge... I will give it to you.
I have enough points left to ask a few questions if I am stuck in the future.
I learned quite a bit here (technically and otherwise), so...

Thank you all.

Michel-André
I would like to include a JPEG drawing of the solution, but again, how?
I had a similar case recently: Vinca switching over because the prim. server stayed at 100% utilization for some time. It turned out to be a server-based virus scanner that caused it...

If you want to include a drawing:
convert the drawing to a text format, for example using UUENCODE. Then paste it into a comment.

Btw, line drawings are often MUCH smaller if you use GIF format instead of JPG. JPG is meant for photographic material. Schematics, line drawings, etc. are stored much more efficiently in GIF format.

Yesterday (02:30h) we had a call from the CO that the server was not responding to ping and that they could not see the server.

We turned off compression and everything came back fine.

But the Standby server never switched over. It had a dedicated NIC.

So I imagine that if it hadn't had a dedicated link, it would have switched, because there was no ping response. Ping uses TCP/IP, not IPX.

We didn't try an ipxping. Maybe it would have responded.

Next time I will try.

By the way, VINCA is the best way to change the hardware machine. It synchronizes at 20GB/hr with a 100 Mbps NIC.
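
To put that figure in perspective, here is a quick back-of-the-envelope conversion in Python. It assumes decimal gigabytes (1 GB = 10^9 bytes) and that the NIC is a 100 Mbit/s card; the function name is just for illustration.

```python
def gb_per_hour_to_mbit_per_s(gb_per_hour: float) -> float:
    """Convert a sync rate in GB/hour to Mbit/s (decimal units assumed)."""
    bits_per_hour = gb_per_hour * 1e9 * 8   # bytes -> bits
    return bits_per_hour / 3600 / 1e6       # per second, in megabits


rate = gb_per_hour_to_mbit_per_s(20)
print(f"{rate:.1f} Mbit/s")  # prints "44.4 Mbit/s"
```

Under those assumptions, the sync uses a little under half of the link's nominal capacity.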

Also, there is no need to install the whole program. Just load the 4 commands manually and the servers will see each other.

No need for NLSP that way, either.

Great program.