networkn

Script to check timestamp and restart service

We are having a problem where every time the server is restarted, outbound email does not leave until GEE Whiz and GWIA are restarted.
We are getting connection refused errors on the GWIA, and that is something being taken up with our ISP.

However, the workaround is that whenever messages in the gwia\sent folder are more than 10 minutes old, gwia.nlm and gee2.nlm get unloaded and reloaded. What I need is a way to make this happen automatically. Someone said to script something in Perl, but I don't know any Perl.

Essentially the perl script needs to check and if the oldest message in the queue has been around more than 15 minutes, restart the services mentioned above.. Is there another way? Then how do I schedule that to run every say 20 minutes or so?
PsiCop

Hmmm.... yeah, I could see a Perl script to do this. I dunno if all the modules necessary to implement it properly are included in the Perl distribution on NetWare, tho. And adding modules is, unfortunately, a non-trivial exercise.

If you constructed such a process, you could easily schedule it to run using the included CRON.NLM - a port of the UNIX cron daemon to NetWare that's been part of the OS for years.

Perl is not your only alternative. There's also Novell Script for NetWare (NSN). Check out http://developer.novell.com
I have some Perl-scripting-on-NetWare experience. I recently wrote a Perl script to go thru our entire eDirectory Tree, examining all the Description fields of User objects. Some moron thot it'd be a good idea to put the last 4 digits (altho in some cases, it was the entire thing) of employee SSNs in that field as an "authenticator" when people called our Help Desk asking for password resets.

My Perl program tracks down those numbers, removes them from the Description Attribute (which can have multiple Values, each one a discrete string) while retaining any other information in the Attribute (which was used as a dumping ground for all sortsa stuff that should have been put in other Attributes), and writes it into the workforceID Attribute (for which we have created an ACL to restrict access).

So you can do quite a lot in Perl on NetWare, but the lack of support for system() calls may be an issue in this particular case.
billmercer

If you've got 6.5, maybe you could do a bash script to handle this (Easier than Perl IMHO.)

For that matter, you could just use a simple NCF file to unload and reload GWIA, and run it on a scheduled basis. That would be good enough as a temporary solution until you resolve your ISP issues.

I'm also wondering if you can prevent the problem from happening in the first place by futzing with load order. Have you tried delaying the GWIA load for a while after restart?
Is there a firewall between your GWIA and the ISP?  What is it?  

Could be the firewall is futzing enough with comm at startup time to make GWIA's "first try" attempt at establishing a link fail.

Or, as billmercer mentioned, maybe GWIA's just coming up too soon.

You're not using DHCP for your server address(es) are you?
I guess the next logical question would be: why would you be rebooting your server so often that this would even be an issue?  The last time I restarted my servers was 3 months ago when I applied some patches that required a restart.  My at-home servers have both been up for more than a year (of course, I don't run GW at home, and that's just a "lab" environment...)  

Or are you running GWIA on a Windoze server?
I don't think GEE Whiz is available for BillOS...

networkn

ASKER

Ok so I have tried having gwia start last didn't help.

Bash/Perl, doesn't worry me.

The GWIA Console gets the error about Connection refused, after a restart its ok. I wondered about too many send threads
but I adjusted them down a little and no help.

There are no outbound restrictions on internet use at this site.

Static IP for Server and Router. Workstations only DHCP.

Server is being restarted because every week and a bit the tape drive loses communication with the server/backup exec 9.1
with NDMP_NO_DEVICE_ERR (16) and a server reset is the only answer. I think I have a memory leak or something that
causes it but I cannot be sure. I have seg.nlm but I cannot see anything untoward.

I haven't had the luxury of time to test when the failure occurs to determine fully if GWIA or GEE2 is at fault, but my gut is GEE2.

Sadly these days my windows sites are a LOT LOT less hassle than any of my netware sites. After NW65 just doesn't seem that stable. Memory leaks are the order of the day, cache allocators out of available memory etc.. I have been told to tune logical space but
I haven't had long enough to test that works. It took my cache memory from 2GB to 1GB and  someone said make it 1.5GB. I have 1.5GB physical memory.
What is the version/SP of the NetWare server you're having these issues with, and what is the BE 9.1 build number?  

What is the exact "connection refused" error on GWIA?
It occurs on NW65 SP3 and later (any combination I've tried of post-SP patches, SP4a, etc.). The BE91 build doesn't make any difference; I have a variety of builds across sites.

Here are some examples:

25 user site, 1.5GB memory, proliant server, NW65SP3 CPR Version Build 1151 of BE91
4 user site, 1.4GB memory, Proliant Server. NW65SP4a Build 1152 of BE91
20 user site, 1GB Memory, Proliant, NW65SP3, build 1067 of BE91

The top site is the connection-refused site whose issues I logged this question about:

02-17-06 10:06:45 4  DMN: MSG 9 Send Failure: 421 mta4-rme.xtra.co.nz connection refused from [ip removed for security reasons]
02-17-06 10:06:45 4  DMN: MSG 9 Send Failure: 450 Host down (smtp.xtra.co.nz).

02-17-06 10:07:08 0  MSG 30 Analyzing result file: MSISRV/DATA:\GW\DOMAIN\WPGATE\GWIA\result\r3f59108.014
02-17-06 10:07:08 0  MSG 30 Detected error on SMTP command
02-17-06 10:07:08 0  MSG 30  Command:  deloitte.co.nz
02-17-06 10:07:08 0  MSG 30  Response: 421 mta4-rme.xtra.co.nz connection refused from [IP Removed]

Is the site's ISP a mail-relay for their domain's mail??
Yes it is. All outbound email is sent to the ISP's mail server.
Are you sure they are not greylisting you?
Yeah, a 421 usually is that the server is too busy to make a connection.  Since it's accompanied by a 450 host down on what's probably their mail router, guessing that mta4-rme is one of many mail relay host servers for this ISP, some problem at the ISP end does come to mind.

421 isn't only sent when a server's out of sockets - it can be used to refuse a connection without using one of the less innocuous "I'm not going to talk to you" type messages like the ones you get when you've been put on an ORB list.

The 450 host down that accompanies it makes me wonder if it's something at the GW site's side, though.  This kind of behavior can be seen with certain firewall configurations where the firewall acts as a mail proxy.

Is route.cfg being used for the mail relay host, or are you using the relay setting in GWIA and/or the /mh switch? Is the GWIA behind NAT? Does the GWIA server's resolv.cfg look at an internal DNS and then an external one, or are you using NetWare's NAMED to serve private DNS and forward DNS requests to the ISP, or is the firewall/router being used for DNS proxy or DNS lookup?
>Ok so I have tried having gwia start last didn't help.
Did you also try not starting it at all during boot, waiting a while, and then starting it?


Really there are two separate problems, the email connectivity problem and the Backup Exec issue. I could be wrong, but it seems to me the Backup Exec issue is potentially more serious, since it's possibly jeopardizing your backups, and since the email problem could be on the other end.

I've fought with that same error message in BE a few times before. In one case it was a software  bug, the other times, it was actually a hardware issue, a bad cable, and a defective drive respectively. I suggest running any available hardware diagnostics on the affected server, and make sure you have the latest firmware for your server and SCSI controller.

Take a look at the BESTART.FAX and BESTOP.FAX files in the BKUPEXEC directory, and see if there are any obvious errors. Also I'd suggest you do a LOAD BEDIAG, and then take a look at the BEDIAG.FAX and DCDIAG.FAX files.

Billmercer: This problem coincides with Cache Allocator out-of-memory errors. I have tried replacing tape drives and cables, plus it happens on lots of sites, so it's unlikely to be a hardware error.

I have not tried rem'ing out gwia. I will try that next time I restart the server to test.

I am 100% sure we are not greylisted, given that once I can make a connection, email normally works flawlessly. It's typically only upon restart of the server.

I will get information regarding route.cfg etc., but I am pretty sure it's using resolv.cfg and NAMED.


The only veritas doc that talks about this error with BE9.1, where it's actually a media device problem and not attempting to back up a WinXP client, is this: http://seer.support.veritas.com/docs/279988.htm   The other doc is if you're trying to back up WinXP clients from a NetWare BE media server.

The hardware-related doc says that it's mostly indicative of tape hardware problems.  Doesn't mention ANY O/S problems.  I may be way off base here, but I think you'd probably have to reboot even MORE often if you had the same devices in the same configuration on a Windoze server.

The doc mentions an LSI logic-based HBA(scsi controller) using a specific HAM driver as being a definite problem.  It mentions other hardware issues that can cause the message, including: bad tape, dirty tape drive, bad cables, bad HBA, drive-swapping in tape libraries, problems with old hardware information in the media management database and such.

Nowhere does it mention memory or other basic server issues.  Just hardware relevant to the backup device - HBA, cabling, tape drive, tapes, library mechanism.

On these servers, does the tape drive have its own HBA, entirely separate from the disk/raid HBA, and not on another channel of the same HBA?
Yup, it's separate. Totally different controller. It uses the onboard controller of the Proliant ML350 G3 and G4 Servers.

It can't really be a hardware problem on more than 6 sites.. The odds against it are phenomenal, esp. since not every site uses identical controllers/software environments, but they are all getting cache allocator errors.

The controller load commands on the proliant sites are showing adptm160 as the controller

We have all client backup agents disabled in backup exec.

If it was dirty heads, then a tape clean would fix it. However, the only way to resolve the issue is to reset the server (reset server, as I don't think restart server helps); the drive won't even eject the tape.
route.cfg is a special file used for GWIA, primarily in relay-host situations.  It's used by GWIA in addition to the name resolvers in resolv.cfg, to override what those resolvers would otherwise give to GWIA for certain hosts or domains.
Are you getting any other diagnostic (server health log) threshold warnings or errors, like spin locks, small-memory allocations, work-to-do's, etc?  

Are the servers using all NSS volumes?  Have you turned off cache balancing and tuned NSS cache?  What eDirectory version/revision are you running at these sites?
There is no route.cfg that I can find.
As far as I can see, even with cache allocator errors on the console, the NRM is showing green.
Where would I find spinlocks, work to do's etc?

Yes ALL NSS, where do I turn off cachebalancing? Is there a simple document for explaining nss cache tuning?

eDir is whatever is included with the service packs I installed (SP3/SP4a).
Good.  Nothing should be overridden then - it should only attempt to send to the relay host, nothing directly.  (it would have been in the \wpgate\gwia folder, btw...)

Is the relay host set up in the GWIA config as an IP address or as a FQDN?  Is it the "smtp.xtra.co.nz" or is it the "mta4-rme.xtra.co.nz" that's throwing the 421 or something else?  If it is a host name, is that host name (or any of the mta*-rme hosts) listed in the server's hosts file or being cached by NAMED, rather than being resolved "fresh" when it first tries to find it?  

When you reboot with GWIA remmed out of autoexec.ncf, before you try starting it up, try PINGing whatever is in GWIA's mail host setting to see how it resolves for you, or maybe even try telnetting to that address/url's smtp port to see what you get...
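If you'd rather script that telnet-style check than type it by hand, here is a minimal sketch. It assumes a bash build that supports the `/dev/tcp` pseudo-device, which is more likely on a Linux box on the same network than on the NetWare port, and `smtp_banner` is just an illustrative name, not an existing tool.

```bash
#!/bin/bash
# Print the first line an SMTP server sends on connect (its greeting).
# Returns non-zero if the connection fails or nothing arrives within 10 seconds.
# Assumption: this bash was built with /dev/tcp network redirection support.
smtp_banner() {
    local host=$1 port=${2:-25}
    (
        # Open a read/write TCP connection on file descriptor 3.
        exec 3<>"/dev/tcp/$host/$port" || exit 1
        # Read one line of greeting, giving up after 10 seconds.
        IFS= read -r -t 10 line <&3 || exit 1
        # Strip the trailing carriage return that SMTP lines carry.
        printf '%s\n' "${line%$'\r'}"
    ) 2>/dev/null
}
```

Something like `smtp_banner smtp.xtra.co.nz 25` would then show what the relay says first. A healthy server normally greets with a line starting with 220; a 421, or no answer at all, would point at the same refusal GWIA is logging.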
Ok well it will be a few days between now and then so I'll keep you posted. However, I still have the need for a script to fix my mail problem as originally posted :)

/mh=smtp.xtra.co.nz is in the gwia.cfg.

I am thinking that after the connection gets refused, that the system is thinking the host is down.. hence the 450 errors.


If you're in NRM, go to reports/log files and select the link for "server health log file."

It logs every time any measurement value exceeds thresholds for "suspect" and "critical" states.  For instance, if you have over 85% CPU utilization, it's "suspect" state for CPU utilization, and over 95% is "critical."   It logs all sorts of things, like work to do response time, memory allocation errors, and so on.

While you're in NRM, also go to "view memory config," "NLM memory" link, and look to see what's taking up lots of memory.   Look at it after a reboot, then look at it again later in the week (preferably close to the time where your problem would occur.)

If NSS.NLM starts out at, say, 100 MB and climbs to 200, or if DS.NLM starts out at 10MB and climbs to 50MB, or whatever, that's what you need to focus on for memory fragmentation.

The big leaker should've been fixed with most of your installations, but the tuning stuff still should be done.  

PsiCop - what do you recommend? - you've gone through this whole fragmentation/cache tuning thing already, on a much larger scale than I have...
Is the firewall doing any mail proxy, are you doing a straight NAT for GWIA, or what?  There *are* some known firewall-related issues, as I mentioned before,  where when the first server isn't available, none of the others are tried, and the connection just fails.
The firewall is doing NO proxy of any description.  Yes, we use NAT on all our sites.
From the Health Log (it's very long; can I shorten it to just this week or year, or reset it?):

Saturday, 18/02/2006  10:43
Work To Do Response Time on server COBALT was in a BAD State
Current Value - 3
Peak Value - 5
Max Value - 5
Current SUSPECT threshold = More than  2 Tick(s) and Critical threshold = More than  3 Ticks
Current SUSPECT trigger delay = 10 and Critical trigger delay = 20
>It can't really be a hardware problem on more than 6 sites..
Sure it can, if they all share a common element, such as the same chipset, the same firmware version, etc. If you were having the same problem with, say, some IBM machine and an aftermarket Adaptec controller, then I would agree. But so far all you've mentioned is Proliant servers, and based on that, it could very well be hardware.

>It uses the onboard controller of the Proliant ML350 G3 and G4 Servers.
This may be your problem. Here's something interesting. LOTS of people are reporting many problems doing tape backups using the onboard Ultra320 Dual SCSI controllers in Proliant servers.

There have been some long USENET discussions about these controllers in both the Novell and SCO Unix forums, and the consensus is, buy a separate controller card and run your tape off of that. And if SCO and Novell agree on something, chances are it's true.

Neither ArcServe nor Backup Exec recommends using this controller, they consider it unsupported. And apparently even HP themselves don't recommend it for tape drives.

here are some references...
http://tinyurl.com/d8yct
http://tinyurl.com/ccazb

Ok but then i have non proliant servers with different controllers, using the same tape drive, with ndmp errors as well?
Actually that's true. I may also have made a mistake, as the Proliant G4s are using the LSI chipset and I had that very problem (tape drive not recognised by Backup Exec), but these are G3s using the Adaptec controller.
> However, I still have the need for a script to fix my mail problem as originally posted :)
Did you try creating an NCF file? That's the simplest approach I can think of.

Edit a new file called RESETGWIA.NCF, and enter
  UNLOAD GWIA
  # number of seconds to wait before restarting
  DELAY 90
  # GWIA.NCF is my command to load the GWIA; use whatever you load it with
  GWIA.NCF

Then add this command to your CRONTAB, and schedule it to run once per hour, or whatever.

If you want something slightly more sophisticated, you can create a bash shell script that compares the current time to that of the newest file in the folder. To get the timestamp of the oldest file in a folder, use this:
  stat -c "%y" * | sort | head -1

>the newest file in the folder.
I meant to say OLDEST file in the folder.
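To turn that timestamp into something you can compare against a limit, a rough sketch follows. It assumes GNU `stat` and `date` are available in the bash environment, which may not hold for the NetWare port, and the function name `oldest_age_seconds` is made up for illustration.

```bash
#!/bin/bash
# Print the age, in seconds, of the oldest regular file in a directory.
# Prints nothing and returns 1 if the directory holds no regular files.
# Assumes GNU stat (-c "%Y" = modification time as epoch seconds) and GNU date.
oldest_age_seconds() {
    local dir=$1 f mtime oldest=""
    for f in "$dir"/*; do
        [ -f "$f" ] || continue            # skip subdirectories and an unmatched glob
        mtime=$(stat -c "%Y" "$f") || continue
        # Keep the smallest (earliest) modification time seen so far.
        if [ -z "$oldest" ] || [ "$mtime" -lt "$oldest" ]; then
            oldest=$mtime
        fi
    done
    [ -n "$oldest" ] || return 1
    echo $(( $(date +%s) - oldest ))
}
```

A check for the 15-minute limit would then look like `age=$(oldest_age_seconds /path/to/queue) && [ "$age" -gt 900 ] && echo "stale"`, where the path is a placeholder for your own queue folder.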
>Ok but then i have non proliant servers with different controllers, using the same tape drive,
>with ndmp errors as well?

That still leaves the possibility of the tape drive itself. What, specifically, IS the other hardware that you see this problem with? What are all the factors these systems have in common?
Are they all using BE9.1? I have never seen this problem on my 6.5 servers, but I'm using 9.0. Have you tried downgrading to 9.0, to see if that helps?
Do they all have GEE Whiz and GWIA running?
How are you launching Backup Exec? During boot?
Are there any systems that do NOT have this problem?

Have you tried examining the *.FAX files in your BKUPEXEC folders to see if there are any errors or odd messages there that appear on all of the systems?  

Regardless of brand, you should make sure you have the latest firmware updates installed for all your hardware.

You mentioned the machines have 1.5 gigs of ram. There was an issue once with Compaq servers having Cache Allocator memory errors when the machines had a certain amount of RAM. This could be a similar problem. You might want to try bumping one machine up to 2 gigs to see if that helps. (Can't hurt..)
Are these machines all "single server" sites, with everything running on one box? Running the GWIA and GEE Whiz on a separate machine might help...

There are tons of possibilities, the more specific information you can provide, the better the chance we can identify the cause.



Ok well the original question hasn't been answered and I am unable to find a workaround.
As for the tape drive related issues, these really should be opened as a separate question.
>Ok well the original question hasn't been answered and I am unable to find a workaround.

Er, Nobody can know what happens on your end if you don't report back.
What actually happened when you tried using a cron job to batch restart the GWIA automatically? Did it fail to run? Did it run, but not help the problem? Did the server crash?

What happened when you tried moving GEE Whiz onto a separate machine? Was it even possible to try that?

Are all the machines using BE9.1?
Do they all have GEE Whiz and GWIA running?
Are you launching BE at boot?
Are there any systems that don't have these problems?
What, other than the Compaqs you've mentioned, is the other hardware involved?
Yeh, well, the tape drive issue is why you're trying to kludge together a workaround script to do something that shouldn't be necessary.  The reason several of the Experts focused on the reason behind wanting to do the kludge rather than the kludge itself is because we tend to prefer having environments that work *right*  without needing kludgy workarounds.  

Sorry we were trying to help fix the problem rather than focusing on the kludge, but as billmercer said, if you don't respond to the Experts' questions and keep us apprised of your situation, we can't know what is happening on your end.
Well about 16 responses back I showed you the reason mail wasn't going out with the connection refused.
I have also said the problem is a memory leak problem but billmercer keeps asking me to spend hours looking at
the tape drive and hardware. The "kludge"  was a workaround that allowed me to keep the site running whilst I look after
some of the other 60 something clients I have, and gave me time to investigate without the constant calling of this client to ask
me to restart gee and gwia. gee2 isn't a particularly stable product and unloading it and reloading it is something I only want to do
when totally neccessary, so cronning it to occur every few minutes hardly seems clever.

The reason my responses got sporadic is that I simply can't spend days on end (not chargeable to boot) investigating. A Lot of these questions require time I may not have for a week or so between replies.

Some of the questions billmercer asked have been answered more than once already.

Its not that I don't appreciate the help. I do. Its just there are sometimes practical reasons why a kludge will be request instead of a solution.

The tape drive problem occurs on the same day the memory allocator issues occur. It's happening on every site where I have VXAs, but the hardware is different on a few of the sites. NW65 SP2-5 all have memory management issues, and at those sites with cache allocator errors, irrespective of the server hardware, when the memory errors occur, the tape drive stops working.

I also sent a detailed report to exabyte but they were unable to find any connection.

These are all single server sites with between 1-2gb memory.
I'm confused. Originally you said...
>Essentially the perl script needs to check and if the oldest message in the queue has been
>around more than 15 minutes, restart the services mentioned above.. Is there another way?
> Then how do I schedule that to run every say 20 minutes or so?

I suggested two options for doing this, simply restarting on a schedule, and using a bash script
to test the age of the gwia files.

Now you say...
> gee2 isn't a particularly stable product and unloading it and reloading it is something I only want to do
> when totally neccessary, so cronning it to occur every few minutes hardly seems clever.
It was originally your idea, I just proposed two ways to implement it.

>The reason my responses got sporadic is that I simply can't spend days on end (not chargeable to boot)
>investigating. A Lot of these questions require time I may not have for a week or so between replies.
Welcome to the club, everyone here knows what that's like.  One of the benefits of EE is that you can get others to help do some of the legwork in tracking down problems, and sometimes you're lucky enough to find someone who has seen the exact same problem and has a solution.

> Its just there are sometimes practical reasons why a kludge will be request instead of a solution.

As I understand it, you want some automated process that will test for the existence of old files in a gwia directory, then shut down and restart the gwia and geewhiz services. You originally mentioned perl, but said you don't know perl. I don't know it either, but I think bash will do the job. I've given you a suggestion in that direction. If you're not familiar with bash scripting, I can give you more info, but obviously I can't just write something that will work on your system, you'd need to adapt it to your specific environment.

Ok well I was talking Perl because that was what someone else suggested.
If bash works then that's fine too, it just needs to work, full stop.

Don't get me wrong, I do genuinely appreciate the efforts made to try and resolve the issue.

I am not at all familiar with bash, so if someone could show me a way in bash I'd be most pleased.

What information would you require in order to help me write a suitable script?

the directory where I need to check the age of the files is data:\gw\gwdom\wpgate\gwia\third\send

If the oldest message is more than 15 minutes old, then unload gwia.nlm and gee2.nlm and gee2web.nlm and reload them.
Bash is from the Unix world, and it's remarkably powerful. Most of my bash experience is in Linux, and using it in NetWare is a little strange, so I tend to do as much as possible with a regular NCF file, and use bash for the stuff a plain batch file can't handle.

So for example, in your case, I would create an NCF file that unloads both GWIA and GEEwhiz,  waits an appropriate  time, and then reloads them. Then use bash to actually test for the file timestamps, and run the NCF file if the time is exceeded.
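The bash half of that split can be sketched roughly as follows. This is only an outline under assumptions that are not from this thread: it relies on GNU `find` (`-maxdepth`, `-mmin`), the function names `has_stale_files` and `restart_if_stale` are made up, and the command that actually launches the NCF file is passed in as an argument because how that is done depends on your server.

```bash
#!/bin/bash
# Sketch: run a restart command only when the queue holds a stale message.
# Assumptions: GNU find is available, and the restart command handed in is
# whatever launches your own unload/reload NCF file.

# Succeeds (exit 0) if any regular file directly in $1 is older than $2 minutes.
has_stale_files() {
    local dir=$1 minutes=$2
    [ -n "$(find "$dir" -maxdepth 1 -type f -mmin +"$minutes" 2>/dev/null | head -n 1)" ]
}

# Usage: restart_if_stale QUEUE_DIR MINUTES COMMAND [ARGS...]
restart_if_stale() {
    local dir=$1 minutes=$2
    shift 2
    if has_stale_files "$dir" "$minutes"; then
        echo "Stale messages in $dir; running: $*"
        "$@"                               # run the supplied restart command
    else
        echo "Queue in $dir looks healthy; nothing to do."
    fi
}
```

Called from cron every 20 minutes or so with the queue folder, a limit of 15, and the restart command, this only touches GWIA and GEE Whiz when a message has actually been sitting too long, rather than bouncing them on every run.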

I'll be back with something more specific a little later, my wife is waiting for me right now.



ASKER CERTIFIED SOLUTION
billmercer

This solution is only available to members.
well done sir I certainly appreciate the detail included in this post. I will try it out in the next few days and let you know the results.
I have also increased the points given the amount of time that must have taken you.
The Asker wrote me directly, and I regret it's been so long since I could sit down and compose a reply to this Question.

Personally, I suspect the Bash script supplied by billmercer is the way to go. I've discovered a number of serious limitations in the NetWare-based Perl implementations (obviously not present in OES-Linux), and while what the Asker needs could probably be engineered in Perl, I suspect it would involve a lot of trial-n-error. The Bash-based method is probably more straightforward and easily debugged.

I'll note in passing that many NLMs cannot be launched from a Bash script under NetWare. Never tried it with any of the GroupWise Agents.
/me wracks his brain as to the identity of PsiCop. I am not sure when I contacted you directly?
>I'll note in passing that many NLMs cannot be launched from a Bash script under NetWare.

PsiCop, do you have a list or reference on this? I haven't encountered any problems in this regard.
I generally use bash to run an ncf file that actually launches the NLM, so maybe that would bypass any such problems? I know it works for my GroupWise agents.
Bill, it's gonna be a few days before I can try this, given I have staff away sick etc. I'll be in touch as soon as I can.
networkn, I'll reply to your E-Mail when I get the chance.

billmercer, I know from experience that the OpenSSH-derived NLMs (e.g. SSH, SFTP) cannot be launched from a Bash shell on NetWare. I don't have a reference listing (excellent question to ask at BrainShare this year) but I've heard anecdotally that a significant number of NLMs suffer this limitation, especially if they were compiled using older libraries. I'm glad to hear it works for the GroupWise agents.
I'll accept this, though I don't have time to test it properly.