Solved

Server randomly rebooting (NW65 SP5)

Posted on 2006-05-01
Last Modified: 2008-02-01
Hello experts,

I am having some issues with one of our NetWare servers and I am looking for some advice.

Our St. Louis server has been randomly rebooting itself.  I don't see anything being logged as to a reason why, nothing in the abend log, etc.  I blew it off the last few weeks knowing we were going to replace that server since it was end of lease.  I just replaced it last Monday (with Novell's migration utility), and surprisingly, it has rebooted itself three times now for no apparent reason.  The OS was upgraded when we migrated hardware; it was NW6 SP4 and is now running NW65 SP5.  We have two other servers, Chicago and Milwaukee, that have already been upgraded as well and they are not experiencing any problems.  All are on the same hardware and OS level.

I started blaming it on a directory issue since that is obviously brought over during migration.  I removed all the replicas from the server over the weekend.  Interestingly enough, as I was removing the ROOT partition (which is no longer needed there anyway), the server rebooted itself.  I was able to get all the replicas off however, and things looked ok.  I put the "St Louis" and "Applications" replicas back on Sunday afternoon.  Things looked stable and the server was up for about 36 hours, until just now when it rebooted itself.  The reboots happen at any time, even during off hours, no pattern.  

Any ideas for me to try?  Any way to get more logging?  I appreciate any assistance.


PS.  I should note that I am seeing three errors in the directory.  I am working to resolve those, however, I'm "assuming" that's not really causing the reboots.  The master replicas don't show these errors, but several of my district servers do.  Even though several district servers see these errors, only my St. Louis server is having the reboot problem.  I'll include them anyway, but unless you think this is the root cause, this is not my priority right now.  Thanks again!!

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
ERROR: Illegal timestamps were found in this replica.
You may need to run the advanced option:
   'Repair timestamps and declare new epoch'
Value: 66A65019, ID: 0000802C, DN: CN=FDLDBMS.OU=City1.O=Company.T=Company_TREE
Time stamp:  7-28-2024   9:05:13 am; rep # = 000C; event = 1AA9

ERROR: Illegal timestamps were found in this replica.
You may need to run the advanced option:
   'Repair timestamps and declare new epoch'
Value: 8C260E2B, ID: 00008420, DN: CN=sbirschb.OU=City1.O=Company.T=Company_TREE
Time stamp:  7-04-2044  11:03:55 pm; rep # = 000C; event = 0A39

ERROR: Illegal timestamps were found in this replica.
You may need to run the advanced option:
   'Repair timestamps and declare new epoch'
Value: 66A65018, ID: 000081CA, DN: CN=MKE.OU=City2.O=Company.T=Company_TREE
Time stamp:  7-28-2024   9:05:12 am; rep # = 000B; event = 0186
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


Question by:rvthost
12 Comments
 
LVL 1

Accepted Solution

by:
pspencer53 earned 800 total points
ID: 16578889
It certainly sounds like hardware.  Just to be clear, you said that you upgraded the hardware when you migrated the server.  Did you upgrade the UPS etc as well?

Assuming that you have verified all external hardware issues, does the server ABEND.LOG say anything?  What about the server hardware log (Critical Errors, etc.)?
 
LVL 11

Author Comment

by:rvthost
ID: 16578999
I initially expected a hardware issue, but yes, the server hardware was replaced and the reboots continued.  

The UPS was not replaced, and actually, the battery is bad in it and is being swapped out tonight by our remote technician.  There is no PowerChute software installed or anything, but do you think that could be a potential reason?

The abend log is empty, and the sys$log.err and health.log look fine.  It's an HP server, so I have checked cpqiml as well as Insight Manager and all looks fine.

Thanks.
 
LVL 34

Assisted Solution

by:PsiCop
PsiCop earned 600 total points
ID: 16585168
I'd suspect power as the culprit.

If replacing the UPS does not solve the issue, then I think the server hardware has fallen significantly in terms of being a possible culprit, and it's time to look at other possibilities.

NetWare v6.5 SP5 is recent code, so it's probably not some known but unpatched issue.

If the reboots persist past replacing the UPS, then we need to know more about the hardware. Also, are there any entries in SYS:SYSTEM\ABEND.LOG that correspond to the reboots? Is the server parameter AUTO RESTART AFTER ABEND set to 0?
 
LVL 11

Author Comment

by:rvthost
ID: 16585837
Thanks for the comments PsiCop.  

The UPS battery was replaced last night.  The server has been up for 16 hours and counting, so we will see :)

The abend.log file is empty so it appears that it is not really abending.  As such, I have not set the AUTO RESTART AFTER ABEND settings.  If the server bounces again, I'll post more hardware information, but as I mentioned, we have the exact same server hardware in other locations that are not rebooting, same OS level, etc.  Thanks!

Ryan
 
LVL 35

Assisted Solution

by:ShineOn
ShineOn earned 600 total points
ID: 16586574
Yes, a flaky UPS can cause random crashes.  There weren't abends, just hardware crashes, which is why you couldn't find anything logged.

Your eDirectory errors are another issue that must be attended to.  You probably had/have a timesync issue that caused synthetic time to be used, resulting in the illegal timestamps.  If your timesync isn't off anymore, then performing the recommended actions (repair timestamps and declare a new epoch) should fix the problems.
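To see why the repair tool would flag those entries, the hex "Value" fields from the three errors in the question can be decoded. This is only a minimal sketch: it assumes each Value is a count of seconds since 1970-01-01 UTC and that the server displays US Central Daylight Time (UTC-5). Neither assumption comes from Novell documentation, but both are consistent with the times printed in the error output. The names `decode`, `is_future`, `CDT` and `POSTED` are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hex "Value" fields copied from the three errors in the question.
VALUES = ["66A65019", "8C260E2B", "66A65018"]

# Assumption: the server shows US Central Daylight Time (UTC-5).
CDT = timezone(timedelta(hours=-5))

# Assumption: Value counts seconds from this epoch.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Date the question was posted, used as a stand-in for the server's "now".
POSTED = datetime(2006, 5, 1, tzinfo=timezone.utc)


def decode(value_hex: str) -> datetime:
    """Interpret a hex Value as seconds since the epoch (assumed format)."""
    return EPOCH + timedelta(seconds=int(value_hex, 16))


def is_future(value_hex: str, now: datetime = POSTED) -> bool:
    """True if the decoded timestamp lies after the reference time."""
    return decode(value_hex) > now


if __name__ == "__main__":
    for value in VALUES:
        local = decode(value).astimezone(CDT)
        status = "future" if is_future(value) else "ok"
        print(value, local.strftime("%m-%d-%Y %I:%M:%S %p"), status)
```

Under those assumptions the decoded values line up with the dates in the error output (2024 and 2044), all well after the 2006 posting date, which fits a clock or timesync problem having stamped objects with future times.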

It really should have a read/write replica of [Root] if it's the only server at that WAN location, to a) minimize tree-walking, b) allow access & support services if the WAN is down and c) provide the recommended 3rd replica for fault-tolerance.  Before you put it back on, though, you should resolve the illegal timestamps issue.
 
LVL 11

Author Comment

by:rvthost
ID: 16588393
ShineOn, thanks for your comments as well.  So far, so good, so hopefully the UPS replacement resolved the issue.

Regarding the timesync issues, I did do the repair and declare a new epoch over the weekend, and that did not resolve the problems.  I'll play around with that issue some more.  If it continues, I'll open a separate question since that digresses a bit.  Thanks again!
 
LVL 35

Expert Comment

by:ShineOn
ID: 16588523
That kinda thing is done as an advanced repair to the local database, IIRC, so it needs to be done on each server holding those replicas.
 
LVL 11

Author Comment

by:rvthost
ID: 16588637
I went by this info:

http://www.novell.com/support/search.do?cmd=displayKC&docType=kc&externalId=10020107&sliceId=&dialogID=2703561&stateId=0%200%202705935

I did it from the Masters, which should basically recreate that particular partition on any remote servers.  It seemed to perform normally; the remotes temporarily went into a "new" state.  However, it didn't fix it.

I assume this may be the next step.  I have not yet done this:

http://www.novell.com/support/search.do?cmd=displayKC&docType=kc&externalId=10024758&sliceId=&dialogID=2703558&stateId=0%200%202705920
 
LVL 35

Expert Comment

by:ShineOn
ID: 16589727
That's right - declaring a new epoch does refresh the partition through the replica ring and should be done on the replica master.  Sorry. :(

You had a mix of NetWare 6 and NetWare 6.5.  Do you have a consistent eDirectory version across all the servers?  If not, is Master of [Root] on a 6.5 server (the latest eDirectory version in your environment) as it should be?
 
LVL 11

Author Comment

by:rvthost
ID: 16589840
No problem! :)

We are still mixed between NW6 and NW65.  The NetWare 6 boxes are on eDir 8.7.3.  But yes, the Master is on NW6.5 SP5 as it should be.  We won't be completely upgraded to 6.5 until late August.  Thanks again for the assistance.

I'll try to give that second TID a shot later in the week. Have you ever used the -xk3 switch?  I assume that doesn't start blowing away replicas like -xk2 does? :)
 
LVL 35

Expert Comment

by:ShineOn
ID: 16591056
IIRC, that kills hanging obits.  It might work, but since the TID is for older versions of NetWare, I'd research it a bit more before using any -xkwhatever switches.  It's not the kind of thing to do just to see if it works ;)
 
LVL 11

Author Comment

by:rvthost
ID: 16629640
Thanks everyone for the comments.  The server has apparently stopped rebooting itself so it appears the UPS was in fact the culprit.  I'll plan to split points.  Thanks again for your help!