Red Hat 9 server hangs

Hi experts,

I'm having a problem with my server running Red Hat Linux 9. The output of "uname -a" is:

Linux Premium 2.4.20-8smp #1 SMP Thu Mar 13 17:45:54 EST 2003 i686 i686 i386 GNU/Linux

It is a quad-CPU clone server with 2 GB of RAM. The server hosts a few custom applications written by me. It hangs once in a while, at random; the interval between hangs can be anything from a few hours to a few weeks. When it hangs, I can still "ping" the machine, but I can't "ssh" to it, and my applications stop working at the same time. A hard reboot is required to bring the server back to normal.

I've checked the /var/log/messages file, but to no avail. All I can see from the file is that there's no activity when the server hangs.

Can someone guide me with the troubleshooting? It looks like a hardware problem to me but I don't know how to check.
Asked by yeewee64
 
gtkfreak commented:
You can check whether the RAM is okay. Enable detailed diagnostics in the computer's BIOS to see if you get the correct amount of RAM displayed on screen, with each memory location verified.
 
yeewee64 (Author) commented:
May I know how to do that? Sorry, I'm not familiar with hardware diagnostics...
 
Anonymouslemming commented:
Your best bet would be to download memtest86+ from http://www.memtest.org/ and use that.

I advise downloading the CD ISO image, burning that image to a CD and booting from it.
 
yeewee64 (Author) commented:
Thanks Anonymouslemming for the URL, I'm going to download it.

However, I have an issue here: I can't simply shut down the server, as it's serving live traffic. Although it does hang sometimes, when it doesn't hang, everything works fine. So I'm going to suggest to my superior that we arrange downtime, or run the memory test when it hangs again.

In the meantime, is there any system log file that would say the hang was caused by a hardware failure, such as RAM or the hard disk?
 
Anonymouslemming commented:
Not really - when you die from a hardware failure, the OS generally doesn't get time to tell you about it.

You can get some kit with predictive failure analysis, but that generally costs quite a bit. What hardware are you using?
 
yeewee64 (Author) commented:
It is an Intel server. Sorry, initially I thought it was a clone server... anyway, it looks like one :)

So, based on your experience, besides RAM, what other hardware component failures can cause this problem? I'll need to run as many diagnostics as I can during the downtime. Any advice?
 
Anonymouslemming commented:
Pretty much anything, from the CPU to the motherboard or power supply, could do it.

Is there no pattern at all to the hangs?
 
yeewee64 (Author) commented:
Wow...sounds like the quick fix will be to replace the server :)

I can't see any pattern to the hangs. We run a cron job to capture the memory usage. Normally, when it hangs, the amount of free memory is either at its lowest (about 500-800 MB free) or somewhere close to the lowest point since boot. I'm not sure whether this means anything.
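
A cron job of this kind could look roughly like the sketch below. It is not the script actually used on this server: the function name log_mem, the script path and the log path are hypothetical, and it assumes /proc/meminfo contains "MemFree:" and "SwapFree:" lines.

```shell
#!/bin/sh
# Hypothetical helper: print one timestamped line with free memory and free swap.
# $1 is the meminfo file to read (defaults to /proc/meminfo).
log_mem() {
    meminfo=${1:-/proc/meminfo}
    awk -v ts="$(date '+%Y-%m-%d %H:%M:%S')" '
        /^MemFree:/  { memfree = $2 }
        /^SwapFree:/ { swapfree = $2 }
        END { printf "%s MemFree=%skB SwapFree=%skB\n", ts, memfree, swapfree }
    ' "$meminfo"
}

# In the cron script itself, finish with a call such as:
#   log_mem >> /var/log/memwatch.log
# Example crontab entry (assumed script path), run once a minute:
#   * * * * * /usr/local/bin/memwatch.sh
```

The last line written before a hang shows how much memory and swap were free at that point. A low MemFree figure on its own is not alarming, since Linux uses spare RAM for buffers and cache; SwapFree running down as well is the more telling sign.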

The CPU usage is generally low all the time.
 
Anonymouslemming commented:
What kernel are you running? Are you running into some of the OOM (out-of-memory) crashes that have been seen?
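
One way to check for OOM-killer activity is to search the system log for its messages. This is only a sketch: it assumes the kernel logs a line containing "Out of Memory" (or "oom-killer" on later kernels) when it kills a process, that the exact wording may differ between kernel versions, and that syslog writes kernel messages to /var/log/messages. The function name find_oom is hypothetical.

```shell
# Hypothetical helper: list log lines that look like OOM-killer activity.
# $1 is the log file to search (defaults to /var/log/messages).
find_oom() {
    grep -i -E 'out of memory|oom-killer' "${1:-/var/log/messages}"
}

# Usage (rotated logs such as /var/log/messages.1 can be checked the same way):
#   find_oom /var/log/messages
```

No output means no matching lines were logged. That does not rule the OOM theory out, because a machine that locks up hard may never get the message written to disk.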
 
yeewee64 (Author) commented:
I don't know about the OOM crashes. Could you enlighten me?

BTW, the kernel that I'm using is 2.4.20-8, the complete output is "Linux 2.4.20-8smp #1 SMP Thu Mar 13 17:45:54 EST 2003 i686 i686 i386 GNU/Linux".
 
wesly_chen commented:
Hi,

   Besides a hardware problem, it could be a kernel bug.

  As root,
rpm -ivh http://download.fedoralegacy.org/redhat/9/updates/i386/kernel-smp-2.4.20-42.9.legacy.i686.rpm
Then edit /etc/grub.conf and set "default=0".

  So it will load the new kernel at reboot.
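
Before rebooting, it may be worth confirming which entry GRUB will actually boot. The sketch below assumes the usual grub.conf layout, with a "default=N" line and one "title" line per entry, numbered from 0; the function name grub_default_title is hypothetical.

```shell
# Hypothetical helper: print the "title" line GRUB will boot by default.
# $1 is the GRUB config file (defaults to /etc/grub.conf).
grub_default_title() {
    awk '
        BEGIN         { def = 0 }                    # GRUB boots entry 0 if no default is set
        /^default=/   { def = substr($0, 9) + 0 }    # value after "default="
        /^title[ \t]/ { titles[n++] = $0 }           # entries are numbered from 0
        END           { print titles[def] }
    ' "${1:-/etc/grub.conf}"
}

# Usage:
#   grub_default_title /etc/grub.conf
```

After the reboot, "uname -r" should report the new kernel version rather than 2.4.20-8smp.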

   Also, turn on verbose logging in your application to see if there are any suspicious messages.

Wesly
 
HollyRidge commented:
One thing I normally do to help diagnose crashes is to leave an ssh window open (using PuTTY) running top (top -cd 1). When the server crashes, you can go back to that window and see whether a load or out-of-memory/swap problem may be causing it. The bad thing is that most of the time when Linux machines crash they are still pingable, but all other processes and services stop responding.

If the server crashes while plenty of resources are still available, then it is more than likely a kernel panic and/or a hardware problem. In that case I would suggest having someone hook up a console to the machine before rebooting it and report any errors or output on the screen. This is usually fairly helpful in tracking down issues like these. Sometimes the system can print to the screen but cannot write to the logs, which is why the logs show up clean.

A kernel panic could still be a very good indicator of a hardware issue, depending on what it shows. For kernel panics, depending on the error, I would try upgrading your kernel and see if that helps. If the logs are clear and the screen is clear, then more than likely you do have a hardware problem.

As for memtest, it is good, but even if the memory passes the test, it could still be bad. I have run into this a few times in the past. Hope this helps, as these things are a real pain to track down.
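
The open-window idea can be taken one step further by recording the output on the machine you are watching from, so the last readings survive the hang. A minimal sketch, assuming vmstat is installed on the server and that you can ssh in from a second machine; the user name, the log file name and the function name stamp are hypothetical.

```shell
# Hypothetical helper: prefix every line read from stdin with a timestamp.
stamp() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done
}

# Usage, run on a second machine (user and file names are assumptions):
#   ssh user@Premium 'vmstat 5' | stamp >> premium-vmstat.log
```

Because the log lives on the observing machine, the final lines before the session froze show the memory, swap and CPU figures closest to the hang.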
 
yeewee64 (Author) commented:
Hi HollyRidge, thanks for your comment. The server hung again last week and, due to pressure from management, the applications have been migrated to a lower-spec spare machine. Everything is working fine so far, and I'll try to get the opportunity to diagnose the problematic machine.

If the spare machine works well, perhaps management will think the problem is solved and won't want me to spend time on the problematic machine again. Anyway, I'll try my best to carry out the diagnostics that you guys have suggested. If I don't get to do that in the next 2 weeks or so, I'll close the topic and split the points. Does that sound OK to you all?
 
HollyRidge commented:
I totally understand. Sometimes it just works out that way to satisfy clients, etc. Good luck with it, and don't forget to let us know how it turns out.
 
yeewee64 (Author) commented:
Hi guys, I haven't had the chance to do further diagnostics to find out which server component caused the problem. However, we did leave the server running without our applications on it. It has stayed "alive" since mid-March.

On the other hand, the standby machine that took over the job of hosting our applications has been working fine since mid-March too.

I can't jump to any conclusion without solid evidence, but my guess is that the problem was due to RAM. Our applications, written in C, keep allocating and freeing memory. When the traffic is high, the amount of memory allocated can be huge, and that is normally when the server hung. Now, without any application allocating large chunks of memory, the server seems to be working fine.

Anyway, thank you guys for your input!