yeewee64

Redhat 9 hang

Hi experts,

I'm having a problem with my server running Red Hat Linux 9. The output of "uname -a" is:

Linux Premium 2.4.20-8smp #1 SMP Thu Mar 13 17:45:54 EST 2003 i686 i686 i386 GNU/Linux

It is a quad-CPU clone server with 2 GB of RAM. The server hosts a few custom applications written by me. It hangs once in a while, at random: the interval between hangs can be a few hours or a few weeks. When it hangs, I can still "ping" the machine, but I can't "ssh" to it. My applications also stop working at the same time. It takes a hard reboot to bring the server back to normal.

I've checked the /var/log/messages file, but to no avail. All I can see is that nothing is logged while the server is hung.

Can someone guide me through the troubleshooting? It looks like a hardware problem to me, but I don't know how to check.
ASKER CERTIFIED SOLUTION
Member_2_1239314

This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
yeewee64

ASKER

May I know how to do that? Sorry, I'm not familiar with hardware diagnostics...
SOLUTION
Thanks Anonymouslemming for the URL, I'm going to download it.

However, I have an issue here: I can't simply shut down the server, as it's serving live traffic. Although it does hang sometimes, everything works fine when it doesn't. So I'm going to suggest to my superior that we either arrange downtime or run the memory test the next time it hangs.

In the meantime, is there any system log file that would show that a hang was caused by a hardware failure, such as RAM or a hard disk?
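One low-risk thing that can be done while the server stays in service is to scan /var/log/messages for lines that often accompany kernel or hardware trouble. The following is a minimal sketch, not a definitive check: the keyword list is an assumption (exact message wording varies by kernel version), and the function names are made up for illustration. A hard hang frequently leaves nothing in the log at all, so an empty result does not rule out a hardware fault.

```python
import re

# Hypothetical keyword list: substrings that often accompany kernel-level
# trouble. The exact wording of these messages varies by kernel version.
SUSPECT_PATTERNS = [
    r"out of memory",
    r"oops",
    r"i/o error",
    r"machine check",
    r"panic",
]
SUSPECT_RE = re.compile("|".join(SUSPECT_PATTERNS), re.IGNORECASE)


def find_suspect_lines(lines):
    """Return (line_number, text) pairs for lines matching any pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if SUSPECT_RE.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log(path="/var/log/messages"):
    """Scan a syslog file; reading it usually requires root."""
    with open(path, errors="replace") as handle:
        return find_suspect_lines(handle)
```

Calling `scan_log()` as root returns the matching lines with their line numbers; the entries just before each reboot are the ones worth reading closely.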
SOLUTION
It is an Intel server. Sorry, initially I thought it was a clone server... anyway, it looks like one :)

So, based on your experience, which hardware components besides RAM can cause this problem when they fail? I'll need to run as many diagnostics as I can during the downtime. Any advice?
Pretty much anything could do it: the CPU, the motherboard, or the power supply.

Is there no pattern at all to the hangs?
Wow...sounds like the quick fix will be to replace the server :)

I can't see any pattern to the hangs. We run a cron job to capture the memory usage. Normally, when the server hangs, the amount of free memory is either at its lowest (about 500-800 MB free) or close to the lowest point since boot. I'm not sure whether this means anything.

The CPU usage is generally low all the time.
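A cron job like the one described could be sketched as below. This is an illustration under stated assumptions, not the script actually used on this server: it assumes /proc/meminfo contains "Key: value kB" lines, and the function names and the choice of fields are hypothetical. Logging Buffers and Cached next to MemFree may help interpret the numbers, because Linux uses otherwise idle memory for caching, so a low MemFree value alone does not mean memory is exhausted.

```python
import time


def parse_meminfo(text):
    """Parse 'Key:   12345 kB' lines into a dict of integer kB values.

    Lines in any other shape (such as the summary table that older
    kernels print first) are skipped.
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and len(fields) == 2 and fields[0].isdigit() and fields[1] == "kB":
            values[key.strip()] = int(fields[0])
    return values


def format_sample(values, now=None):
    """Build one timestamped log line from the parsed values."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    wanted = ("MemFree", "Buffers", "Cached", "SwapFree")
    parts = [f"{name}={values[name]}kB" for name in wanted if name in values]
    return f"{stamp} " + " ".join(parts)


def sample_meminfo(path="/proc/meminfo"):
    """Read the kernel's memory counters and return one log line."""
    with open(path) as handle:
        return format_sample(parse_meminfo(handle.read()))
```

Appending the result of `sample_meminfo()` to a file once a minute gives a record that ends shortly before each hang, which can then be compared across incidents.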
What kernel are you running? Are you running into some of the OOM (out-of-memory) crashes that have been seen?
I don't know about the OOM crashes, could you enlighten me?

BTW, the kernel that I'm using is 2.4.20-8, the complete output is "Linux 2.4.20-8smp #1 SMP Thu Mar 13 17:45:54 EST 2003 i686 i686 i386 GNU/Linux".
SOLUTION
SOLUTION
Hi HollyRidge, thanks for your comment. The server hung again last week and, due to pressure from management, the applications have been migrated to a lower-spec spare machine. Everything is working fine so far, and I'll try to get the opportunity to diagnose the problematic machine.

If the spare machine works well, management may think the problem is solved and won't want me to spend more time on the problematic machine. Anyway, I'll try my best to carry out the diagnostics that you guys have suggested. If I don't get to do that in the next 2 weeks or so, I'll close the topic and split the points. Does that sound OK to you all?
I totally understand. Sometimes it just works out that way to satisfy clients, etc. Good luck with it, and don't forget to let us know how it turns out.
Hi guys, I haven't had the chance to do further diagnostics to find out which server component caused the problem. However, we did leave the server running without our applications on it. It has stayed "alive" since mid-March.

On the other hand, the standby machine that took over hosting our applications has also been working fine since mid-March.

I can't jump to any conclusion without solid evidence, but my guess is that the problem was due to RAM. Our applications, written in C, keep allocating and freeing memory. When traffic is high, the amount of memory allocated can be huge, and that is normally when the server hung. Now, without any application that allocates large chunks of memory, the server seems to be working fine.
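One way to probe that guess on the idle machine is to imitate the allocate-and-free pattern and verify what was written. The sketch below is only an illustration in Python, not the original C applications, and every name and parameter in it is hypothetical. It runs in user space, so it cannot reach memory held by the kernel and is not a substitute for a boot-time memory tester; a clean run proves little, while a reported mismatch would be a strong hint.

```python
def verify(block, pattern):
    """Return True if every byte in `block` equals `pattern`."""
    return block.count(bytes([pattern])) == len(block)


def churn(rounds=100, block_size=1 << 20, blocks_per_round=64):
    """Repeatedly allocate, verify, and free blocks of memory.

    Returns a list of (round, block_index, pattern) for every block whose
    contents did not match what was written.
    """
    failures = []
    patterns = (0x00, 0xFF, 0x55, 0xAA)  # alternate bit patterns per round
    for round_number in range(rounds):
        pattern = patterns[round_number % len(patterns)]
        # Hold all blocks at once so the round touches a large region.
        held = [bytearray([pattern]) * block_size for _ in range(blocks_per_round)]
        for index, block in enumerate(held):
            if not verify(block, pattern):
                failures.append((round_number, index, pattern))
        del held  # release the memory before the next round
    return failures
```

With the defaults each round holds 64 MB; raising `blocks_per_round` toward the amount the applications used at peak traffic would be closer to the conditions under which the server hung.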

Anyway, thank you guys for your input!