Solved

Redhat 9 hang

Posted on 2005-03-15
Last Modified: 2010-04-20
Hi experts,

I'm having a problem with my server running Red Hat Linux 9. The output of "uname -a" is:

Linux Premium 2.4.20-8smp #1 SMP Thu Mar 13 17:45:54 EST 2003 i686 i686 i386 GNU/Linux

It is a quad-CPU, 2G RAM clone server. The server hosts a few custom applications written by me. It hangs once in a while, at random; it could be a few hours or a few weeks between hangs. When it hangs, I can "ping" the machine, but I can't "ssh" to the server. My applications also stop working at the same time. It requires a hard reboot to bring the server back to normal.

I've checked the /var/log/messages file but to no avail. All I can see from the file is that there's no activity when the server hangs.

Can someone guide me with the troubleshooting? It looks like a hardware problem to me but I don't know how to check.
Question by:yeewee64
15 Comments
 
LVL 9

Accepted Solution

by:
gtkfreak earned 300 total points
ID: 13552818
You can check whether the RAM is okay. Enable detailed diagnostics in the computer's BIOS to see if you get the correct amount of RAM displayed on your screen, with the verification of each memory location.
 

Author Comment

by:yeewee64
ID: 13552835
May I know how to do that? Sorry, I'm not familiar with the diagnostic of hardware...
 
LVL 5

Assisted Solution

by:Anonymouslemming
Anonymouslemming earned 600 total points
ID: 13554002
Your best bet would be to download memtest86+ from http://www.memtest.org/ and use that.

I advise downloading the CD ISO image, burning that image to a CD and booting from it.
 

Author Comment

by:yeewee64
ID: 13554059
Thanks Anonymouslemming for the URL, I'm going to download it.

However, I have an issue here, I can't simply shut down the server as it's serving live traffic. Although it does hang sometimes, when it doesn't hang, everything works fine. So what I'm going to do is to suggest to my superior and arrange a downtime or do the memory test when it hangs again.

In the meantime, is there any system log file that would show whether the hang was caused by a hardware failure, such as RAM or hard disk?
 
LVL 5

Assisted Solution

by:Anonymouslemming
Anonymouslemming earned 600 total points
ID: 13560178
Not really - when you die from a hardware failure, the OS generally doesn't get time to tell you about it.

You can get some kit with predictive failure analysis, but that generally costs quite a bit. What hardware are you using?
 

Author Comment

by:yeewee64
ID: 13561591
It is an Intel server, sorry, initially I thought it's a clone server...anyway it looks like one :)

So, based on your experience, besides RAM, what other hardware component failures can cause this problem? I'll need to do as many diagnostics as required during the downtime, any advice?
 
LVL 5

Expert Comment

by:Anonymouslemming
ID: 13562931
Pretty much anything from CPU, motherboard, or power supply could do it.

Is there no pattern at all to the hangs?
 

Author Comment

by:yeewee64
ID: 13571699
Wow...sounds like the quick fix will be to replace the server :)

I can't see any pattern to the hangs, we have run a cron job to capture the memory usage. Normally, when it hangs, the amount of free memory is either at the lowest (about 500-800MB free) or somewhere close to the lowest point since boot up. I'm not sure this means anything.
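
For readers who want to set up a similar capture, here is a minimal sketch of what such a cron-driven memory logger might look like. It is not the script used on this server; the function names `parse_meminfo`, `format_sample`, and `read_sample` are made up for illustration, and it assumes a Python 3 interpreter, which a stock Red Hat 9 install would not have.

```python
import time


def parse_meminfo(text):
    """Return a dict of field name -> value in kB from /proc/meminfo-style text."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only "Name:   12345 kB" lines; older kernels also print a
        # summary table at the top whose rows do not fit this shape.
        if sep and len(parts) == 2 and parts[0].isdigit() and parts[1] == "kB":
            values[name.strip()] = int(parts[0])
    return values


def format_sample(values, now=None):
    """Build one log line: a timestamp followed by the fields of interest."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    fields = ["MemTotal", "MemFree", "SwapTotal", "SwapFree"]
    body = " ".join(f"{key}={values.get(key, 'NA')}" for key in fields)
    return f"{stamp} {body}"


def read_sample(path="/proc/meminfo"):
    """Read the live counters and return one formatted log line."""
    with open(path) as fh:
        return format_sample(parse_meminfo(fh.read()))
```

Calling `print(read_sample())` from a script that cron runs every minute, with the output appended to a file, gives a timeline to compare against the time of each hang. Since a hung machine may never flush its last writes, the final sample before a hang can be missing.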

The CPU usage is generally low all the time.
 
LVL 5

Expert Comment

by:Anonymouslemming
ID: 13572889
What kernel are you running? Are you running into some of the OOM crashes that have been seen?
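
"OOM" here means out-of-memory: when the kernel runs out of memory it may kill processes, and it normally logs a message when it does. One way to check for that is to search the system log for such messages. The sketch below assumes the message contains the phrase "out of memory" (capitalisation differs between kernel versions, so the match is case-insensitive) and that syslog writes kernel messages to /var/log/messages. The function names are illustrative, and Python 3 is assumed.

```python
import re

# Assumption: OOM-killer messages contain the phrase "out of memory";
# the exact wording varies between kernel versions.
OOM_PATTERN = re.compile(r"out of memory", re.IGNORECASE)


def find_oom_lines(lines):
    """Return the log lines that look like OOM-killer messages."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


def scan_log(path="/var/log/messages"):
    """Scan a syslog file for OOM-killer messages (needs read permission)."""
    with open(path, errors="replace") as fh:
        return find_oom_lines(fh)
```

An empty result does not rule out memory exhaustion, because a machine that hangs hard may not get the message onto disk.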
 

Author Comment

by:yeewee64
ID: 13572941
I don't know about the OOM crashes, could you enlighten me?

BTW, the kernel that I'm using is 2.4.20-8, the complete output is "Linux 2.4.20-8smp #1 SMP Thu Mar 13 17:45:54 EST 2003 i686 i686 i386 GNU/Linux".
 
LVL 38

Assisted Solution

by:wesly_chen
wesly_chen earned 300 total points
ID: 13580104
Hi,

   Besides a hardware problem, it could be a kernel bug.

  As root,
rpm -ivh http://download.fedoralegacy.org/redhat/9/updates/i386/kernel-smp-2.4.20-42.9.legacy.i686.rpm
And modify /etc/grub.conf to set to "default=0".

  So it will load the new kernel at reboot.
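
GRUB numbers the "title" entries in grub.conf from 0, so "default=0" selects the first entry in the file. For anyone scripting that edit rather than making it by hand, here is a small sketch. The helper name `set_grub_default` is made up for illustration, it works on the file's text rather than touching the file itself, and it assumes Python 3. Check that the new kernel really is the first entry, and back up /etc/grub.conf before writing anything to it.

```python
import re


def set_grub_default(conf_text, index=0):
    """Return grub.conf text with its 'default=' line set to the given index."""
    new_line = f"default={index}"
    # Replace the first existing "default=N" line, if there is one.
    updated, hits = re.subn(
        r"(?m)^default=\d+[ \t]*$", new_line, conf_text, count=1
    )
    if hits == 0:
        # No default line present: put one at the top of the file.
        updated = new_line + "\n" + conf_text
    return updated
```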

   Also, turn on verbose logging in your application to see if there are any suspicious messages.

Wesly
 
LVL 1

Assisted Solution

by:HollyRidge
HollyRidge earned 300 total points
ID: 13627305
One thing I normally do to help diagnose crashing machines is leave an ssh shell window open (using PuTTY) running top as "top -cd 1". When the server crashes, you can go back to that window and see whether a load or out-of-memory/swap issue may be causing the problem. The bad thing is that most of the time when Linux-based machines crash they are still pingable, but all other processes and services stop responding.

If the server crashes and you still have plenty of server resources available, then more than likely it is a kernel panic and/or a hardware problem. In that case I would suggest having someone hook up a console to the machine prior to rebooting it and report any errors or output on the screen. This is usually fairly helpful in tracking down issues such as these: sometimes the system can still output to the screen but is unable to write to the logs, which is why the logs show up clean.

A kernel panic could still be a very good indicator of a hardware issue, depending on what it shows. For kernel panics, depending on the error, I would try upgrading your kernel to see if that helps. If the logs are clear and the screen is clear, then more than likely you do have a hardware problem.

As for memtest, it is good, but even if memory passes the test it could still be bad. I have run into this a few times in the past. Hope this helps, as these things are a real pain to track down.
 

Author Comment

by:yeewee64
ID: 13627460
Hi HollyRidge, thanks for your comment. The server hung again last week and, due to pressure from management, the applications have been migrated to a lower-spec spare machine. Everything is working fine so far, and I'll try to get the opportunity to diagnose the problematic machine.

If the spare machine works well, perhaps the management will think the problem is solved and wouldn't want me to spend time on the problematic machine again. Anyway, I'll try my best to carry out diagnostics that you guys have suggested. If I don't get to do that in the next 2 weeks or so, I'll close the topic and split the points, does that sound OK to you all?
 
LVL 1

Expert Comment

by:HollyRidge
ID: 13627664
I totally understand. Sometimes it just works out that way to satisfy clients, etc. Good luck with it, and don't forget to let us know how it turns out.
 

Author Comment

by:yeewee64
ID: 13751611
Hi guys, I haven't got the chance to do further diagnostics to find out which server component caused the problem. However, we did leave the server running without running our applications in it. It's still "alive" since mid-March.

On the other hand, the standby machine that took over the job of hosting our applications has been working fine since mid-March too.

I can't jump to any conclusion without solid evidence, but I guess that the problem was due to RAM. Our applications, written in C, keep allocating and freeing memory. When the traffic is high, the amount of memory allocated can be huge, and normally that's when the server hangs. Now, without any application that allocates large chunks of memory, the server seems to be working fine.
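
A rough way to mimic that allocation pattern on the idle machine is to allocate a large buffer, fill it with a known byte pattern, and read it back. The sketch below is illustrative only: `check_memory` is a made-up name, Python 3 is assumed, and the test touches only whatever memory the kernel hands to the process. It is no substitute for memtest86+, and a clean run proves little.

```python
def check_memory(megabytes, patterns=(0x55, 0xAA, 0x00, 0xFF)):
    """Fill a buffer with each byte pattern, read it back, count mismatches.

    Returns a list of (pattern, mismatched_byte_count) pairs; an empty
    list means every byte read back as written.
    """
    size = megabytes * 1024 * 1024
    bad = []
    for pattern in patterns:
        buf = bytearray([pattern]) * size   # write pass
        matches = buf.count(pattern)        # read-back pass
        if matches != size:
            bad.append((pattern, size - matches))
        del buf                             # release before the next pattern
    return bad
```

For example, `check_memory(512)` exercises 512 MB per pattern. Running several copies at once, sized to approach the free memory seen before past hangs, would come closer to the load described above.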

Anyway, thank you guys for your input!