Solved

System Freezes from time to time, unsure of cause...

Posted on 2004-04-15
Last Modified: 2010-04-20
My system has been freezing up occasionally and I am not sure what the cause is. I checked /var/log/messages and noticed that the last two times it froze and I had to reboot, the same entry appeared in the log, specifically:

Apr 12 04:02:33 a1250 su(pam_unix)[3177]: session opened for user news by (uid=0)
Apr 12 04:02:33 a1250 su(pam_unix)[3177]: session closed for user news

Here are the parts of my system log that show where I had to reboot two times in the last 3 days:



1st Freeze/Reboot:
...
Apr 12 04:01:02 a1250 kernel: smb_retry: successful, new pid=4605, generation=155
Apr 12 04:02:33 a1250 su(pam_unix)[3177]: session opened for user news by (uid=0)
Apr 12 04:02:33 a1250 su(pam_unix)[3177]: session closed for user news
Apr 14 00:45:24 a1250 syslogd 1.4.1: restart.
Apr 14 00:45:24 a1250 syslog: syslogd startup succeeded
Apr 14 00:45:24 a1250 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Apr 14 00:45:24 a1250 kernel: Linux version 2.4.20-30.9 (bhcompile@daffy.perf.redhat.com) (gcc version 3.2.2 20$
Apr 14 00:45:24 a1250 kernel: BIOS-provided physical RAM map:
...

2nd Freeze/Reboot:
...
Apr 14 04:01:02 a1250 kernel: smb_retry: successful, new pid=4619, generation=4  
Apr 14 04:02:30 a1250 su(pam_unix)[5386]: session opened for user news by (uid=0)
Apr 14 04:02:30 a1250 su(pam_unix)[5386]: session closed for user news
Apr 14 17:44:55 a1250 syslogd 1.4.1: restart.
Apr 14 17:44:55 a1250 syslog: syslogd startup succeeded
Apr 14 17:44:55 a1250 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Apr 14 17:44:55 a1250 kernel: Linux version 2.4.20-30.9 (bhcompile@daffy.perf.redhat.com) (gcc version 3.2.2 20$
Apr 14 17:44:55 a1250 kernel: BIOS-provided physical RAM map:
...


If anyone could shed some light on how to troubleshoot this further I would greatly appreciate it. Thanks!
Question by:mistertransistor

LVL 6

Expert Comment

by:karlwilbur
Was your system frozen for 44 hours 43 minutes the first time and 13 hours 42 minutes the second time?

If not, I don't think that:

Apr 14 04:02:30 a1250 su(pam_unix)[5386]: session opened for user news by (uid=0)
Apr 14 04:02:30 a1250 su(pam_unix)[5386]: session closed for user news

is your problem.  But I don't really know for sure.

-karl
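
The arithmetic behind those two intervals can be checked with a short script. This is only a sketch: it assumes the standard syslog timestamp prefix ("Mon DD HH:MM:SS"), supplies the year 2004 itself because syslog does not record one, and treats every "syslogd ... restart." line as a reboot marker. The function names are made up for illustration.

```python
from datetime import datetime

# Log lines copied from the question; only the timestamps matter here.
SAMPLE_LOG = [
    "Apr 12 04:02:33 a1250 su(pam_unix)[3177]: session closed for user news",
    "Apr 14 00:45:24 a1250 syslogd 1.4.1: restart.",
    "Apr 14 04:02:30 a1250 su(pam_unix)[5386]: session closed for user news",
    "Apr 14 17:44:55 a1250 syslogd 1.4.1: restart.",
]

def parse_stamp(line, year=2004):
    """Parse the 15-character syslog timestamp; the year is an assumption."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

def silent_gaps(lines, year=2004):
    """Return the time between each syslogd restart and the entry before it."""
    gaps = []
    previous = None
    for line in lines:
        if "syslogd" in line and "restart" in line and previous is not None:
            gaps.append(parse_stamp(line, year) - parse_stamp(previous, year))
        previous = line
    return gaps

for gap in silent_gaps(SAMPLE_LOG):
    print(gap)
```

A long gap only says that nothing was logged in between; it does not say when in that window the machine actually froze.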

LVL 40

Expert Comment

by:jlevie
Are you running a Usenet news server on this box? That should be the only time that there should be a 'su news' by root. If you aren't running a news server the presence of that operation might indicate a system compromise.

And I agree with karlwilbur in that there seems to be too much of an interval between the 'su news' and the freeze. Most likely something else is at the root of your problem.

What sort of services, if any, does this box provide? Is it running X when lockup occurs?

Author Comment

by:mistertransistor
I do recall it freezing when X was running, but X hasn't run in a couple of weeks so I don't think that's the issue.

Should the:
     Apr 12 04:02:33 a1250 su(pam_unix)[3177]: session opened for user news by (uid=0)
     Apr 12 04:02:33 a1250 su(pam_unix)[3177]: session closed for user news
events be happening on their own? Or could it be an interactive session possibly by an intruder to the system?

LVL 20

Accepted Solution

by:
Gns earned 30 total points
Looks like the news "daily routine" (scheduled by cron). Probably has nothing to do with your problem (as karl & jim suggest).
If you don't run a news server _by intent_ you should simply uninstall any package that has anything to do with providing news... Probably any INN packages.

When the machine freezes up, are there any interesting things written to the text-mode console? Some kernel panic situations would be such that logging to file would be entirely impossible... and at such times the kernel might (as a final panic action) barf some info onto the local console.

-- Glenn

LVL 20

Expert Comment

by:Gns
... Or are the reboots "spontaneous"?

-- Glenn

LVL 40

Expert Comment

by:jlevie
If there are no error messages being logged on the text console and the machine completely freezes such that a control-alt-delete has no effect, the cause is almost certainly a hardware issue, and probably disk related. The first thing I'd do in a case like that is to make sure that the system BIOS is up to date. And if this is a SCSI based system I'd make sure that the SCSI controller had a current BIOS image.

The next thing would be to make sure that the Linux installation is current w/respect to the vendor's security & bug fixes and that there haven't been any "out of band" additions to critical things like the kernel, Glibc, etc.

BTW: What Linux is this?

Author Comment

by:mistertransistor
Red Hat Linux 9, kernel version 2.4.20-30.9

Actually, one thing I have noticed every time it freezes is that the console just shows messed-up graphics. Instead of a frozen screen with the login prompt, it looks like hundreds of lines of multiple colors; it's hard to describe. Basically just messed-up graphics, line after line of random colors and patterns. Kind of like "snow"/static on a TV, but not exactly. It has definitely shown that each of the last couple of times it has frozen. The only thing I can think of that has changed in the last month or so is the addition of a USB2/FireWire card. Is there any way I can see if a specific hardware device failed and caused the system to crash?

Author Comment

by:mistertransistor
I was just running up2date: I ran "up2date --installall --channel=redhat....."
and it was about to start running when the system froze, showing the exact same screen I just described. Should I select an earlier kernel from my GRUB screen and see if that doesn't crash? Then I might have to re-update the kernel...

LVL 40

Expert Comment

by:jlevie
When it locked up and had the messed-up display, were you in X or in a text console? And had you logged in as root, or as an ordinary user and then used su to become root?

I'm not aware of any generic problems with the 2.4.20-30.9 kernel that would explain this, but I have seen odd things, including lockups, with some USB devices. It appears not to matter what kernel is being used, so I don't see it being a driver problem. In this case I'd be suspicious of the newly added USB/FireWire card, or some interaction with the motherboard, USB or graphics card, and the kernel. Since the low-level behaviour of a system is strongly influenced by the BIOS (it micro-programs the chipset), it would be good to verify that you do have the latest BIOS installed.

It might be worthwhile to try removing the USB card and see if the problem persists since there's some evidence to suggest that it might be involved in the problem.

LVL 20

Expert Comment

by:Gns
Could perhaps be a case of "IRQ sharing gone bonkers"... Cards try to share interrupts... and then don't :-).
Look at "cat /proc/interrupts" to see what linux sees:)

-- Glenn

LVL 8

Expert Comment

by:da99rmd
Does anyone know how to turn onboard USB off (so it doesn't take an IRQ when not used), in a situation similar to mistertransistor's?

/Rob

LVL 40

Expert Comment

by:jlevie
If an on-board USB device can be disabled it will be a BIOS setup feature that does that.

LVL 20

Expert Comment

by:Gns
Yup. Only thing you can do with the kernel is to make sure it doesn't contain USB support... Which will not clear the possible IRQ problem. So looking at BIOS is more or less it... Far gone is the time when one "disabled" a feature like that with plier/cutter... Thankfully:-):-).

-- Glenn

Author Comment

by:mistertransistor
I'm going to investigate /proc/interrupts... but FYI, on Saturday I rebooted and chose an earlier kernel version from the GRUB menu and so far so good, no freezing...
Also, is there a simple command to see how long the system has been running?

LVL 40

Expert Comment

by:jlevie
uptime
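
For logging the figure from a script rather than reading it off the terminal, the same number is available in /proc/uptime. A minimal sketch, assuming a Linux /proc filesystem; the helper names are made up for illustration:

```python
import os
from datetime import timedelta

def parse_proc_uptime(text):
    """Return seconds since boot from the contents of /proc/uptime."""
    # The file holds two numbers; the first is seconds since boot.
    return float(text.split()[0])

def format_uptime(seconds):
    """Render a duration in the style '2 days, 3:04:05'."""
    return str(timedelta(seconds=int(seconds)))

# Only read the file where it exists, so the sketch also runs off Linux.
if os.path.exists("/proc/uptime"):
    with open("/proc/uptime") as handle:
        print("up", format_uptime(parse_proc_uptime(handle.read())))
```

Run from cron, something like this would leave a record of how long the box stayed up before each freeze.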

LVL 8

Expert Comment

by:da99rmd
Check out these pages:
http://www-106.ibm.com/developerworks/library/l-hw1/
http://www-106.ibm.com/developerworks/library/l-hw2/

I have the same problem. I have narrowed it down to CPU overheating, I think anyway :)

/Rob

Author Comment

by:mistertransistor
Well, it's been two days with no freezing since I chose an earlier kernel version from the GRUB menu. When I get a chance I will run those stress tests from IBM. Seeing that my computer froze the last time I tried to run up2date, I have a feeling there might be a chance of overheating, at least from what the IBM article says.

BTW here is my /proc/interrupts:

           CPU0
  0:   23535317          XT-PIC  timer
  1:         20          XT-PIC  keyboard
  2:          0          XT-PIC  cascade
  5:         33          XT-PIC  usb-uhci, usb-uhci, ehci-hcd, ohci1394
  8:          1          XT-PIC  rtc
  9:        143          XT-PIC  ohci1394
 10:    2196935          XT-PIC  usb-ohci, eth0
 11:          0          XT-PIC  usb-ohci
 12:         66          XT-PIC  PS/2 Mouse
 14:     133638          XT-PIC  ide0
 15:         44          XT-PIC  ide1
NMI:          0
ERR:          0


Is it OK that device 2 and device 11 are sharing IRQ 0? Or is 0 an exception when dealing with IRQ values...

LVL 20

Expert Comment

by:Gns
Exception: Yup, sort of... "synthetic" timer tick events...

Very limited sharing going on (IRQ 10 shared by usb-ohci and the NIC (eth0) is perhaps the only one that stands out...) ... Saying anything else about it isn't possible; one would need to look hard at the involved drivers/HW components.

Overheating is looking a bit more likely.

-- Glenn
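
For anyone reading a listing like the one above: the first column is the IRQ number, the second is the interrupt count for CPU0, and the last is the comma-separated list of drivers registered on that line. A rough sketch that picks out the lines with more than one driver, assuming the single-CPU layout shown in this thread (the function name is made up, and the sample is trimmed from the posted output):

```python
SAMPLE = """\
           CPU0
  0:   23535317          XT-PIC  timer
  5:         33          XT-PIC  usb-uhci, usb-uhci, ehci-hcd, ohci1394
  9:        143          XT-PIC  ohci1394
 10:    2196935          XT-PIC  usb-ohci, eth0
 11:          0          XT-PIC  usb-ohci
 12:         66          XT-PIC  PS/2 Mouse
NMI:          0
"""

def shared_irqs(text):
    """Return {irq: [drivers]} for IRQ lines that list more than one driver."""
    shared = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep or not label.strip().isdigit():
            continue  # skip the CPU header and the NMI/ERR summary lines
        # Assumes one CPU column: count, controller, then the driver list.
        fields = rest.split(None, 2)
        if len(fields) < 3:
            continue
        drivers = [name.strip() for name in fields[2].split(",")]
        if len(drivers) > 1:
            shared[int(label.strip())] = drivers
    return shared

for irq, drivers in shared_irqs(SAMPLE).items():
    print(irq, drivers)
```

On the posted output this flags IRQ 5 (the USB and FireWire controllers together) as well as IRQ 10 (usb-ohci with eth0). Sharing by itself is normal on PCI; it only narrows down which drivers are worth a closer look.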

LVL 8

Expert Comment

by:da99rmd
I had the same sharing with usb-ohci on my computer, but I simply ran
rmmod usb-ohci, because I don't use any USB devices on my computer. I disabled USB in the BIOS first.

/Rob

LVL 20

Expert Comment

by:Gns
Cf. the previous comments above, Rob :-).
Exactly the point made there...

-- Glenn

Author Comment

by:mistertransistor
If I really want to single out the problem, I guess I could disable USB, go back to the latest kernel, and wait to see if any freezing ensues. If it still freezes after that, then I guess it's the CPU overheating. I'm going to run the stress tests tonight; I think I should do that first before trying to disable USB. My only problem with disabling USB is that I have a Belkin FireWire/USB 2 combo card and an external Maxtor FireWire drive. I'm wondering if I can disable the USB on that card without disabling the FireWire side.

I'll let you know what happens. Thanks!

-Mr. T

LVL 40

Expert Comment

by:jlevie
Dunno, you could try booting with nousb appended to the kernel line and see if you can still access the drive.
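
After rebooting, it may be worth confirming that the option really reached the kernel before drawing conclusions from the test. A small sketch that checks /proc/cmdline; the helper name is made up for illustration:

```python
import os

def has_boot_option(cmdline, option):
    """True if `option` appears as a separate word on the kernel command line."""
    return option in cmdline.split()

# /proc/cmdline holds the command line the running kernel was booted with.
if os.path.exists("/proc/cmdline"):
    with open("/proc/cmdline") as handle:
        print("nousb active:", has_boot_option(handle.read(), "nousb"))
```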