Mark

Linux - inexplicable computer crash, any idea why?

Our office webserver is running Slackware 13.37.0, kernel 2.6.37.6-smp. Five times this year this computer has unexpectedly crashed; most recently, twice in the past 3 days. When I check the console, all I see is output like that in the attached image. I'm not sure what this is telling me, if anything. It looks sort of dmesg-y, but a normal dmesg doesn't have any [ 251.xxxx] in what I assume is the log level prefix. Nor does dmesg show the second set of bracketed codes, e.g. "[<f8a8e118>]". I don't know if this output indicates a failed reboot or if it is some kind of diagnostic dump while running, ending in a system halt.

Does this mean anything to anyone? I'm trying to figure out what could be the cause before I start randomly replacing components.

Hitting the reset button successfully restarts the computer, albeit with fsck runs because the drive was not cleanly unmounted.
2015-05-17X21.56.22.jpg
Member_2_406981

Your screenshot looks like the call trace of a kernel panic. The first numbers (251.xxxxx) are timestamps, in seconds since boot.
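
If a trace like this is typed up or otherwise captured as text, its frames can be summarized mechanically. The following is only a sketch: it assumes the usual console layout of `[timestamp] [<address>] symbol+0xoffset/0xsize [module]`, where the bracketed hex values such as [<f8a8e118>] are the code addresses of each frame. The function name `trace_symbols` and the symbol and module names in the usage line are made up for illustration; they are not taken from the screenshot.

```shell
# trace_symbols: read kernel call-trace lines on stdin and print
# "symbol module" for each frame ("-" when no module is shown,
# i.e. the code is built into the kernel image).
# Lines that do not look like a stack frame are ignored.
trace_symbols() {
    sed -n -E 's/^\[ *[0-9]+\.[0-9]+\] *\[<[0-9a-f]+>\] (\? )?([A-Za-z0-9_.]+)\+0x[0-9a-f]+\/0x[0-9a-f]+( \[([^]]+)\])?.*$/\2 \4/p' |
    awk '{ print $1, ($2 == "" ? "-" : $2) }'
}

# Usage with a hypothetical frame (symbol and module are invented):
#   printf '%s\n' '[  251.000000]  [<f8a8e118>] ? example_handler+0x10/0x40 [examplemod]' | trace_symbols
#   prints: example_handler examplemod
```

A module name that shows up in most frames is a reasonable first suspect, although it does not by itself distinguish a driver bug from failing hardware.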

At first I would try to run a memory test on that machine to check if the RAM is okay.
It looks like a kernel panic. After you reboot, you should check your logs to see if anything interesting was logged prior to the panic:
# cd /var/log/
# cat messages | less

The mention of an IRQ issue in the screenshot might suggest driver issues. Did you recently upgrade or update something on the system? Is this a physical or a virtual system?
ASKER CERTIFIED SOLUTION
Gerwin Jansen

Another question:

Is the system up to date, with all security updates installed, especially kernel updates?

Recently there were some kernel flaws that could be used to crash the machine remotely.
I had the same problem and it was a defective NIC. After replacing the NIC, everything went back to normal.
Mark

ASKER

spravtek:
... after you reboot you should check your logs to see if anything interesting is logged prior to the panic
Nothing in /var/log/messages. There is a big gap from 18:51 (about when the computer crashed) and 22:19 when I restarted.
The mention of an IRQ issue in the screenshot might suggest driver issues. Did you recently upgrade or update something on the system?
No changes to the system for quite some time. I tend not to do upgrades except to specific things like Apache or Java -- for me, doing so has caused problems in the past. Rather, I generally upgrade the whole system periodically -- which it's probably past time to do on this one.
Is this a physical or a virtual system?
Physical.
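
A silent stretch in the log, like the 18:51 to 22:19 gap mentioned above, can also be located automatically, which helps when lining up earlier crashes. This is a rough sketch only: it assumes the traditional syslog line format ("May 17 18:51:02 host ..."), compares timestamps only within the same month, and uses a made-up function name `log_gaps`.

```shell
# log_gaps FILE [MIN_SECONDS]
# Print every pair of consecutive log lines whose timestamps are at
# least MIN_SECONDS apart (default: 3600).
# Assumption: lines start with "Mon DD HH:MM:SS"; other lines are skipped,
# and gaps that span a month boundary are not reported.
log_gaps() {
    awk -v min="${2:-3600}" '
    $3 !~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ { next }
    {
        split($3, hms, ":")
        t = $2 * 86400 + hms[1] * 3600 + hms[2] * 60 + hms[3]
        stamp = $1 " " $2 " " $3
        if (seen && $1 == prev_month && t - prev_t >= min)
            printf "%d s gap: %s -> %s\n", t - prev_t, prev_stamp, stamp
        seen = 1; prev_month = $1; prev_t = t; prev_stamp = stamp
    }' "$1"
}

# Usage: report silences of ten minutes or more
#   log_gaps /var/log/messages 600
```

The last line before each reported gap is the place to look for anything the kernel managed to write before it stopped.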

Gerwin Jansen:
rtl8139 is the network adapter; it could be defective. I'd try replacing it. Do you know whether the same lines, with the rtl8139 device, showed up in previous crashes?
I did take a photo of the crash 2 days before and yes, the lines are basically identical except for different timestamp values. There are 2 NICs in this machine, one built into the motherboard.
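
With two NICs in the box, it helps to confirm which interface is bound to which driver before pulling a card. The sketch below reads that from sysfs; the function name `nic_drivers` and the optional alternate-root argument are illustrative assumptions. On mainline kernels an RTL8139-based card normally shows up under the `8139too` module, but check what your system actually reports.

```shell
# nic_drivers [SYSFS_ROOT]
# Print "interface driver" for every network interface that is backed
# by a device with a bound driver. SYSFS_ROOT defaults to /sys and is
# only a parameter so the function can be tried against a copy.
# The body runs in a subshell so its variables stay local.
nic_drivers() (
    root="${1:-/sys}"
    for dev in "$root"/class/net/*; do
        # Virtual interfaces such as lo have no device/driver link.
        [ -L "$dev/device/driver" ] || continue
        drv=$(readlink "$dev/device/driver")
        printf '%s %s\n' "${dev##*/}" "${drv##*/}"
    done
)

# Usage:
#   nic_drivers
```

The interface whose driver matches the module named in the call trace is the card to replace first.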

andreas:
Another question:
Is the system up to date, with all security updates installed, especially kernel updates?

Recently there were some kernel flaws that could be used to crash the machine remotely.
As mentioned above, no new updates.

matrix:
I had the same problem and it was a defective NIC. After replacing the NIC, everything went back to normal.
I'll try replacing the other card today.

andreas:
At first I would try to run a memory test on that machine to check if the RAM is okay.
I'll run a RAM test when I replace the NIC.
andreas

Then install the kernel updates; there were several flaws that can be used to crash the machine remotely. The panic related to the NIC interface could also indicate exploitation of the NIC driver by unusually formatted packets.

And do a memtest and let it run 2-3 passes.
Even if updates cause issues from time to time, it's not good practice to delay them too long. It's better to apply them on a test system first and, if testing goes smoothly, deploy them on the public server.

Another way is to virtualize the public-facing server; then it's easy-peasy to roll back a defective update session and afterwards install only the patches that are OK.
Mark

ASKER

andreas:
install the kernel updates; there were several flaws that can be used to crash the machine remotely.
I'll probably first install the NIC (in about an hour) and let that go for a while. I hate to change more than one thing at a time.

Not long ago, I ventured outside my comfort zone and did try updating a live Samba4 DC/AD system (`slackpkg upgrade-all`). After the upgrade, samba-tool gave me the error "ldb: module version mismatch in ../source4/dsdb/samdb/ldb_modules/acl.c :   ldb_version=1.1.16 module_version=1.1.17", indicating the ldb module was left out of the upgrade. I ended up having to rebuild the entire system from scratch. Hence my reluctance to upgrade OS-related elements (like I said, I'll upgrade Apache, Tomcat ... things like that: tools/utilities). Since I normally try to keep fairly up to date on releases, I figure things will work well enough until the next release update -- which is a new install on new hardware. After all, this isn't Windows! However, I *may* try the kernel updates as you suggest. I can always restore the previous image.
Another way is to virtualize the public-facing server; then it's easy-peasy to roll back a defective update session and afterwards install only the patches that are OK.
To seal my conservative curmudgeon credentials, I've not messed with virtualization. I'd love to see a good article on why this is generally useful for Linux other than providing hosting service for multiple clients in order to give each their own virtual machine. Short of that, what does it buy me (you don't have to answer that -- a whole 'nuther topic).

I'll post back when I've changed out the network card, though I won't know anything definitive if it doesn't fail for a while.
Mark

ASKER

Replacing the NIC appears to have done the trick.