Linux - inexplicable computer crash, any idea why?

Our office webserver is running Slackware 13.37.0, kernel 2.6.37.6-smp. Five times this year this computer has unexpectedly crashed; most recently, twice in the past 3 days. When I check the console, all I see is output as in the attached image. Not sure what this is telling me, if anything. It looks sort-of dmesg-y, but a normal dmesg doesn't have any [ 251.xxxx] in what I assume is the log level prefix. Nor does dmesg show the 2nd set of bracketed codes, e.g. "[<f8a8e118>]". I don't know if this output indicates a failed reboot or if it is some kind of diagnostic dump while running, ending in a system halt.

Does this mean anything to anyone? I'm trying to figure out what could be the cause before I start randomly replacing components.

Hitting the reset button successfully restarts the computer, albeit with fsck runs because the drive was not properly unmounted.
2015-05-17X21.56.22.jpg
jmarkfoleyAsked:

andreasSystem AdminCommented:
Your screenshot looks like the call trace of a kernel panic. The first numbers (251.xxxxx) are timestamps, in seconds since boot.

First, I would run a memory test on that machine to check whether the RAM is okay.
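
If the trace can be captured as text rather than as a photo (an assumption; the console output in the question exists only as an image), a rough way to see which loadable module the trace implicates is to count the trailing `[module]` tags on the call-trace lines. The sketch below assumes the usual `[<address>] symbol+offset/size [module]` line layout; `trace_modules` is a hypothetical helper name, not an existing tool.

```shell
# Hypothetical helper: count how often each module tag appears in a saved call trace.
# Assumes lines shaped like: [timestamp] [<address>] symbol+off/size [module]
trace_modules() {
    # Keep only lines that have an address and end in "[module]", print the module name,
    # then tally the names with the most frequent first.
    sed -n 's/.*\[<[0-9a-f]*>\].*\[\([A-Za-z0-9_]*\)\][[:space:]]*$/\1/p' "$@" |
        sort | uniq -c | sort -rn
}
```

Lines for symbols built into the kernel carry no module tag and are skipped, so the output only hints at which driver is involved.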
Zephyr ICTCloud ArchitectCommented:
It looks like a kernel panic. After you reboot you should check your logs to see if anything interesting is logged prior to the panic:
# cd /var/log/
# cat messages | less

The mention of an IRQ issue in the screenshot might suggest driver issues. Did you recently upgrade or update something on the system? Is this a physical or a virtual system?
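
As a rough sketch of the log check suggested above, the function below searches one or more log files for lines that commonly accompany a panic. `scan_kernel_log` is a hypothetical helper name and the keyword list is an assumption, not an exhaustive set; adjust both to the system at hand.

```shell
# Hypothetical helper: print panic-related lines from the given log files.
scan_kernel_log() {
    # -i: ignore case, -H: prefix each hit with the file name,
    # -n: show line numbers, -E: extended regular expression.
    # The keyword list is an assumption; note that "BUG:" also matches "debug:".
    grep -i -H -n -E 'oops|panic|BUG:|irq|call trace' "$@"
}
```

For example, `scan_kernel_log /var/log/messages /var/log/syslog | less`, assuming both files exist on the system. A hard panic may leave nothing in either file, since the kernel can halt before syslog writes anything to disk.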
Gerwin Jansen, EE MVETopic Advisor Commented:
rtl8139 is the network adapter, which could be defective. I'd try replacing it. Do you know whether the same lines (the ones mentioning the rtl8139 device, I mean) were in the log for the previous crashes?
andreasSystem AdminCommented:
Another question:

Is the system up to date with all security updates installed, especially kernel updates?

Recently there were some kernel flaws that could be used to crash the machine remotely.
matrix8086Commented:
I had the same problem and it was a defective NIC. After replacing the NIC, everything went back to normal.
jmarkfoleyAuthor Commented:
spravtek:
... after you reboot you should check your logs to see if anything interesting is logged prior to the panic
Nothing in /var/log/messages. There is a big gap between 18:51 (about when the computer crashed) and 22:19 when I restarted.
The mention of an IRQ issue in the screenshot might suggest driver issues. Did you recently upgrade or update something on the system?
No changes to system for quite some time. I tend to not do upgrades except to specific things like Apache or Java -- for me, doing so has caused problems in the past. Rather, I generally upgrade the whole system periodically -- which it's probably past time to do on this one.
Is this a physical or a virtual system?
Physical.

Gerwin Jansen:
rtl8139 is the network adapter, which could be defective. I'd try replacing it. Do you know whether the same lines (the ones mentioning the rtl8139 device, I mean) were in the log for the previous crashes?
I did take a photo of the crash 2 days before and yes, the lines are basically identical except for different timestamp values. There are 2 NICs in this machine, one built into the motherboard.
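
With two NICs installed, it may help to confirm which interface is bound to which driver before swapping hardware. A minimal sketch, assuming sysfs is mounted at /sys and that each physical interface exposes a `device/driver` link there; `nic_drivers` is a hypothetical helper, and its optional argument exists only so it can be pointed at a test directory. On mainline kernels the `rtl8139_*` functions belong to the `8139too` driver, so that is likely the name to look for.

```shell
# Hypothetical helper: list each network interface with the driver bound to it.
# The optional argument (default /sys/class/net) is an assumption added for testability.
nic_drivers() {
    netdir=${1:-/sys/class/net}
    for dev in "$netdir"/*; do
        link="$dev/device/driver"
        # Virtual interfaces such as lo have no device/driver link; skip them.
        [ -L "$link" ] || continue
        printf '%s %s\n' "$(basename "$dev")" "$(basename "$(readlink "$link")")"
    done
}
```

Running `nic_drivers` with no argument prints one `interface driver` pair per line for the physical interfaces it finds.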

andreas:
Another question:
Is the system up to date with all security updates installed, especially kernel updates?

Recently there were some kernel flaws that could be used to crash the machine remotely.
As mentioned above, no new updates.

matrix:
I had the same problem and it was a defective NIC. After replacing the NIC, everything went back to normal.
I'll try replacing the other card today.

andreas:
First, I would run a memory test on that machine to check whether the RAM is okay.
I'll run a RAM test when I replace the NIC.
andreasSystem AdminCommented:
Then install the kernel updates; there were several flaws that can be used to crash the machine remotely. The panic related to the NIC could also indicate exploitation of the NIC driver by unusually formatted packets.

And do a memtest and let it run 2-3 passes.
andreasSystem AdminCommented:
Even if updates cause issues from time to time, it's not good practice to delay them too long. It is better to apply them on a test system and, if testing goes smoothly, deploy them on the public server.

Another way is to virtualize the public-facing server; then it's easy peasy to roll back a defective update session and install only the patches that are OK afterwards.
jmarkfoleyAuthor Commented:
andreas:
install the kernel updates; there were several flaws that can be used to crash the machine remotely.
I'll probably first install the NIC (in about an hour) and let that go for a while. I hate to change more than one thing at a time.

Not long ago, I ventured outside my comfort zone and did try updating a live Samba4 DC/AD system (`slackpkg upgrade-all`). After the upgrade, samba-tool gave me the error "ldb: module version mismatch in ../source4/dsdb/samdb/ldb_modules/acl.c :   ldb_version=1.1.16 module_version=1.1.17" indicating the ldb module was left out of the upgrade. I ended up having to rebuild the entire system from scratch. Hence my reluctance to upgrade OS-related elements (like I said, I'll upgrade Apache, tomcat ... things like that: tools/utilities). Since I normally try to keep fairly up-to-date on releases I figure things will work well enough until the next release update -- which is a new install on new hardware. After all, this isn't Windows! However, I *may* try the kernel updates as you suggest. I can always restore the previous image.
Another way is to virtualize the public-facing server; then it's easy peasy to roll back a defective update session and install only the patches that are OK afterwards.
To seal my conservative curmudgeon credentials, I've not messed with virtualization. I'd love to see a good article on why this is generally useful for Linux other than providing hosting service for multiple clients in order to give each their own virtual machine. Short of that, what does it buy me (you don't have to answer that -- a whole 'nuther topic).

I'll post back when I've changed out the network card, though I won't know anything definitive if it doesn't fail for a while.
jmarkfoleyAuthor Commented:
Replacing the NIC appears to have done the trick.