tanel
asked on
Server hangs
Hello,
After 4 months of trouble-free operation, my server running openSUSE 11.4 has started to freeze. When I try to log in to it from another server on the same network, I get only the following message:
Last login: Sat Sep 10 23:17:38 2011 from blablabla
Have a lot of fun...
and nothing else. From my home PC I can't connect to it at all, only from the neighbouring server.
I have 16 GB of RAM and don't believe that swap is filling up. I also have a cron job that runs every 2 minutes and kills processes that are using more than 90% of CPU, so it shouldn't be an overload issue either.
Could somebody explain this server behaviour and how to find the cause? (There are no errors in the logs.)
ASKER
I have another server where the autokill cron script is set to 85% of CPU, and everything is fine there.
And as I mentioned before, the server was stable for almost 5 months, and the script only kills processes with a specific name (top -n 1 -b | grep "hlds_i686"), so no system process can be killed.
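A minimal sketch of the kind of filter described here, not the actual cron script (which is not shown in this thread). It assumes the default `top` batch-mode column layout, with the PID in column 1, %CPU in column 9 and the command name in column 12. The function name `hot_pids` is hypothetical. Unlike `grep "hlds_i686"`, which matches the string anywhere in the line, this compares the command column exactly.

```shell
# Print the PIDs of processes whose command name equals NAME and whose
# %CPU is above LIMIT. Reads `top -b -n 1` output on stdin.
# Assumption: default top columns (PID=$1, %CPU=$9, COMMAND=$12) and a
# locale that prints %CPU with a dot as the decimal separator.
hot_pids() {
    name=$1
    limit=$2
    awk -v name="$name" -v limit="$limit" \
        '$12 == name && $9 + 0 > limit + 0 { print $1 }'
}

# Hypothetical use from a cron script (the kill step is left commented out):
# top -b -n 1 | hot_pids hlds_i686 90 | while read -r pid; do kill "$pid"; done
```

Matching on the command column means the header line, the summary lines and any other process are skipped, whatever their CPU usage.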
the script kills the processes just with specific name (top -n 1 -b | grep "hlds_i686")
You didn't say that at first, you said "those processes that are using more than 90% of cpu".
Are you able to login via console? Is the problem happening to all users? Have you tried logging in from more than 1 other server?
What exactly do you mean by it's starting to freeze? Is the only symptom here that you are having trouble logging in remotely or is there something else that makes you think the system is hanging?
It could well be an issue with Half-Life. Can you post your debug.log, please?
ASKER
Thanks for your replies!
Every time, I have to go to the datacentre for a hard reboot, since I can't even send a command to the shell directly: it asks for the password, accepts it, and then nothing appears (for all system users). There are no strange login messages, and the drive space is OK.
There are several hlds instances running, and it's hard to find the right debug log. Also, if something gets written to the debug log, the system reports it in "messages" ("segmentation fault..."), BUT the last log entries are always normal.
I have an SSD (with TRIM), 16 GB of RAM and 2 GB of swap installed. Is it possible that some process could eat all the RAM and swap in 2 hours? I don't even know whether it's a kernel/OS issue or a hardware issue. The first thing I'm going to do now is switch to the default SUSE kernel.
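One way to answer the RAM/swap question after the fact is to record memory use at regular intervals, so the last entries before a freeze survive the hard reboot. A rough sketch, assuming the standard `/proc/meminfo` field names. The function name `mem_summary` and the log path in the usage comment are hypothetical.

```shell
# Summarise RAM and swap use from /proc/meminfo-style text on stdin.
# Note: MemTotal - MemFree counts buffers and page cache as "used",
# so watch the trend over time rather than the absolute figure.
mem_summary() {
    awk '
        $1 == "MemTotal:"  { mt = $2 }
        $1 == "MemFree:"   { mf = $2 }
        $1 == "SwapTotal:" { st = $2 }
        $1 == "SwapFree:"  { sf = $2 }
        END { printf "mem_used_kb=%d swap_used_kb=%d\n", mt - mf, st - sf }
    '
}

# Hypothetical use inside a script run from cron every minute:
# echo "$(date '+%Y-%m-%d %H:%M:%S') $(mem_summary < /proc/meminfo)" >> /var/log/mem_watch.log
```

If swap use climbs steadily towards 2 GB in the entries just before a hang, memory exhaustion is a plausible cause. If it stays flat, the cause is more likely elsewhere.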
ASKER
I've requested that this question be deleted for the following reason:
have to test
I don't understand why this is being deleted in order to "test".
ASKER
Hello,
After some hours of detailed log investigation I have found the CAUSE of freezing:
Sep 11 12:11:52 cs kernel: [ 802.825331] [drm:pch_irq_handler] *ERROR* PCH poison interrupt
Any thoughts ?
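For anyone repeating this kind of search, a small sketch of pulling kernel error lines like the one above out of a syslog file. It assumes the log is plain text in the usual syslog format and that the kernel entries carry the `*ERROR*` marker seen here. The function name `kernel_errors` is hypothetical, and `/var/log/messages` is assumed as the log location.

```shell
# Keep only kernel log lines that carry the literal *ERROR* marker.
# -F makes grep treat the pattern as a fixed string, so the asterisks
# are matched literally instead of being read as regex operators.
kernel_errors() {
    grep -F 'kernel:' | grep -F '*ERROR*'
}

# Hypothetical use (log path is an assumption):
# kernel_errors < /var/log/messages | tail -n 20
```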
ASKER
I googled it, and it appears to be an issue with the 2.6.38 kernel that I had been using.
If someone has more information please let me know.
ASKER
Fixed by myself.
That is not a wise thing to do, and is probably the source of your problem. Have you tried commenting out that cron job and rebooting to start fresh? Who knows what that cron job killed that shouldn't have been...