marko020397

Strange problem with Redhat 7.3

I have installed RedHat 7.3 Server to replace the old RedHat 6.2. The replacement went well and the new 7.3 worked fine for about a month. Then it started to act strangely.

It crashes randomly. When the system crashes I can ping it, but no other traffic goes through. Apache, sendmail, ssh, and bind don't respond. It must be rebooted. The console also acts strangely: no one can log in or see what is wrong.
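For a remote box that still answers ping, one way to watch for this state is to wait for a service greeting rather than just opening a TCP connection, since the kernel may keep accepting connections while the daemons are stuck. This is only a minimal sketch: the function name `read_banner`, the host name in the usage note, and the timeouts are assumptions, and it only suits services that speak first (ssh and sendmail do, a web server does not).

```python
import socket


def read_banner(host, port, timeout_s=5.0):
    """Return the first bytes a service sends after connect.

    Returns None if the connection fails or nothing arrives in time.
    A stuck server often still completes the TCP handshake, so the
    interesting case is "connected, but no greeting".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.settimeout(timeout_s)
            data = conn.recv(256)
            return data or None
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return None
```

Calling `read_banner("server.example.com", 22)` (a placeholder host) from a second machine every few minutes would show roughly when the services stopped answering, even though nothing reaches the log files on the server itself.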

I have looked at the log files and there is nothing I could relate to a crash. When the system crashes it stops writing anything to the log files.

The system runs Apache with mod_ssl and mod_php, sendmail, and other standard programs included in the RedHat distribution.

The only programs that are somewhat suspicious (not from the RedHat distribution) are an e-store program and RealMedia Server.

When the system crashed for the second time I moved the disks from the server into another computer to rule out a hardware problem. The first computer had a Duron processor; the second one has an Athlon. It worked for nearly a month and has now crashed twice in three days.

I have installed the newest patches with the up2date utility from RedHat.

The computer has RealNetworks network cards with the "dmfe" driver. These cards with the same driver work fine in many other 6.2 and 7.2 servers.

Where should I look for the error? Should I install RedHat 7.2, which has worked fine on another Intel server for some time without any problems? This is the first time I have used AMD. Is it possible that the problem is in the processor? Maybe I should switch to Intel.

I have searched the internet to see whether anyone has the same problem. I have found nothing.

Please help.
ASKER CERTIFIED SOLUTION
Zoplax

jlevie

It could be a disk problem, as Zoplax pointed out, but I'd bet on it being something else.

Since nothing is being logged to files as the box dies, it's likely to be a problem that takes down part of the kernel or its services. Are you running a GUI login? If so turn that off by going to run level 3 or commenting out the prefdm line in inittab. If the kernel is dying it will probably write some error messages to the console, which the GUI login will hide and keep you from seeing. Not running a GUI login will let you see those errors.
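To check the inittab settings mentioned above without reading the file by eye, a small parser can report the default runlevel and whether a prefdm line is still active. This is a sketch that assumes the usual `id:runlevels:action:process` layout of /etc/inittab; the function name `inspect_inittab` and the sample text are made up for illustration.

```python
def inspect_inittab(text):
    """Return (default_runlevel, prefdm_active) for inittab-style text."""
    default_runlevel = None
    prefdm_active = False
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and lines that are commented out.
        if not line or line.startswith("#"):
            continue
        fields = line.split(":", 3)
        if len(fields) != 4:
            continue
        _entry_id, runlevels, action, process = fields
        if action == "initdefault":
            default_runlevel = runlevels
        elif "prefdm" in process:
            prefdm_active = True
    return default_runlevel, prefdm_active


# Hypothetical sample: runlevel 3 with the GUI login line commented out.
SAMPLE = """\
id:3:initdefault:
# x:5:respawn:/etc/X11/prefdm -nodaemon
"""
```

On the real box the text would come from `open("/etc/inittab").read()`; a result of `("3", False)` means the console is not hidden behind a graphical login.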
It worked fine for about a month and then began to act strangely?
When you swapped disks did you reinstall or just slot them into place?
Did you change ANYTHING during that time? Software install, hardware, tweak your system...
Certainly a console login will let you see any messages as they appear, however they should get written to a logfile if they are making it to the console...
Kernel is still up coz you are getting a response to ping. Are you able to do a soft reboot or do you have to do a hard reboot? (ctrl-alt-delete or big red switch <g>)
What kernel version are you using? Also, more hardware info would be useful (fault diagnosis is a pain, you have to go through a wee tree for everything <g>). It is beginning to sound like either a problem with the disk, except you haven't mentioned any fsck errors, or a software problem, which sounds more likely. Sounds like something is killing userspace progs rather than your kernel dying (you are still getting a ping response so something is still alive <g>). Does your keyboard still respond? (Caps lock key etc)
marko020397

ASKER

The server is in a distant location. I have been there only twice when a crash occurred.

There was nothing written on the console about any error. I could type on the console but I couldn't log in, and I think Ctrl-Alt-Del didn't work either. The kernel version is the one included in the RedHat distribution. I am not sure about the version number.

When I swapped disks I just slotted them into place.

There is nothing special about hardware. An Athlon processor, floppy, CD-ROM, two network cards and two disks.

There is no X windows installed on the system. It boots up in runlevel 3.
There have been two kernel updates since 7.3 was released (current is 2.4.18-5). With no other hard clues to suggest a cause I'd suggest updating the system for the latest kernel and Apache, etc.
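To see whether a box is behind the kernel release named above, the running release string can be compared with the target. This is a rough sketch: the helper names are invented here, and it assumes RedHat-style release strings of the form `major.minor.patch-build` with an optional suffix such as `smp`.

```python
import platform
import re


def parse_release(release):
    """Turn a string like '2.4.18-5smp' into the tuple (2, 4, 18, 5)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return tuple(int(part) for part in match.groups())


def is_older(running, target):
    """True if the running release sorts before the target release."""
    return parse_release(running) < parse_release(target)


def running_release():
    """Release string of the running kernel (the same value `uname -r` prints)."""
    return platform.release()
```

For example, `is_older(running_release(), "2.4.18-5")` answers the "which kernel are you on" question without having to remember the exact number.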
As a curiosity... did you upgrade the 6.2 box or do a fresh install? I know upgrading sounds easier, but I have found (on the same piece of hardware) that upgrades tend to show more flaky behavior than complete installs. Since this is Linux, save your .conf files and any other bits you can't live without and do a fresh install, re-formatting the / partition. If the box is all one partition, you may want to rebuild it (saving the data to tape or another box), separating / from /home at the very least.
I installed RedHat 7.3 on a new machine. Then I transferred web sites, mailboxes,... When the new machine's functionality was identical I unplugged the old one and switched in the new one.

I have taken care of software updates and the server has the newest kernel, apache,...
I guess we'll have to wait and see if the problem recurs. I don't know of any problems inherent in 7.3 that would cause your problem and I do have a number of 7.3 boxes doing DNS, mail, and web that have uptimes much in excess of a month. I've not experienced any problems with those boxes, which I religiously keep up to date.

Since it might matter in this case I'll have to admit that I don't use the Internet server packages that RedHat ships. I always build my own copies of Bind, Sendmail, Cyrus/UoW IMAP, Apache, PHP, & Postgres or MySQL. That's partly because I want to be able to use the current version and partly because I want to customize the build for the environment they serve. So I can't say if the distributed copies of Apache or whatever could be part of the problem.
I think I may have found the solution. It is the AMD Athlon/Duron bug. Read about it here:

http://www.bestcomputerbuilders.com/stories.htm

I have added the "mem=nopentium" parameter to the kernel. Now all I can do is wait and see if the server stays alive.
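After the reboot it is worth confirming that the kernel really was started with the parameter. A minimal sketch, assuming a Linux /proc filesystem; the helper names are made up for illustration.

```python
def has_boot_param(cmdline, param):
    """True if `param` appears as a whole word on the kernel command line."""
    return param in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Return the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read()
```

`has_boot_param(read_cmdline(), "mem=nopentium")` should return True if the boot loader entry was edited correctly and the box was rebooted into it.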

By the way. There is nothing wrong with the disk. I have checked it.
The computer crashed again. I have now put the disks in a completely new Intel machine and of course installed the Intel kernel instead of the AMD-optimized one. It has now worked for a week. I'll wait and see.
The Intel processor didn't help, as I predicted, though I was hoping my prediction was wrong.

I came to the conclusion that the disk is the problem although it has no bad sectors. I believe it must have some other, non-surface problem.

Indications:
- ping works because it doesn't need anything written on the disk
- nothing is in log files because disk stopped working
- all other services which need disk access stopped working
- when I tried to log in on the console I could type the username, then everything stopped when the server tried to check the username on disk
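The indications above suggest a simple test: try a small write to the suspect disk and see whether it finishes within a deadline. The sketch below is only an illustration of that idea, not a diagnostic tool. The function name and timeout are assumptions, and on a truly hung disk the worker thread may stay blocked indefinitely.

```python
import os
import tempfile
import threading


def probe_disk_write(directory, timeout_s=5.0):
    """Return True if a small write plus fsync in `directory` finishes in time."""
    done = threading.Event()

    def worker():
        try:
            fd, path = tempfile.mkstemp(dir=directory, prefix="probe-")
            try:
                os.write(fd, b"probe\n")
                os.fsync(fd)  # force the data out to the device
            finally:
                os.close(fd)
                os.unlink(path)
            done.set()
        except OSError:
            # An I/O error counts as a failed probe; `done` stays unset.
            pass

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    return done.is_set()
```

Because the local logs stop when the disk does, the result would have to be reported somewhere that does not touch that disk, for example printed on the console or sent to another machine.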

The server stopped again and I installed everything on new disks. For now the server works on the new disks. I will be 100% sure I have found the error once it has worked for at least a month.
It turned out to be a disk problem although the disk reports no bad sectors. It must be some other, more hidden disk error.

The server has now run for a month with new disks; all the other hardware is the same.