Ext3 filesystem failed

We have 3 Seagate drives (ST360021A) installed on a RH7.2 system with dual Celeron 533 MHz CPUs. The motherboard has 2 HPT366 UltraDMA/66 controllers.

The filesystem (ext3) keeps failing.
The first time, we used 2 drives in RAID and it failed. Then we put another drive in and copied all the data (some of it was missing or could not be read) to that drive.

After a while (half a month), the filesystem failed again and we lost some data.

Now, for the third time, with one single drive, the filesystem (ext3) has failed again. We have lost important data: when we change (cd) to that directory, it says "Input/output error". Please help!
Please tell me how to get the data back and what the problem might be. Thanks

BTW, we have 2 IBM 47GB drives and have had no problems with them so far.

Leo

ahoffmann

Boot into single user mode, unmount the corrupt filesystem, then run fsck.
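Since fsck repairs can be destructive, it may be worth copying the partition to an image first and running only a read-only check on the original. A rough sketch, assuming there is enough free space on a healthy disk to hold the image; the function names and the device /dev/hdc1 used in the usage note are made up for illustration, so substitute your own:

```shell
#!/bin/sh
# Hypothetical helper functions; device and image paths come from the caller.

# Copy a partition to an image file, continuing past unreadable sectors.
# conv=noerror keeps dd going after a read error; conv=sync pads each
# short or failed read with zeros so later data keeps its offset.
image_partition() {
    src=$1
    dest=$2
    dd if="$src" of="$dest" bs=4k conv=noerror,sync
}

# Report filesystem damage without changing anything on disk:
# -f forces a full check, -n opens the filesystem read-only and
# answers "no" to every repair prompt.
check_readonly() {
    e2fsck -f -n "$1"
}
```

For example: `image_partition /dev/hdc1 /mnt/spare/hdc1.img`, then `check_readonly /dev/hdc1`. Any real repair attempt can then be tried against the image copy rather than the only copy of the data.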

leochan72

ASKER

ahoffmann,

Well, I tried it. However, it gave this message:

"Inode 16419 has illegal block(s). Clear<y>?"

Actually, I ran e2fsck after I unmounted the drive when it first failed. I hit "Enter" at this message, and it erased something in the filesystem and made things even worse.

Any other suggestions? Thanks

Leo

ASKER CERTIFIED SOLUTION

ahoffmann

leochan72

ASKER

I found some messages like "Lserv kernel: APIC error on CPU0 : 04(04)" (or the same on CPU1). Could the dual-CPU system not be working properly and be causing this problem? Do you know what APIC is?

Leo

ahoffmann

quote from /usr/doc/howto/en/SMP-HOWTO.gz (sounds like you have a hardware problem):

....
14.
"APIC error interrupt on CPU#n, should never happen" messages in
logs

A message like:

___________________________________________________________________
APIC error interrupt on CPU#0, should never happen.
... APIC ESR0: 00000002
... APIC ESR1: 00000000
___________________________________________________________________

indicates a 'receive checksum error'. This cannot be caused by Linux
as the APIC message checksumming part is completely in hardware. It
might be marginal hardware. As long as you dont see any instability,
they are not a problem - APIC messages are retried until delivered.
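
To get a feel for how often these messages turn up, something like the following could tally them per CPU. This is only a sketch: it assumes the messages land in /var/log/messages in the "APIC error on CPUn" form quoted earlier in this thread, and the function name is made up.

```shell
#!/bin/sh
# Count "APIC error" lines per CPU in a syslog-style file.
# Defaults to /var/log/messages (an assumption; pass another path if needed).
count_apic_errors() {
    logfile=${1:-/var/log/messages}
    # Keep only the matching lines, reduce each to its CPU name,
    # then count how many times each CPU appears.
    grep 'APIC error on CPU' "$logfile" |
        sed 's/.*APIC error on \(CPU[0-9][0-9]*\).*/\1/' |
        sort | uniq -c
}
```

Running `count_apic_errors /var/log/messages` prints one line per CPU with its count. A count that keeps growing would fit the "marginal hardware" explanation quoted above.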

jlevie

Have all of the failures been related to a particular drive? If so, it might point to a problem with that drive, its cable, or controller. Also keep in mind that failing hardware (motherboard, one or both CPUs, or memory) could write bad data to the drive, thus damaging the file system it was writing to.

FYI, I had a similar problem with a racked server last winter. The problem turned out to be a marginal cooling fan that only became a problem in the coldest weather when the AC system essentially shut down due to low outside temperatures. The system would begin to run a bit on the hot side and eventually crash, writing garbage to some of its mounted file systems in the process. Since the coldest time was in the wee hours of the morning no one was there to see the beginning of the failure and the next morning we'd be presented (some of the time) with a corrupt file system and data loss. Regular (daily) backups are a wonderful thing...

leochan72

ASKER

So, is there no other way to recover the data except asking a data recovery company to do it?

I also know the importance of backups. However, I used the backup drive to replace the first failed RAID drives, and I don't want to format those failed drives, so there are no drives left in our company to back up to. Too bad.

In fact, when I first installed the RAID drives, they may have been placed too close together and run too hot, which made them fail. However, I used the Seagate tool to check those drives and it found nothing wrong with them.

jlevie, what OS was the failed system running? Was it Linux too? Are ext3 and ext2 not as good as FAT32 or NTFS? It looks like nobody has problems with NTFS or FAT32.

I want to keep this open for a while to see whether any other method is suggested.

Leo

jlevie

The server that I was referring to was a w2k box, but it really doesn't matter what OS is being used when you have that kind of hardware problem. If the OS writes garbage to the disk, you wind up with a corrupt file system (be it FAT32, NTFS, UFS, XFS, ext2, etc). The solution, of course, is to eliminate the hardware problem.

On the subject of backups... Reliable backups of a server require a good schedule and history. By that I mean that one does a full backup at some scheduled interval and incrementals on all days between full backups. History is important and I wouldn't want less than two complete cycles (full & incrementals) on hand at all times. Unless you have some mega RAID volume on another system, this sort of backup strategy requires backup to tapes. Tape backup systems are expensive, but well worth the cost when something like this happens. If you can't afford a high capacity tape drive and must back up to disk, be darn sure to make the backup repository reside on a different system and make sure that you have enough disk on that box for at least two full cycles plus the current cycle. A corrupt or missing file might not be noticed immediately, and if you don't have history in your backups that goes back far enough you won't be able to recover. FWIW, on the servers that I manage I keep at least 6 months of backup history (monthly full & daily incrementals). On the really, really important servers I have a year's worth of history. These are all backed up to DLT tapes, with the tapes stored in a fire resistant safe and bi-annual full backups stored off site.
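
For the disk-based case, the full-plus-incrementals cycle could be sketched with GNU tar's `--listed-incremental` option. This is an illustration rather than a description of my own setup: the function names, archive names, and directory layout are invented, and the destination is assumed to be storage that lives on a different system.

```shell
#!/bin/sh
# Hypothetical helpers for one backup cycle using GNU tar.
# $1 = directory to back up, $2 = destination directory for archives.

# Full backup: discard the old snapshot file so tar records every file
# and starts a fresh cycle.
full_backup() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    rm -f "$dest/snapshot.snar"
    tar --create --file="$dest/full.tar" \
        --listed-incremental="$dest/snapshot.snar" -C "$src" .
}

# Incremental backup: reuse the snapshot file so only files changed
# since the previous run are stored. $3 is a label such as a date.
incremental_backup() {
    src=$1
    dest=$2
    label=$3
    tar --create --file="$dest/incr-$label.tar" \
        --listed-incremental="$dest/snapshot.snar" -C "$src" .
}
```

A monthly `full_backup /home /mnt/backuphost/2002-06` followed by a daily `incremental_backup /home /mnt/backuphost/2002-06 day01` (and so on) gives one cycle per directory; keeping at least two older cycle directories around provides the history described above.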

leochan72

ASKER

Thanks, both of you. I believe there will be no more comments or solutions for the data recovery other than third-party recovery.

I am going to close this question and give the points to ahoffmann, as he was the first one who tried to answer my question. Anyway, thanks to jlevie too.