Solved

Ext3 filesystem failed

Posted on 2002-07-17
Last Modified: 2013-12-15
We have 3 Seagate drives (ST360021A) installed in an RH 7.2 system with dual Celeron 533 MHz CPUs. The motherboard has 2 HPT366 UltraDMA/66 controllers.

The filesystem (ext3) keeps failing.
The first time, we used 2 drives for RAID and it failed. Then we put another drive in and copied all the data (some of it missing or damaged) to that drive.

After a while (about half a month), the filesystem failed again and we lost some data.

Now, for the third time, with one single drive, the filesystem (ext3) has failed again. We have lost important data: when we change (cd) to that directory, it says "Input/output error". Please help!
Tell me how to get the data back and what the possible problem is. Thanks

BTW, we have 2 IBM 47GB drives and have had no problems with them so far.

Leo
Question by:leochan72
10 Comments
 
LVL 51

Expert Comment

by:ahoffmann
ID: 7160529
Boot into single-user mode, unmount the corrupt filesystem, then run fsck.
0
 
LVL 1

Author Comment

by:leochan72
ID: 7160674
ahoffmann,

Well, I tried it. However, it gives this message:

"Inode 16419 has illegal block(s). Clear<y>?"

Actually, I ran e2fsck after I unmounted the drive when the drive first failed. I hit "Enter" at this message, and it erased something in the filesystem and made things even worse.

Any other suggestions? Thanks

Leo
0
 
LVL 51

Accepted Solution

by:
ahoffmann earned 200 total points
ID: 7160703
There is no other way than fsck to repair a corrupted filesystem, except professional data recovery services.
Well, you can use dd to dump the partition to a file, and then search through the dump to check whether you can find your data, 47GB ...
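
A minimal sketch of the "search through the dump" step, assuming the partition has already been copied to an image file with dd. The function name `find_pattern`, the chunk size, and the idea of searching for a byte string you remember from a lost file are illustrative assumptions, not part of the original advice. The image is read in chunks so that a dump of tens of gigabytes does not have to fit in memory.

```python
def find_pattern(image_path, pattern, chunk_size=1 << 20):
    """Return the byte offset of every occurrence of `pattern` in a raw image.

    image_path: path to the file produced by dumping the partition (hypothetical).
    pattern:    bytes you expect to appear in the lost data.
    """
    if not pattern:
        raise ValueError("pattern must not be empty")

    offsets = []
    overlap = len(pattern) - 1   # bytes carried over so boundary matches are seen
    tail = b""                   # end of the previous window
    base = 0                     # file offset of the first byte of `tail`

    with open(image_path, "rb") as image:
        while True:
            chunk = image.read(chunk_size)
            if not chunk:
                break
            window = tail + chunk
            start = 0
            while True:
                hit = window.find(pattern, start)
                if hit == -1:
                    break
                offsets.append(base + hit)
                start = hit + 1
            # Keep only the last `overlap` bytes; they are too short to hold
            # a full match on their own, so nothing is reported twice.
            keep_from = max(len(window) - overlap, 0)
            tail = window[keep_from:]
            base += keep_from
    return offsets
```

Each returned offset marks a place in the image where the data around it can be inspected or copied out. This only locates raw bytes; it does not rebuild directories or file names.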

Do you have any relevant messages in /var/adm/messages (on Red Hat the system log is normally /var/log/messages)?
0
 
LVL 1

Author Comment

by:leochan72
ID: 7160731
I found some messages like "Lserv kernel: APIC error on CPU0 : 04(04)", on CPU0 or on CPU1. Could it be that the dual-CPU system is not working properly and is causing this problem? Do you know what APIC is?

Leo
0
 
 
LVL 51

Expert Comment

by:ahoffmann
ID: 7160793
quote from /usr/doc/howto/en/SMP-HOWTO.gz (sounds like you have a hardware problem):

....
  14.
     "APIC error interrupt on CPU#n, should never happen" messages in
     logs

     A message like:

     ___________________________________________________________________
     APIC error interrupt on CPU#0, should never happen.
     ... APIC ESR0: 00000002
     ... APIC ESR1: 00000000
     ___________________________________________________________________


  indicates a 'receive checksum error'. This cannot be caused by Linux
  as the APIC message checksumming part is completely in hardware. It
  might be marginal hardware. As long as you dont see any instability,
  they are not a problem - APIC messages are retried until delivered.
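
A small sketch of how such an ESR value can be read, assuming the usual bit layout of the local APIC Error Status Register on Intel processors (bit 4 is left out because it is not an error flag on these CPUs). The names `ESR_BITS` and `decode_esr` are made up for illustration. It is also an assumption that the two hex numbers in a line like "APIC error on CPU0 : 04(04)" are ESR readings; check the kernel source for your version before relying on that.

```python
# Assumed meaning of the low eight bits of the APIC Error Status Register.
ESR_BITS = {
    0: "send checksum error",
    1: "receive checksum error",
    2: "send accept error",
    3: "receive accept error",
    5: "send illegal vector",
    6: "received illegal vector",
    7: "illegal register address",
}


def decode_esr(value):
    """Return the names of all error bits set in an ESR value."""
    return [name for bit, name in sorted(ESR_BITS.items()) if value & (1 << bit)]


# The HOWTO example (ESR0: 00000002) and the value from the log above (04).
for raw in (0x02, 0x04):
    print(f"{raw:02x}: {', '.join(decode_esr(raw)) or 'no error bits set'}")
```

Under these assumptions, 02 decodes to the receive checksum error the HOWTO describes, while 04 would be a send accept error. Either way the flags come from the hardware bus between the APICs, which fits the suspicion of marginal hardware.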
0
 
LVL 40

Expert Comment

by:jlevie
ID: 7161017
Have all of the failures been related to a particular drive? If so, it might point to a problem with that drive, its cable, or its controller. Also keep in mind that failing hardware (motherboard, one or both CPUs, or memory) could write bad data to the drive, thus damaging the file system it was writing to.

FYI, I had a similar problem with a racked server last winter. The problem turned out to be a marginal cooling fan that only became a problem in the coldest weather when the AC system essentially shut down due to low outside temperatures.  The system would begin to run a bit on the hot side and eventually crash, writing garbage to some of its mounted file systems in the process. Since the coldest time was in the wee hours of the morning no one was there to see the beginning of the failure and the next morning we'd be presented (some of the time) with a corrupt file system and data loss. Regular (daily) backups are a wonderful thing...
0
 
LVL 1

Author Comment

by:leochan72
ID: 7162610
So, is there no other way to recover the data except asking a data recovery company to do it?

I also know the importance of backups. However, I used the backup drive to replace the first failed RAID drives, and I don't want to format those failed drives, so there are no drives left in our company to do the backup. Too bad.

In fact, when I first installed the RAID drives, they might have been placed too close together and got too hot, which made them fail. However, I used the Seagate tool to check those drives and found nothing wrong with them.

jlevie, what OS was the failed system running? Was it Linux too? Is ext3 or ext2 not as good as FAT32 or NTFS? It looks like nobody has problems with NTFS or FAT32.

I want to keep this open for a while to see whether any other method is suggested.

Leo
0
 
LVL 40

Expert Comment

by:jlevie
ID: 7162726
The server that I was referring to was a w2k box, but it really doesn't matter what OS is being used when you have that kind of hardware problem. If the OS writes garbage to the disk, you wind up with a corrupt file system (be it FAT32, NTFS, UFS, XFS, ext2, etc). The solution, of course, is to eliminate the hardware problem.

On the subject of backups... Reliable backups of a server require a good schedule and history. By that I mean that one does a full backup at some scheduled interval and incrementals on all days between full backups. History is important and I wouldn't want less than two complete cycles (full & incrementals) on hand at all times. Unless you have some mega RAID volume on another system, this sort of backup strategy requires backup to tapes. Tape backup systems are expensive, but well worth the cost when something like this happens. If you can't afford a high capacity tape drive and must back up to disk, be darn sure to make the backup repository reside on a different system and make sure that you have enough disk on that box for at least two full cycles plus the current cycle. A corrupt or missing file might not be noticed immediately, and if you don't have history in your backups that goes back far enough you won't be able to recover. FWIW, on the servers that I manage I keep at least 6 months of backup history (monthly full & daily incrementals). On the really, really important servers I have a year's worth of history. These are all backed up to DLT tapes, with the tapes stored in a fire resistant safe and bi-annual full backups stored off site.
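
The retention rule described above can be sketched roughly as follows. This assumes a monthly cycle with the full backup on the first day of each month and incrementals on every other day; the function names and the default `cycles=2` are illustrative choices, not a description of any particular backup tool.

```python
from datetime import date, timedelta


def backup_kind(day):
    """Full backup on the 1st of the month (an assumption), incremental otherwise."""
    return "full" if day.day == 1 else "incremental"


def backups_to_keep(today, existing, cycles=2):
    """Keep the current cycle plus `cycles` complete earlier cycles.

    existing: iterable of dates for which a backup is on hand.
    Returns the sorted dates that should be retained.
    """
    existing = sorted(d for d in existing if d <= today)
    fulls = [d for d in existing if backup_kind(d) == "full"]
    if not fulls:
        return existing  # no full backup yet, so nothing is safe to discard
    # fulls[-1] starts the current cycle; step back `cycles` more full backups.
    cutoff = fulls[max(len(fulls) - 1 - cycles, 0)]
    return [d for d in existing if d >= cutoff]


# Example: daily backups from 1 March 2002 up to 17 July 2002.
start, today = date(2002, 3, 1), date(2002, 7, 17)
on_hand = [start + timedelta(days=n) for n in range((today - start).days + 1)]
kept = backups_to_keep(today, on_hand)
print(kept[0], len(kept))
```

With two complete cycles retained, everything from the May full backup onward is kept and the March and April sets become candidates for reuse. Raising `cycles` gives the longer history recommended for important servers.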
0
 
LVL 1

Author Comment

by:leochan72
ID: 7186526
Thanks, both of you. I believe there will be no more comments or solutions for the data recovery other than third-party recovery.

I am going to close this question and give the points to ahoffmann, as he was the first one who tried to answer my question. Anyway, thanks to jlevie.
0
