Shaun Wingrin (South Africa) asked:

RAID 1 unexplained file system error on CentOS 5.6

I have 2 identical servers. I took the perfectly working pair of drives out of one and put them into the other server, and got the following error:

Checking file system: / contains a file system with errors, check forced.

It reaches about 60% and then:

Extended attribute block 19334147 has reference count 1024, should be 992
UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY

and drops me to a Control-D shell!

Can the system time cause such an error? The times are different!
If I put the drives back into the original server I get a similar error condition!

Any ideas as to how to repair the above unexplained error?
woolmilkporc (Germany) replied:
It seems there was already a defective filesystem on the originating box.

Why don't you run fsck manually against the filesystem in question (/, apparently)?

The system time is most probably unrelated to this issue.

wmp
... run an fsck against / by touching a file /forcefsck and rebooting afterwards,

or run shutdown -rF now
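Spelled out as exact commands, a minimal sketch (both routes force the check on the next boot):

touch /forcefsck    # empty marker file checked by the init scripts; removed after the forced fsck
reboot
# equivalently, in one step:
shutdown -rF now    # -F forces an fsck of the filesystems on reboot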

wmp
Shaun Wingrin (Asker):

I don't believe there was a defective file system on the originating box, as it didn't show any errors and booted fine.
It did mention an error about the file system time being in the future and corrected something....
Try pressing the letter F when it says:

"run fsck manually ..."
OK. Will try.

A question out of curiosity: if I accidentally swapped the drives around and put them in slots 2 and 1 instead of 1 and 2, would this cause an error?
Pressing F does nothing, as it drops to the prompt:
Give root password for maintenance.
And? Did you give the root password?

By the way, another question out of curiosity:
If I remove one of the RAID drives, it gives this error:
error pdc: wrong # of devices in RAID set pdc_bbfcbhgdy 1/2 on dev/sda

Can the system still manage to boot with only one RAID drive? It's RAID 1, so it should?
It should boot, but you will have to get around the error:

Press a key early in the boot sequence to pull up the GRUB boot menu, add the keyword "nodmraid" to the kernel command line, and see if it boots.
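In the legacy GRUB that CentOS 5 ships, that means: highlight the entry, press 'e', select the kernel line, press 'e' again, append the option, then press 'b' to boot. A sketch of the edited line (the kernel version and root device are placeholders, not values from this system):

kernel /vmlinuz-2.6.18-xxx.el5 ro root=LABEL=/ nodmraid

The change is one-off and is not written back to grub.conf.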
I gave the root password. It gives a # prompt (Repair filesystem).

Should I do this from the # prompt?
"... run an fsck against / by touching a file /forcefsck and rebooting afterwards,"
What are the exact commands?

What does "nodmraid" do?
By the way, at the root prompt I ran:
shutdown -rF now
It reboots, but with errors about a READ ONLY FILE SYSTEM.
Try

/sbin/fsck

nodmraid: disables dmraid (BIOS/fake software RAID) discovery.

dmraid: discovers and activates such software RAID sets.

Try

mount -o remount,rw /

at the "# (Repair ...)" prompt.

"mount -o remount,rw / on "#(Repair ...)" prompt."

Gave errors: Jourbal has aborted....
remount read-only

With nodmraid still saw in boot sequence:
dmraid45....so not sure if it actioned it....?


This is a filesystem/journal mismatch and a bit hard to repair.

You could try this:

1) tune2fs -O ^has_journal /dev/hdxxx
with /dev/hdxxx being the underlying device of the FS in question.

2) e2fsck /dev/hdxxx

3) tune2fs -j /dev/hdxxx

4) mount /dev/hdxxx /
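As a concrete sketch, assuming the filesystem turned out to live on /dev/sda2 (substitute the real device once identified):

tune2fs -O ^has_journal /dev/sda2   # remove the (corrupt) journal; the FS becomes plain ext2
e2fsck -f /dev/sda2                 # force a full check with the journal out of the way
tune2fs -j /dev/sda2                # create a fresh ext3 journal
mount /dev/sda2 /                   # remount the repaired filesystem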

 
Say, what is hdxxx?

This is fdisk -l:
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          13      104391   83  Linux
/dev/sda2              14       30175   242276265   83  Linux
/dev/sda3           30176       30272      779152   82  Linux swap / Solaris
Rather issue "mount" and look for

/dev/sda[x] on /

then use this /dev/sda[x].
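For example, a quick way to pick out just that line (a sketch):

mount | grep ' on / '    # prints the device currently mounted as the root filesystem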

According to the fdisk output it should be /dev/sda1

wmp

I assume you'll have to boot from some rescue media first since it seems we're talking about the active boot partition.
mount returns:
/dev/mapper/pdc_bbfcfbhgdgp2 on / type ext3 (rw)
and then a whole lot of other things below, ending with a warning: /etc/mtab not readable....

Any ideas?
"I assume you'll have to boot from some rescue media first since it seems we're talking about the active boot partition."
Any ideas?
This all seems very round about to repair something that was working perfectly and just broke when inserting into an identical server....?
Is there not something else that I can try?
Also because its a RAID partition - one needs to surely run the repair on the RAID partition....
Alternatively - if I removed one RAID drive and repaired it and then rebuilt the 2nd drive - would this make more sense?
Your root filesystem seems to be mounted r/w and OK.

Which problematic filesystem are we actually talking about, please?

It seems your original question was misleading: "Checking file system: / contains a file system ..."

Just to clarify, there is a file system error, and this is the message:
"Checking file system: / contains a file system with errors, check forced."
However, the cause was simply moving the drives from one identical server to another and back.
The error now exists in both servers.
It was working perfectly in the 1st server.
Something is causing the OS to think there is an issue....
I'm looking for an easy way to solve this...?
Can you perhaps ask some of the other experts if they have any idea as to what is causing this behavior?
This is not the 1st disk pair I've had identical issues with...
Rather delete this Q and ask a new one, perhaps posting some more detailed output with it.

I'll abstain from the new question; let's hear what the other experts say ...

Good luck!
Thanks. What output do you suggest?
Your new Q looks quite OK.

Don't forget to delete this one, to get your points back.

wmp
I've requested that this question be deleted for the following reason:

No solution found as yet
Please update the question with this at the bottom of it:
Can the system time cause such an error? The times are different!
If I put the drives back into the original server I get a similar error condition!

Just to clarify, there is a file system error, and this is the message:
"Checking file system: / contains a file system with errors, check forced."
However, the cause was simply moving the drives from one identical server to another and back.
The error now exists in both servers.
It was working perfectly in the 1st server.
Something is causing the OS to think there is an issue....
I'm looking for an easy way to solve this...?


SOME SYSTEM INFO:
This is fdisk -l:
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          13      104391   83  Linux
/dev/sda2              14       30175   242276265   83  Linux
/dev/sda3           30176       30272      779152   82  Linux swap / Solaris

mount returns:
/dev/mapper/pdc_bbfcfbhgdgp2 on / type ext3 (rw)
and then a whole lot of other things below, ending with a warning: /etc/mtab not readable....
I wouldn't disable the array when running fsck on it; otherwise you would have to rebuild the array if fsck repairs anything.

I'm not sure whether you have already done that. Can you verify you haven't yet run fsck with the RAID disabled?
Please don't delete the question then... but ask other experts for comment please, and update the question as above.
Hi rindi. I haven't run fsck with the RAID disabled. The array is still in place.
To me it looks like the system isn't using RAID (unless it is hardware RAID and not software RAID). What does mount output?

The reason I think you aren't using RAID is that with Linux software RAID you should see something like /dev/md1 etc., not /dev/sda1 etc.; sda would be a single drive.

Are you using hardware RAID? That would also show a single drive, like /dev/sda....?
It's RAID configured in the system BIOS; it's not an add-on card.

mount returns:
/dev/mapper/pdc_bbfcfbhgdgp2 on / type ext3 (rw)
and then a whole lot of other things below, ending with a warning: /etc/mtab not readable....
Also see:
If I remove one of the RAID drives, it gives this error:
error pdc: wrong # of devices in RAID set pdc_bbfcbhgdy 1/2 on dev/sda
The server is an HP ProLiant MicroServer using the AMD RAID chipset.
ASKER CERTIFIED SOLUTION by rindi (Switzerland). [Solution text available to Experts Exchange members only.]
How do I boot into single-user mode?
I just checked and don't think you need that; use the shutdown -rF now command that woolmilkporc posted earlier, then try the fsck I mentioned.
See what I did above:
By the way, at the root prompt I ran:
shutdown -rF now
It reboots, but with errors about a READ ONLY FILE SYSTEM.
Anyone with any suggestions please?
It is supposed to be a read-only file-system when you run fsck (at least when you repair things). The reason is that root is mounted: if it were mounted read-write while fsck ran, fsck could cause havoc. This way fsck writes its changes to the file-system while the OS itself cannot, because the mount is read-only.

Usually you would run fsck from a boot CD to make sure the file-system you want to repair isn't mounted at all, but boot CDs don't usually recognize this kind of software RAID out of the box, so it is easier to do it from the installed OS while the file-system is mounted read-only.
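If you did go the rescue-media route, you would typically have to activate the fakeraid set by hand before checking it; roughly (a sketch, reusing the device-mapper name reported earlier):

dmraid -ay                              # discover and activate all fakeraid sets
fsck -f /dev/mapper/pdc_bbfcfbhgdgp2    # check the filesystem while it is unmounted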
So what do you suggest I do now?
Run fsck /dev/mapper/pdc_bbfcfbhgdgp2 -y

once it has booted into the read-only file-system.
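The -y answers "yes" to every repair prompt automatically. If you want to see the damage before committing to repairs, a more cautious first pass would be (a sketch):

fsck -n /dev/mapper/pdc_bbfcfbhgdgp2    # dry run: report problems, change nothing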
I ran fsck /dev/mapper/pdc_bbfcfbhgdgp2 -y and it said the file system still has errors, but to reboot... I rebooted and got a whole host of new errors.
I ran fsck /dev/mapper/pdc_bbfcfbhgdgp1 -y, and this runs very quickly; it also appears in the mount output.
I ran fsck /dev/mapper/pdc_bbfcfbhgdgp2 -y again and rebooted, but there are still file errors.
Any ideas?
When it reboots it gives errors about read-only, and when it boots it also gives read-only errors.
The MicroServer uses fake RAID; it's not a hardware RAID card.

Are you using the drivers supplied by HP?
OK... take a step back.... It seems to me you're assuming the array was fine when you removed it in the first place... just because it would boot then doesn't mean there weren't errors then! That's one reason why there is a date and mount-count meter on the filesystem: so that an fsck is done at least once every xxx days or yyy mounts, as errors can sometimes "sneak up on you"...

So ask yourself:

Q: WHY do you use RAID-1?
A: Because my data is stored identically on two hard drives so that if one fails, all of my data is safe on the other

Q: What has happened?
A: I'm getting seemingly random errors on my RAID 1 Array

Q: How could that be?
A: In RAID 1, the disk READ can be assigned to either drive... assume 1 drive is good, the other bad, you'll get random failures whenever the RAID controller (hard or soft) selects the bad drive to be the source (read) drive.

Q: How do I fix it?
A: Test the drives independently -- e.g.: BREAK THE MIRROR (physically remove one drive) and run FSCK on each drive separately (without the other drive in the array). DO NOT boot the system LIVE onto either one, unless you are sure that nothing important will happen (or get stored to it) while it is up. (NOTE: You may as well admit it at this point, with all the steps you've already taken -- at this point you're in disaster recovery mode, not reboot and you're up mode!)

You should probably also look at the output of smartctl from the smartmontools package (smartctl -H <device>)
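For example (a sketch; the device names are placeholders, and each drive should be tested on its own):

smartctl -H /dev/sda    # overall SMART health verdict
smartctl -a /dev/sda    # full attribute dump: watch reallocated/pending sector counts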

My guess is that one will pass fsck (and/or smart), the other not so much... NOTE: The one that passes may have some minor errors -- but they'll PALE in comparison to the other one!

Once you know which drive is the good one, get a new 2nd drive and synch your data to it... THEN you should be back in business!

I do hope this helps....

Dan
IT4SOHO
Are you sure it's running RAID 1 under CentOS 5.6? There do not appear to be any Linux SATA RAID drivers published by HP for the MicroServer on their website, and I know that this controller appears as two independent disks (no RAID) under VMware ESX/ESXi.

Windows 2008/2008 R2 have a SATA RAID driver for the AMD Ready RAID function.
What is the hardware you are using: server make/model?
Is it hardware RAID, or are you using software RAID?
If hardware: did you get an alert during bootup informing you that the RAID volume is "seen" as incorrect and asking whether you want to adjust or accept it? Did you accept it? With hardware RAID you usually have to purge the configuration on the controller, then insert the drives and have the controller read the RAID configuration in from the disks.


Your fdisk -l reports that a single drive is seen, which suggests the volume is based on hardware RAID. If you can get into the RAID controller, you may see the RAID volume reflected in degraded mode.
I believe he is using an HP ProLiant MicroServer, which uses a software fake-RAID SB700 SATA controller in RAID mode.

see here

http://h20000.www2.hp.com/bizsupport/TechSupport/Home.jsp?lang=en&cc=us&prodTypeId=15351&prodSeriesId=4248009&lang=en&cc=us
With that information, the drives cannot simply be moved from one system to the next and have everything work.
Moving both drives at the same time leaves no fallback option, since both disks are then marked.
Shutting down the original system and moving only one of the RAID 1 drives is a way to retain a functional fallback if the move does not work.

I believe the RAID configuration had to be cleared prior to attempting to boot the system using the moved drives.

I don't think he's using the controller's RAID, but rather CentOS software RAID.

What does cat /proc/mdstat say?
fdisk -l in http:#a36558971 suggests that the OS only sees a single drive, /dev/sda.
The partition type is also 83, but with Linux software RAID it would likely be fd.
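For reference, on a box that really is running Linux md software RAID, cat /proc/mdstat shows something roughly like this (a sketch; names and sizes are illustrative):

Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

An empty Personalities line with "unused devices: <none>" means no md RAID is active.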
This is what I am using:

"I believe he is using an HP ProLiant MicroServer, which uses a software fake-RAID SB700 SATA controller in RAID mode.

see here

http://h20000.www2.hp.com/bizsupport/TechSupport/Home.jsp?lang=en&cc=us&prodTypeId=15351&prodSeriesId=4248009&lang=en&cc=us"

Any ideas how to recover the disk?
Run the cat /proc/mdstat command I posted; it should help us find out whether you are using software RAID or the controller's RAID.
If I run cat /proc/mdstat from the repair console, it returns:
Personalities :
unused devices: <none>
....?
Then it does look like you are using the controller RAID (there is actually a Red Hat driver for it, and CentOS is a Red Hat clone, so you would use that same driver on CentOS).

Boot the server into the RAID config utility and check what status you get there. If possible, put the HDs in a PC or server that doesn't have RAID, then test them using the HD manufacturer's diagnostic utility.
The boot-time RAID tools show everything OK...?
What steps did you follow to transfer the disks from one server to the other?
Did you get an error prompt during the initial boot after the disks were transferred to the other system?
I just pulled and inserted them.
No error.
When you moved the drives from one server to the next, what steps did you take to perform the transfer?
SOLUTION [text available to Experts Exchange members only.]
My suggestion would be to remove one of the drives and see whether the system operates without errors while the RAID is broken.

I think this is the test route it4soho is suggesting.
I'm not sure how you will determine which of the two drives is out of sync.
I think a re-install is needed. Thanks.