Urgent help required - SUSE Linux emergency mode

Sri M
Dear Experts,

We have a SUSE Linux server running on RAID 5. Due to a disk issue (which is now fixed) we are unable to boot the Linux server and are being dropped into emergency mode. Based on the error, I understand this is caused by some bad blocks.

I have Googled and found a few suggestions, but I don't want to try anything until it is clear it will avoid data loss.

Can someone shed some light on what the resolution to this could be?

Thanks

Regards


[ 7.112162] blk_update_request: I/O error, dev fd0, sector 0
[ 8.636019] XFS (sda3): Internal error xlog_clear_stale_blocks(2) at line 1365 of file ../fs/xfs/xfs_log_recover.c. Caller xlog_find_tail d0/0x3a0 [xfs]

Generating "/run/initramfs/rdsosreport.txt"
[ 8.704187] blk_update_request: I/O error, dev fd0, sector 0
[ 9.364187] blk_update_request: I/O error, dev fd0, sector 0
Entering emergency mode. Exit the shell to continue.

Type "journalctl" to view system logs.

You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot
after mounting them and attach it to a bug report.

Recovery of xfs file systems is not automated. We suggest using the 'xfs_repair' tool to repair any damage to the file system. If that
doesn't work you may zero the xfs log with the '-L' option to xfs_repair, however this can cause the loss of user files and/or data.
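
(For reference, saving that report from the dracut emergency shell typically looks something like the following; /dev/sdb1 is only a placeholder for the USB stick, so check lsblk or dmesg for the actual device name.)

mkdir -p /mnt/usb
mount /dev/sdb1 /mnt/usb                       # hypothetical USB partition
cp /run/initramfs/rdsosreport.txt /mnt/usb/
umount /mnt/usb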
arnold, Distinguished Expert 2017

Commented:
Can you give detail on what the drive issue was and how your RAID 5 was set up?
The complaint identifies sda3 as having an issue.

Without knowing your setup ...

If you used software RAID:
cat /proc/mdstat
mdadm needs to be used to reassemble the array, replace failed drives .......

fsck is commonly used to check filesystems.

fd0 refers to a floppy drive, if I'm not mistaken, and is not the issue.

In your case, you are using an XFS filesystem, and it gives you instructions to run xfs_repair.
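
For example, assuming the affected filesystem is /dev/sda3 (as the error suggests) and no software RAID is in play, a cautious first pass could look like this; the -n dry run reports problems without changing anything:

cat /proc/mdstat            # only meaningful if Linux software (md) RAID is in use
xfs_repair -n /dev/sda3     # dry run: report damage, modify nothing
xfs_repair /dev/sda3        # actual repair (filesystem must be unmounted)
# xfs_repair -L /dev/sda3   # last resort only: zeroes the log, can lose data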
Sri M, CEO

Author

Commented:
Hi Arnold,

Thank you for your reply

RAID 5 is set up on 4 x 2 TB SSD drives on an HP Smart Array controller. We have a similar setup on another server that has been running fine for over a year; this one is a pretty new setup, less than a month old. The Smart Array reported that two drives were having issues and removed them by itself, so we lost access to the logical volume where the VM storage runs. We removed the drives and reinserted them; the Smart Array detected them as existing drives and is doing parity initialization as I write this (that is the message displayed), which means it is rebuilding the data. This has been running for almost two days. We were able to run a couple of other small Linux and Windows VMs on the same storage without problems, but this Linux VM has a 4 TB virtual disk with 64 GB used so far. We made a copy to be on the safe side and are trying to boot the server.

Kindly suggest what we can do to resolve this.

Thanks
David Favor, Fractional CTO
Distinguished Expert 2018

Commented:
1) There is no way to avoid data loss if you force a boot to run on a corrupted boot disk.

Best to stop this approach now, because if you do get a boot to work, you can destroy all manner of data.

2) To fix this problem, attach another (working) boot disk to your machine using an external disk dock, then boot from the good disk.

3) Then mount your array + go about repairing your array.
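
A rough sketch of that approach, assuming the damaged filesystem is /dev/sda3 and a spare disk is mounted at /mnt/backup (both names are placeholders):

mkdir -p /mnt/recovery
mount -o ro,norecovery /dev/sda3 /mnt/recovery   # read-only, skip XFS log replay
rsync -a /mnt/recovery/ /mnt/backup/sda3-copy/   # copy off whatever is readable
umount /mnt/recovery
xfs_repair /dev/sda3                             # repair only after the copy is safe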
arnold, Distinguished Expert 2017

Commented:
Usually, when the hardware kicks out both drives, removing and reinserting them is not advisable. The other issue is whether you pulled/reinserted the last drive that was kicked out, which is the one that took the logical volume offline.

If it is rebuilding based on the wrong disk's metadata then, depending on the time between failures, data loss from the interim period is expected and unavoidable.
Sri M, CEO

Author

Commented:
Hi Arnold,

The logical volume went offline before we pulled the drives out and reinserted them, not after the reinsert. We tried a reinsert because they are brand-new SSDs, and we were also scared of losing data by replacing them with new drives. We thought that by reinserting them the controller might recognize the data, which actually saved the 4 other VMs on the same storage; only this SUSE server, which has the large disk, didn't come back up and is throwing that error.

Regards
arnold, Distinguished Expert 2017

Commented:
Your only option is to follow the recommended solution as indicated. XFS, unlike prior filesystems, does not attempt an automatic repair when an issue is detected. With other filesystems, when the system detected an error (dirty bit) the OS ran fsck, and it only dropped to an emergency shell when an essential filesystem such as /usr or /etc was impacted by the erroneous state.

The info from your error suggests that after you check the output of journalctl and the report file, you run xfs_repair on the filesystem.
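
For instance, to keep a record of what the failed boot reported before attempting any repair (the grep filter is only an illustration):

journalctl -xb > /tmp/boot-errors.txt            # current boot, with explanatory text
journalctl -k | grep -iE 'xfs|blk_update|sda'    # kernel messages about the disk and filesystem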

Some data loss is possible depending, as I indicated, on the difference in timing of the SSDs being kicked out and the order in which the rebuild proceeded after your reinsert ...

Say your setup has 6 hot-swap SSDs:

At 1pm SSD 3 was kicked out
At 1:05pm SSD 5 was kicked out, bringing down the logical volume

You pulled and reinserted SSD 3.
The rebuild began based on the metadata that was present on SSD 3; any events/transactions that occurred in that 5-minute interval will be discarded during the rebuild of the logical volume.

It seems the other VMs came up, but you cannot tell whether or how many changes occurred in those five minutes.
On the SUSE system, the XFS consistency seems to have gone out of whack, and you find yourself in the current situation.


Commonly, an array/logical volume crashes because of a failure above the fault tolerance (RAID 5 tolerates a single drive failure); here the additional SSD failure is what brought the volume down.

One has to carefully review the RAID controller log in this case; the last SSD kicked out needs to be the first one re-added, either through a force online or a pull/reinsert if that is the procedure on the underlying system.

In this type of situation, the reassembly attempt will then be based on the most recent data at the time of the "failure".

Once the volume is back in an optimal state, check whether the controller has had firmware updates since your last check; perhaps the issue is a bug that has already been addressed.
I am unsure you will actually retrieve your data.

I would normally advise making an offline copy of the drives before doing anything else, to avoid further data loss.

Given the current situation, your best bet is probably to let the RAID 5 rebuild complete, then run xfs_repair and hope you do get something back. There are also a number of tools to retrieve undeleted files from XFS that might be able to grab pristine versions of files that would otherwise appear rolled back after the process. Minimize writes to those drives as much as possible until the recovery is complete. I would probably use a live CD as soon as the basic recovery is complete.
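
If you do make an offline copy first, one way is to image the partition with GNU ddrescue and run the repair against a copy of the image rather than the original disk; the device and paths below are assumptions:

ddrescue -d -r3 /dev/sda3 /mnt/backup/sda3.img /mnt/backup/sda3.map   # image the partition, retrying bad sectors
cp /mnt/backup/sda3.img /mnt/backup/sda3-work.img                     # keep the first image pristine
xfs_repair /mnt/backup/sda3-work.img                                  # xfs_repair can be pointed at an image file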

--

A few side notes:
- RAID 5 is NOT safe in terms of data consistency; it never has been and never will be.
- RAID 5 on top of SSDs is a frequent setup; unfortunately, that's really asking for trouble.
- RAID arrays are no substitute for backups.
Sri M, CEO
Commented:
Closing comments: we tried many ways to recover the data on this specific VM, however with no luck so far. I guess we have to deal with the data loss.

Thanks a lot to all those who stepped in to help with this.
David Favor, Fractional CTO
Distinguished Expert 2018

Commented:
Note: You can almost surely recover your data. There is no mention of whether you tried every option provided to you; you might just pull your RAID controller + disks and send them to a data recovery company.

If you're rebuilding from scratch, think twice about using XFS. As arnold said, XFS doesn't work... technically... the way most other filesystems work.

When you rebuild, be sure to set up a completely separate partition like /data for your XFS data, so all your OS files go onto an ext4 filesystem.

Trying to run a full XFS system produces recovery problems like you're seeing now.
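
A minimal example of the layout David describes, with the OS on ext4 and only the data partition on XFS; device names and mount points here are placeholders:

# /etc/fstab (illustrative)
/dev/sda1   /       ext4   defaults   0 1
/dev/sda2   /boot   ext4   defaults   0 2
/dev/sda3   /data   xfs    defaults   0 0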
skullnobrains

Commented:
Lol, I'm using XFS on production servers with few to no issues, and have been for a long time.

Here are a few findings based on experiments with both accidental and deliberate power losses on multiple virtual and physical machines.


XFS sometimes produces a few empty files when new small files are being written at the time of the power loss. Those are normally recoverable.

ext4 will happily lose a week's worth of logs in similar situations (due to a very poor buffer pool flushing policy).

ext3 without journaling does worse, but I have not tried it in years.

UFS will sometimes append a block of 65k null bytes, mainly to log files, if you are using soft updates. I have never seen any loss of any kind without soft updates.

If you want a rock-solid filesystem, switch to ZFS. It will stay consistent and will seldom lose more than 0 to 1 second's worth of writes, which is quite acceptable; this drops to next to zero with battery-backed disks or a ZIL on SSD. It also handles RAID much better than anything else around, but that's another story. Note that the Linux version of ZFS has been rewritten (for idiotic licence reasons) and should probably not even be considered production ready (or actually be called ZFS, since the two GPL implementations have nothing in common with the original).

Kamikazes may also try Btrfs, which attempts to copy ZFS (what is even the point?) with very mild success so far. My personal experience with it is next to zero given the results of the first early tests, but that was a couple of years ago; maybe it has matured in the meantime.

NTFS will usually behave decently, but occasionally messes things up to such an extent that the beginning of a file might be appended with pieces of other files.

FAT32 might break entirely.

VFAT does not usually break, but may literally lose files, including old unmodified ones.

My 2 cents.
arnold, Distinguished Expert 2017

Commented:
I think XFS is the last thing in the chain here and unlikely to be the culprit.
The RAID volume failure is the beginning.
Much depends on the volume recovery process. For the smaller VMs the end result was not impactful; much depends on how the other VMs differ from the larger one. It is one thing to have data on an XFS partition; without knowing how this VM is laid out, .....
Since emergency mode kicked in, it suggests the entire OS depends on XFS.
Sri M, CEO

Author

Commented:
Hi David / Arnold / skullnobrains

Appreciate the info. It will definitely help us in the future.

For now it is too late; our guy removed and reinserted the disks, as they were SSDs, and he assumed the RAID rebuild would recover the data. It did work partially: all VMs, including a few Linux VMs, came back up without any issue, except the one we are discussing here, which has the large disk. For now we presume the data is lost, as the RAID rebuild completed after the reinsert and only the SUSE VM drops to recovery mode. We also contacted a data recovery company; the approximate cost they mentioned is exorbitant, but I guess that is what we need to pay.

Lesson learned. We also understood a lot of things from you all, which will definitely help us plan better going forward with RAID and Linux VMs especially.

Thank you all for your time, and have a great holiday. Happy New Year :)

Best Regards.
