Solved

attempt to access beyond end of device

Posted on 2008-10-22
1,365 Views
Last Modified: 2013-12-06
Below is what we have been experiencing over and over again. I am not sure if it is a bad hard drive or something with the 3ware RAID card.

The first thing that happens is that a file shows up on the filesystem that is about 6.2 petabytes. This is impossible, because the RAID array is only 182 GB.

Second, our backup server tries to back up this 6.2 PB file. It keeps trying until someone finally stops it; I guess it would otherwise run until the backup server's drive is full.

Third, the log shows
Oct 22 04:03:02 fs kernel: attempt to access beyond end of device
Oct 22 04:03:02 fs kernel: sda2: rw=0, want=6736695736, limit=386427510
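
If I am doing the math right (assuming the kernel is reporting 512-byte sectors), the limit matches the size of the partition, while the requested sector is far beyond it:

echo $(( 386427510 * 512 ))     # ~198 GB (184 GiB) -- roughly the size of sda2
echo $(( 6736695736 * 512 ))    # ~3.4 TB -- far past the end of the device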

This makes perfect sense, because the file is not really there.
The last thing that happens is that the filesystem finally switches over to read-only mode. We then have to reboot the server. On reboot, FSCK always says the drive contains errors, then goes through and fixes a lot of inode problems.

Any ideas on why the 6.2 PB file keeps being created and how we could stop it?
This is a RAID 1 with two 186.31 GB WD drives, running an ext3 filesystem.
Question by:clintonm9
5 Comments
 
LVL 23

Accepted Solution

by:Mysidia (earned 250 total points)
ID: 22783082
Not to rule out hardware issues; this could be caused by a problem with the controller or (theoretically) the drive, but a drive failure would most likely result in the controller failing the drive and marking the array degraded.

It sounds like an inconsistent filesystem to me.
FSCK is most likely not able to fix all the problems.
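
If you want to see how far fsck can actually get, a forced full check run with the filesystem unmounted (e.g. from a rescue CD or single-user mode) is more thorough than the automatic boot-time pass. A rough sketch, assuming the array shows up as /dev/sda2 as in your log:

umount /dev/sda2          # the filesystem must not be mounted while checking
e2fsck -f -y /dev/sda2    # -f forces a full check even if marked clean; -y answers yes to all repairs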

An ext3 filesystem with corrupt metadata is an insidious problem to fix. That kind of corruption is best avoided, where possible, by making sure your kernel is reasonably modern, with old bugs addressed.

Ext3 is journaled, but it is not perfect -- especially if your hardware implements write caching and the write cache is not battery-backed.

This type of corruption is possible in a simple power failure situation.

It can also be caused by a software (OS) bug, or a controller/hard drive issue.
Unless you start testing your hardware and looking through 'dmesg' for errors,
there is no way to tell.
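
A rough starting point for that kind of checking (the device names and 3ware port numbers below are only examples and will differ on your system):

dmesg | grep -i -E 'error|fail|sd'     # look for kernel-level I/O or controller errors
smartctl -a -d 3ware,0 /dev/twe0       # SMART data for the drive on port 0 behind the 3ware card
tw_cli /c0 show                        # 3ware CLI: controller, unit, and drive status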




Clearly a backup should be made of all files.

If possible, swap both the controller and the drives with spares, and test the possibly bad controller and drives on a separate test system.


I think fresh EXT3 filesystems should be re-created, and the backup files then loaded onto them.

This is really the only way to ensure there are no unknown errors in the filesystem metadata.
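
A very rough outline of that rebuild, assuming the rebuilt array still appears as /dev/sda2, gets mounted on /home, and the backup is a tar archive (substitute whatever your backup tool actually produces):

mkfs.ext3 /dev/sda2                  # creates a fresh filesystem -- destroys everything on the partition
mount /dev/sda2 /home
tar -xpf /backup/home.tar -C /home   # restore from backup; -p preserves permissions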


*Cloning a filesystem with a tool like 'dd' copies it, but if the source filesystem had metadata corruption, so will the copy.



 

Author Comment

by:clintonm9
ID: 22785530
This is a production system, so that would have to be done in the middle of the night.

I just ran the following command, and you can see how many of the files are impossibly big. You can also see the error that appeared while it was running.

find /home -type f -size +5000000k -exec ls -lh {} \; | awk '{ print $9 ": " $5 }'
/home/websites/mysalonsite/htdocs/data/accounts/xclusivetan159691/pageviews/day/2008/06.19.Services: 6.3E
/home/websites/mysalonsite/htdocs/data/accounts/EliteTan84463/pageviews/day/2006/02.11.About: 6.4E
/home/websites/fsordering/weblogs/error_log: 11G
/home/websites/fsordering/htdocs/images/2728836/products_sample_2.gif: 13E
find: /home/websites/fsordering/htdocs/images/products/s30478L.jpg: No such file or directory

Message from syslogd@ at Thu Oct 23 07:51:42 2008 ...
fsordering kernel: journal commit I/O error

 

Author Closing Comment

by:clintonm9
ID: 31508759
The solution will fix the problem, but it will be a big job.
 
LVL 23

Expert Comment

by:Mysidia
ID: 22800919
I/O errors of this nature are serious; some file data corruption may already have occurred, even if it hasn't been noticed yet. I suggest you also lengthen the amount of time you retain old backups for that server, and be sure to keep the backups updated, or take as complete a new one as possible, until the server rebuild can be performed.
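
One way to keep those backups current without tripping over the phantom multi-petabyte files would be something along these lines (the host, paths, and size cutoff are only examples):

# Copy /home to the backup host, skipping anything larger than 100 GB --
# on a 182 GB array, anything that size can only be a corrupt directory entry.
rsync -a --max-size=100G /home/ backuphost:/backups/fs/home/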

Weigh this against the risk of the server going down due to trying to take additional backups.

It just depends on which is more important in your situation: having up-to-date copies of the data in case the corruption or drive/controller problem creeps further and causes loss of information, or maximizing uptime.

I.e., is it OK to lose a few days' worth of data on this server in exchange for less downtime?

If this were just a DHCP server, uptime would be more important, and a few days of lost data would be irrelevant.

On the other hand, if this is a file server that holds users' home directories, the loss of a few days of data could be costly.

It might also be a good idea to have a contingency plan ready, such as temporarily offloading the production server's function to a server normally used for testing.




 

Author Comment

by:clintonm9
ID: 22801440
Thanks for the info. The last two nights I got a good backup. It seems like the big files are not showing up right now, but I bet they come back soon. Thanks for everything!
