How do I resolve "scsi0: ERROR on channel 0, id 0, lun 0, CDB: " error messages?

slcoit
Running VMware 3.5 on Dell 2950s using an EMC CX320 SAN.
Created a virtual server with a guest O/S of Red Hat Enterprise Linux 4.
Set up as an Oracle Application Server.
We had 4 drives go bad on the SAN and we replaced those drives.
After that we moved the entire system to a new location across town.
Now this vServer is showing the following daily.

Logwatch report contains:

      ---------------------- pam_unix End -------------------------

      Jan 26 05:49:43 end_request: I/O error, dev sda, sector 39392994
      ……….
      ……….      {Total 136 lines with same message just a different sector number}
      ……….
      Jan 26 10:23:15 end_request: I/O error, dev sda, sector 3939300

      --------------------- sendmail Begin ------------------------


/var/log/messages contains:

      Jan 26 05:49:43 apputil kernel: scsi0: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 02 59 16 e2 00 00 80 00
      Jan 26 05:49:43 apputil kernel: Current sda: sense key Medium Error
      Jan 26 05:49:43 apputil kernel: Additional sense: Unrecovered read error
      Jan 26 05:49:43 apputil kernel: end_request: I/O error, dev sda, sector 39392994
            {These lines repeat many times with different ‘Read’ and ‘sector’ numbers}

Ran: badblocks -sv /dev/sda#
I ran the badblocks -sv command on all my partitions and all but sda5 came back with 0 bad blocks found:
/dev/sda5 came back with:      Pass completed, 68 bad blocks found.

Are these bad blocks the cause of the Logwatch messages?
How do I resolve all the above?

Commented:
Check firmware and patches for VMware.  Run fsck or e2fsck.  If the drive is failing, replace it.
President
Top Expert 2010
Commented:
Jmluc123's advice, respectfully, is not correct.
The problem is that while your RAID was being rebuilt, the controller hit an unreadable block on one of the surviving disks.   RAID5 has only one parity block per stripe, and you had already lost one of the blocks at offset X, so there was not enough information to rebuild that stripe.

The EMC then marked the entire stripe as unreadable, rather than just filling it with bad data.  This tells any application that the controller has no idea what is supposed to go there, but that the area is important and held data it couldn't restore.  (Lesser controllers would have incorrectly stuffed it with zeros and moved on.)

So where do you go from here ...
  You have data corruption at LOGICAL BLOCK 39392994.  This means that you are likely going to have data corruption for some number of blocks before and after that one, because of the nature of the architecture and stripe size.
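
The block number can be cross-checked against the CDB bytes the kernel logged. A minimal sketch, assuming the bytes printed after "Read (10)" follow the standard READ(10) layout (flags, 4-byte big-endian LBA, group, 2-byte transfer length, control); the function name decode_read10 is only illustrative:

```sh
#!/bin/sh
# Decode a READ(10) CDB as printed by the kernel after "Read (10)".
# Arguments are the nine hex bytes that follow the opcode name:
#   flags, LBA (4 bytes, big-endian), group, transfer length (2 bytes), control
decode_read10() {
    lba=$(( (0x$2 << 24) | (0x$3 << 16) | (0x$4 << 8) | 0x$5 ))
    blocks=$(( (0x$7 << 8) | 0x$8 ))
    echo "lba=$lba blocks=$blocks"
}

# Bytes taken from the /var/log/messages line in the question
decode_read10 00 02 59 16 e2 00 00 80 00
```

For the logged line this prints lba=39392994 blocks=128, i.e. the same number as the "sector 39392994" in the end_request message, with a 128-block (64 KB) read request.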

The controller will always report this error whenever an app tries to read that block because it needs to be told what to put there.  You need to just fill it with zeros.
Use dd on the physical device, i.e.   dd if=/dev/zero of=/dev/sda bs=512 seek=39392994 count=1

Once you write zeros to that block the messages go away.
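
A slightly more cautious way to do the same thing is to overwrite a sector only when a read of it actually fails. This is a rough sketch, not a tested recovery tool: the function names are made up for illustration, 512-byte sectors are assumed, and conv=notrunc is there only so the functions can be tried on an image file without truncating it.

```sh
#!/bin/sh
# Overwrite one 512-byte sector with zeros.
# $1 = target device (or an image file when experimenting), $2 = sector number
zero_sector() {
    dd if=/dev/zero of="$1" bs=512 seek="$2" count=1 conv=notrunc 2>/dev/null
}

# Zero a sector only if it cannot be read, so readable data is left alone.
zero_if_unreadable() {
    if dd if="$1" of=/dev/null bs=512 skip="$2" count=1 2>/dev/null; then
        echo "sector $2 reads fine, not touched"
    else
        zero_sector "$1" "$2" && echo "sector $2 overwritten with zeros"
    fi
}
```

It would be called as, for example, zero_if_unreadable /dev/sda 39392994. Try it on a scratch image file first; writing to the wrong device or sector destroys good data.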

The badblocks program said you have 68 bad blocks, which makes sense since the RAID does I/Os in stripes for speed.
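
Note that badblocks reports block numbers relative to the partition it scanned, in its own block size (1024 bytes unless -b is given), while the kernel messages use 512-byte sectors counted from the start of the disk. If the default block size was used, 68 bad blocks would cover 136 sectors, which happens to match the 136 lines in the Logwatch report. A small sketch of the conversion, assuming you look up the partition's start sector yourself (for example with fdisk -lu /dev/sda); the function name is hypothetical:

```sh
#!/bin/sh
# Convert a badblocks block number (relative to a partition) into the
# range of 512-byte sectors on the whole disk that the kernel would log.
# $1 = partition start sector, $2 = badblocks block number,
# $3 = badblocks block size in bytes (defaults to 1024)
badblock_to_sectors() {
    start=$1; block=$2; bs=${3:-1024}
    per_block=$(( bs / 512 ))
    first=$(( start + block * per_block ))
    last=$(( first + per_block - 1 ))
    echo "$first-$last"
}
```

For example, with a partition starting at sector 1000, block 10 maps to sectors 1020-1021.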

Now remember, you do have data corruption. Files are destroyed.  If you dd to stuff it with zeros you will never learn the file names that are corrupted.  You need to write a shell script to go through every file you have and see which file(s) give read errors.  That will tell you which are damaged.


DavidPresident
Top Expert 2010

Commented:
P.S.  I should have said that files are corrupted, not necessarily destroyed.  The bad block could be in the middle of a 1GB database file, which means the database is corrupt, or it could just as easily be a scratch file in /tmp that is long gone.

The error(s) could also be in some unused space, but with 68 bad blocks, and the fact that the O/S detected the error(s), it is in at least one file the O/S wanted to read.

Author

Commented:
Good explanation dlethe.  Thank you.
My thought now, since there is possibly some file damage, if I remove all files under the affected partition and recover that data from a backup tape from before the problem began, will that data only be written to good sectors?
DavidPresident
Top Expert 2010

Commented:
No need to do that.  The O/S will help you find the file because it can't read it.
So write a script that just enumerates all files from the mount point and reads each one into the bit bucket (/dev/null).  When you get an error, you know the file name.
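
Such a script could look roughly like this. It is only a sketch: find_unreadable is a made-up name, and it reports any file that fails to read, including ones that fail for permission reasons, so run it as root and compare against the kernel log.

```sh
#!/bin/sh
# Read every regular file under a mount point into /dev/null and print
# the names of the ones that fail to read. -xdev keeps find on the one
# file system, so /proc, /sys and other mounts are skipped.
find_unreadable() {
    find "$1" -xdev -type f -exec sh -c '
        for f; do
            cat -- "$f" > /dev/null 2>&1 || printf "%s\n" "$f"
        done
    ' sh {} +
}
```

Run it once per mounted file system, for example find_unreadable / > /root/unreadable.txt, and check /var/log/messages for new I/O errors afterwards.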

Once you know the files, then you re-write them in-place.  If you just delete the files then the unreadable blocks stay there, because deleting a file is deleting an inode.

Then do dd if=/dev/sda of=/dev/null and see if you got them all.  There may still be some bad blocks, but if they are in unused space, they will clear themselves up when data is created to write to those blocks.

Author

Commented:
I ran some simple commands that should have hit every file.
Ran: du -a > $HOME/filename and then searched the filename and found nothing.
Ran: grep -Ri properties * and pulled too many files but saw nothing.
Ran: grep -Ri myname which displayed nothing.

DavidPresident
Top Expert 2010

Commented:
You need to do ALL files starting in / file system.
You probably looked at less than 5% of them based on what you just supplied.

Author

Commented:
I ran the commands from the root directory and, again,  got nothing.
Have not done anything to system yet, however, have not seen the messages for 5 days as of this time.
DavidPresident
Top Expert 2010

Commented:
Then the bad areas are (thankfully) in areas that are not part of any of your disk files. They could be part of transient files, like something that was once in /tmp, or swap area.

If it is in swap, then the problem will take care of itself eventually, because once the O/S writes to that area, the error clears.  There is no easy way to correct this data, but it should repair itself over time.  (You could actually use dd if=/dev/zero of=/dev/sda, pass it the block number that is bad, and put zeros there.)
The upside if you do that is that the error goes away forever. You aren't losing data, because that data was already lost.

The difference is that at least now the computer KNOWS the area is bad, so on the outside chance that the area is part of something unused, like part of a boot block or swap space, it will recover properly and move on.  The bad news is that the recovery might be important at some time.

So really, just ignore the messages. You can do more harm than good unless you hook something into the kernel to tell you which application and which I/O is reading such blocks.
