How do I resolve "scsi0: ERROR on channel 0, id 0, lun 0, CDB: " error messages?

Running VMware 3.5 on Dell 2950s with an EMC CX320 SAN.
Created a virtual server with Red Hat Enterprise Linux 4 as the guest O/S, set up as an Oracle Application Server.
Four drives went bad on the SAN and we replaced them.
After that we moved the entire system to a new location across town.
Now this virtual server shows the following every day.

Logwatch report contains:

      ---------------------- pam_unix End -------------------------

      Jan 26 05:49:43 end_request: I/O error, dev sda, sector 39392994
      ……….
      ……….      {Total 136 lines with same message just a different sector number}
      ……….
      Jan 26 10:23:15 end_request: I/O error, dev sda, sector 3939300

      --------------------- sendmail Begin ------------------------

/var/log/messages contains:

      Jan 26 05:49:43 apputil kernel: scsi0: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 02 59 16 e2 00 00 80 00
      Jan 26 05:49:43 apputil kernel: Current sda: sense key Medium Error
      Jan 26 05:49:43 apputil kernel: Additional sense: Unrecovered read error
      Jan 26 05:49:43 apputil kernel: end_request: I/O error, dev sda, sector 39392994
            {These lines repeat many times with different ‘Read’ and ‘sector’ numbers}
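
For anyone matching the CDB line to the sector number: a minimal sketch, assuming the kernel prints the nine bytes that follow the Read (10) opcode (flags, 4-byte LBA, group, 2-byte transfer length, control). The function name decode_read10 is made up for illustration. With the bytes from the log above, the decoded start sector is the same number end_request reports.

```shell
# Decode the bytes printed after "CDB: Read (10)" in /var/log/messages.
# Assumed layout of the nine bytes: flags, LBA (4 bytes, big-endian),
# group number, transfer length (2 bytes, big-endian), control.
decode_read10() {
    # $1..$9 are the nine hex bytes copied from the log line
    local lba=$(( 0x$2$3$4$5 ))
    local count=$(( 0x$7$8 ))
    echo "start sector: $lba"
    echo "sectors requested: $count"
}

decode_read10 00 02 59 16 e2 00 00 80 00
# start sector: 39392994
# sectors requested: 128
```

So each logged read asked for 128 sectors starting at the sector shown in the I/O error line.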

Ran: badblocks -sv /dev/sda#
I ran the badblocks -sv command on every partition, and all but sda5 came back with 0 bad blocks found.
/dev/sda5 came back with:      Pass completed, 68 bad blocks found.

Are these bad blocks the cause of the Logwatch messages?
How do I resolve all the above?

jmluc123

Check firmware and patches for VMware. Run fsck or e2fsck. If the drive is failing, replace it.

David

P.S. I should have said that files are corrupted, not necessarily destroyed. The bad block could be in the middle of a 1GB database file, which means the database is corrupt, or it could just as easily be a scratch file in /tmp that is long gone.

The error(s) could also be in unused space, but with 68 bad blocks, and the fact that the O/S detected the error(s), at least some of them are in a file the O/S wanted to read.

slcoit

ASKER

Good explanation, dlethe. Thank you.
Since some files may be damaged: if I remove all files under the affected partition and restore them from a backup tape made before the problem began, will that data be written only to good sectors?

David

No need to do that. The O/S will help you find the file because it can't read it.
So write a script that enumerates every file from the mount point down and reads each one into the bit bucket (/dev/null). When a read fails with an error, you know the file name.
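
A minimal sketch of such a script, assuming a standard find and sh; the function name scan_unreadable and the UNREADABLE: prefix are made up for illustration. It reads every regular file on one filesystem and prints the name of any file that cannot be read to the end.

```shell
# Read every regular file under a mount point and report the ones that fail.
# -xdev keeps find on a single filesystem, so /proc, /sys and other mounts
# below the starting point are skipped.
scan_unreadable() {
    find "$1" -xdev -type f -exec sh -c '
        for f in "$@"; do
            # Send the contents to /dev/null; only the exit status matters.
            cat -- "$f" > /dev/null 2>&1 || printf "UNREADABLE: %s\n" "$f"
        done
    ' sh {} +
}

# Example (run as root so file permissions do not cause false reports):
# scan_unreadable / > /root/unreadable-files.txt
```

Because of -xdev, run it once per mounted filesystem, for example on the mount point of the partition where badblocks reported errors. Reads served from the page cache never touch the disk, so a clean pass right after heavy use of the same files may be less conclusive than one after a reboot.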

Once you know the files, rewrite them in place. If you just delete the files, the unreadable blocks stay there, because deleting a file only releases its inode; nothing is written to the bad blocks.

Then do dd if=/dev/sda of=/dev/null and see if you got them all. There may still be some bad blocks, but if they are in unused space, they will clear up once new data is written to those blocks.

slcoit

ASKER

I ran some simple commands that should have hit every file.
Ran: du -a > $HOME/filename and then searched that output file and found nothing.
Ran: grep -Ri properties * which pulled too many files, but I saw no errors.
Ran: grep -Ri myname which displayed nothing.

David

You need to do ALL files, starting at the / file system.
You probably looked at less than 5% of them based on what you just supplied.

slcoit

ASKER

I ran the commands from the root directory and, again, got nothing.
I have not changed anything on the system yet, but I have not seen the messages for 5 days now.

David

Then the bad areas are (thankfully) in areas that are not part of any of your disk files. They could be part of transient files, like something that was once in /tmp, or swap area.

If it is in swap, then the problem will take care of itself eventually, because once the O/S writes to that area, the write clears the error. There is no easy way to correct this data, but it should repair itself over time. (You could actually use dd if=/dev/zero of=/dev/sda, pass it the number of the bad block, and put zeros there.)
The upside if you do that is that the error goes away for good. You aren't losing data, because that data was already lost.
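
A hedged sketch of that dd invocation, assuming the sector numbers in the end_request lines are 512-byte units counted from the start of sda. The function name zero_sector is made up for illustration. Try it on a scratch image file first; on a real device it overwrites whatever is stored in that sector.

```shell
# Overwrite one 512-byte sector with zeros.
# $1 = target (a block device, or an image file for a dry run)
# $2 = sector number in 512-byte units
# conv=notrunc stops dd from truncating a regular file after the write.
zero_sector() {
    dd if=/dev/zero of="$1" bs=512 seek="$2" count=1 conv=notrunc 2>/dev/null
}

# Dry run on a scratch file:
# zero_sector /tmp/test.img 2
# On the real disk (destructive, double-check the sector number first):
# zero_sector /dev/sda 39392994
```

One caveat: a buffered 512-byte write may make the kernel read the surrounding page first, which can hit the same bad sector again, so the write itself may report an I/O error. Zeroing a larger aligned range is one way around that, at the cost of overwriting the neighbouring sectors too.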

The difference is that at least now the computer KNOWS the area is bad, so on the outside chance that the area is part of an unused region, like part of a boot block or swap space, it will recover properly and move on. The bad news is that the recovery might matter at some point.

So really, just ignore the messages. You can do more harm than good unless you hook something into the kernel to tell you which application and which I/O is reading those blocks.
