This is a long and tough one. If YOU provide the answer that gives me three consecutive full backups that are all successful, I'll award you the points and post a second question worth another 500 points for you only.
August 19, 2002: The story so far
Operating Systems: Windows 2000 SP2 (have not tried SP3 - will try shortly), multiple installations.
Software - 4 installations of Veritas 8.6 build 3878 with drivers from late May; 1 installation using the built-in Windows Backup utility.
Drivers - as noted above, Veritas tape drivers from late May were used for all attempts with Veritas. For the built-in Windows backup, Quantum drivers downloaded on approximately Aug. 15, 2002 were used. For the Overland Data units, drivers downloaded on approximately July 15, 2002 were used.
Overland Data LXN 2000 (unit 1, 2 drive heads)
Overland Data LXN 2000 (unit 2, 2 drive heads)
Dell PowerVault 136T SDLT (1 drive head)
Dell PowerVault 120T DLT1 (1 drive head)
Adaptec 39160 Dell OEM (unit 1)
Adaptec 39160 Dell OEM (unit 2)
Adaptec 29160 OEM?
Onboard Controller AIC 7899 (on Dell 2550)*
SCSI cable 1 (mini68 - VHDCI)
SCSI cable 2 (mini68 - VHDCI)
SCSI cable 3 (VHDCI - VHDCI)
VHDCI Terminator 1 (included with Overland Data, unit 1)
VHDCI Terminator 2 (included with Overland Data, unit 1)
VHDCI Terminator 3 (included with Overland Data, unit 2)
Computer 1 - Dell PowerEdge 2550
Computer 2 - Dell PowerEdge 350
Computer 3 - Generic, home built PC - 1GHz/512, used 29160
Backup size - currently 830 GB as one job. As two jobs: Job 1 (our cluster) is 540 GB and Job 2 (everything not on the cluster) is 290 GB.
Tapes - no tape is older than 6 months. Of the 60 or so tapes tried, at least 40 have been brand new, either branded Quantum or Maxell.
Based on the number of errors claiming the tape has a bad block, I'm seeing a tape failure rate of 20% +/- 5%. This comes after an initial run back in Feb/Mar in which 30-40 tapes were 100% functional.
I have tried various combinations of the equipment above, and in every instance a problem has occurred. 90% of the time the problem has been an Event ID 9 and/or a "bad block" error message and/or a SCSI I/O Bus Timed Out. The other 10% of errors have been more severe - the tape drive consistently crapping out after 1 GB of data backed up and/or inventories hanging, seemingly at random.
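To put that rate in terms of actual tapes, here is a rough sketch of the arithmetic. It assumes the "60 or so" tapes tried is exactly 60, which is only an approximation, and the function names are purely illustrative.

```python
def failure_rate(failed: int, tried: int) -> float:
    """Return the fraction of tried tapes that reported a bad block."""
    if tried <= 0:
        raise ValueError("tried must be positive")
    return failed / tried


def failures_for_rate(rate: float, tried: int) -> int:
    """Return the number of failed tapes implied by a given failure rate."""
    return round(rate * tried)


# Assumption: "60 or so" tapes is treated as exactly 60.
TAPES_TRIED = 60

# 20% +/- 5% gives a low, middle, and high estimate of failed tapes.
low = failures_for_rate(0.15, TAPES_TRIED)
mid = failures_for_rate(0.20, TAPES_TRIED)
high = failures_for_rate(0.25, TAPES_TRIED)

print(f"Failed tapes: about {mid} (somewhere between {low} and {high})")
```

Under that assumption, the rate works out to somewhere between 9 and 15 bad tapes out of 60, which is hard to square with a prior run of 30-40 tapes with no failures at all.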
I have contacted support at both Overland Data and Veritas. Both seem quite stumped.
What has worked?
The 120T has consistently worked fine.
The 136T worked flawlessly for the first 6 weeks we had it (dating back to February). Since then, it has completed perhaps 2 full backups, and Dell has replaced several parts several times. The 136T is no longer connected and will NOT be reconnected, as it is being returned to Dell due to its miserable performance. I do now question the source of those performance problems, but regardless, it's going back.
The Overland Data Unit 1 has NOT worked well. The unit is still on site, but not easily reconnected. It does appear one of the tape drives originally shipped with the unit was bad. When that drive was disconnected, the unit ceased having issues with inventories hanging and backups failing at 1 GB.
The Overland Data Unit 2 is the replacement for unit 1. It is still present and is the one we are currently trying to get fully functioning. It too appeared to have a bad tape drive. After we "mixed and matched" drives from both units, we now have a unit that performs sporadically, as described below.
Some of the things we've done:
Swapped cables from the Overland Data units and the 120T. 120T worked fine regardless of which cable was used.
As I believe I mentioned before, we have tried a variety of combinations of the equipment listed above.
What's happening now:
When backups are attempted using Computer 1, Veritas, and any of the above SCSI controllers, terminators, and cables, full backups run for between 100 and 150 GB, at which point an error occurs, usually claiming that a bad block was encountered, and the backup job ends abruptly.
The Windows-supplied backup utility completed one backup (non-cluster) without incident DURING the backup, although the log indicated some issues with the SCSI bus right before the actual data backup began. The cluster backup then began and ended within 7 GB, claiming a bad block was encountered.
Logs and links are available at http://data.cshl.org/argh
I'm sure there's more, but I've grown weary of writing. I was only half kidding when I suggested to my boss that we offer a reward for any help that leads us to the solution. There wasn't a flat-out no, but don't count on it.
*Connected a mini-68 directly to internal port