BackupExec 8.x / NT4 Tape Failures

I have been struggling with this for quite some time now.  We are experiencing an extremely high number of tape/backup failures on our servers, and have not been able to track down the source of the problem.  We are currently running:

Proliant DL380 servers
Compaq AIT 100 drives using Sony SDX3-100C tapes
Win NT Server
Veritas Backup Exec 7.3 and 8.2(?) (don't feel like running back to the farm at the moment..)  :)

Event Viewer logs show numerous ID 7s and 11s, mainly ID 11: 'The driver detected a controller error on .....'

Backups usually hang the server until we yank the drive, or do a hard-reboot.

We have good availability of drives, and have been able to switch them out.  However, we did find that on one of our drives that was residing on one of the worst servers (backup wise) the eject button was stuck depressed.  (might cause long-term damage to drive/tapes maybe?)

We have leaned more to a batch of bad tapes, due to the fact that we can get a good backup with different tapes.  However, the tapes that are reported good on one server, are reported bad on another and it is very unpredictable as to what tapes are good where.  All Veritas documentation that I have found points to bad media, and bad headers on the tapes.  We have to lean away from SCSI controller errors, due to the fact that this is happening on more than one server and is not an isolated problem.

Any ideas?
PassMark BurnInTest Pro has a tape drive test built into it.

It might not solve your problem but may help you diagnose the issue and do some testing to determine if the problem is software, the drive or the tapes.

Duncan Meyers Commented:
Check that SCSI termination is correct first of all.

If you are using an Adaptec SCSI controller, ensure that SCSI termination is set to Auto or On/Wide.

Check that you either have a terminator pack on the end of the SCSI cable or that termination is set correctly on the drive (termination is on and term power from the device). I'm having no end of trouble with the HP website at the moment, otherwise I'd be able to give you more specifics for the tape drive itself.

Are these the internal hotswap 50/100 AITs or external ones? If internal, is the backplane simplex or duplex? Does HP Library & Tape Tools show drive errors? (yes, it sounds like bad media or worn heads)
derenm (Author) Commented:
They are internal AIT100 drives.  We have been able to replace the drives, and I have verified that termination is on.  (Termination -can- be set on the drive itself..)  I have rarely heard of so many tapes going bad so quickly, though.  I am taking a bunch of them out of rotation as we speak just to make sure.
Internal in a DL380, so they replace the top left hand disk with the tape and it shares the disk backplane. The backplane is terminated already, as is the Smart 5i (or whichever other card you are using), so termination must be OFF on the tape drive. Didn't realise it could be set on the hotswap tapes without taking them out of the carrier first.
Even with the non-hotplug internal HP/Compaq drives you never turn termination on on the drive itself; there's a terminated cable hidden away inside the chassis, wrapped up and not connected to anything.

Can you describe your servers better? It should say G2, G3 or G4 under POST, and you do not mention whether you have additional controllers or external storage. You should be using duplex rather than the default simplex if possible.
Huh? Why do you need third party tools when HP provide LT&T and you already know you have terminated one end of the bus twice?
I am experiencing the same sort of problem on a HP Proliant DL380 G3 on SBS 2003.

Symptoms are:

Veritas Backup Exec V9.1 Build 4691 hangs on loading media and the tape becomes un-ejectable (either with software eject or by pressing the eject button on the drive). When this happens, the only way to get the tape out of the drive and run a successful backup is to shut the server hardware down and power back up. This normally allows about 2 days of successful backups before the problem comes back.

HP StorageWorks Library & Tape Tools is unable to generate a support ticket, the error produced is:

The diagnostic function encountered a failure while generating the support ticket.

and 2 identical errors are recorded in the system event log:

Event ID: 11
Source: cpqcissm
Description: The driver detected a controller error on \Device\Scsi\cpqcissm1
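
When comparing servers, it can help to count how often each source/ID pair shows up rather than eyeballing the log. The following is only a minimal sketch in Python: it assumes the events have been copied into plain text in the same "Event ID: / Source: / Description:" layout as above, and the function name `tally_events` and the `sample` text are made up for illustration.

```python
from collections import Counter


def tally_events(text):
    """Count (source, event_id) pairs in text made of 'Key: value' lines."""
    counts = Counter()
    event_id = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without a 'Key: value' shape
        key = key.strip().lower()
        value = value.strip()
        if key == "event id":
            event_id = value
        elif key == "source" and event_id is not None:
            counts[(value, event_id)] += 1
            event_id = None  # wait for the next 'Event ID:' line
    return counts


# Illustrative sample only, modelled on the two identical entries above.
sample = """\
Event ID: 11
Source: cpqcissm
Description: The driver detected a controller error on \\Device\\Scsi\\cpqcissm1

Event ID: 11
Source: cpqcissm
Description: The driver detected a controller error on \\Device\\Scsi\\cpqcissm1
"""

counts = tally_events(sample)
for (source, event_id), n in counts.most_common():
    print(f"{source} event {event_id}: {n}")
```

Running the same tally on each server's log gives a quick side-by-side of which controllers are throwing ID 11s and how often.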

I have tried (but neither has made any difference):
Installing SP1 for Backup Exec 9.1.
When the problem occurs, stopping all Backup Exec services and ejecting the tape.

I have also, on advice from HP tech support, downloaded and installed the most recent Proliant support pack for Windows Server 2003 so all device drivers and management software are up to date. This does seem to have improved the situation.

I think the Smart Array 5i controller firmware is up to date. It's V2.58, and HP StorageWorks L&TT is V3.5 SR1.

The drive has already been replaced by HP and it's in the bottom left hand hot plug slot in the front of the server + the server has never been opened up from new so all hardware configuration is still at factory settings.

Any suggestions would be greatly appreciated.
derenm (Author) Commented:
Finally have been able to solve the problems!

Here is the order of what I finally had to do to solve the issues.  In some instances, the load on the affected server HAS to be taken into account, but I will go more in depth later.

1.  Upgraded from BE 8.x (and in some rare cases, 7.x) to 9.1 SP1  (the SP update is a -must-!)
2.  Downloaded the newest Proliant Support Pack from HP, which did a barrage of driver updates.  *Keep in mind that if a Windows service pack or patch is installed, it is recommended that the Support Pack is reapplied, as per HP.
3.  Upgraded firmware on the SA 5i controller.  (This in itself should eliminate about 50% of the SCSI bus slowdowns; it was a marked issue from HP.)
4.  Removed the Windows driver and ONLY used the Veritas driver.  (In Windows NT, go to the Control Panel and completely remove the NT tape driver.  The Device & Media service will do all the work from there.  If not, re-run the device wizard under BE and tell BE to ONLY use the Veritas drivers.)
5.  GO THROUGH YOUR MEDIA LIBRARY WITH A FINE TOOTH COMB!!!!!  Chances are, you do have a lot of bad media.  If you have a test server, go through each tape, re-label, erase, and do a test job to try to get the job to produce these errors.  If you have bad media, you are sure to see these problems again.
6.  Also upgrade the firmware on your tape drives.  This was part of our issue as well.
7.  Establish a good system for troubleshooting.  Try your best to determine what hardware is good, what tapes are good and where your problem SCSI controllers are (if applicable.)  THIS IS A MULTI-THREADED PROBLEM!  Organization is key to solving these issues one at a time.
8.  To help diagnose bad tapes, establish a good system to implement the Media management system used by Veritas.  This will help you keep track of old tapes, new tapes and other media issues.
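
For the tracking described in steps 5, 7 and 8, even a simple tally of which tape failed on which server can help separate bad media from bad drives. The sketch below is a hypothetical illustration in Python, not a Backup Exec feature: the server names, tape labels, results and the 50% threshold are all invented, and the log entries would have to be recorded by hand after each test job.

```python
from collections import defaultdict


def failure_rates(results):
    """Return (per-tape, per-server) failure rates.

    results: iterable of (server, tape_label, succeeded) tuples.
    """
    by_tape = defaultdict(lambda: [0, 0])    # label -> [failures, runs]
    by_server = defaultdict(lambda: [0, 0])  # server -> [failures, runs]
    for server, tape, ok in results:
        for table, key in ((by_tape, tape), (by_server, server)):
            table[key][1] += 1
            if not ok:
                table[key][0] += 1

    def to_rates(table):
        return {key: fails / runs for key, (fails, runs) in table.items()}

    return to_rates(by_tape), to_rates(by_server)


# Hypothetical test-job log: (server, tape label, job succeeded?)
log = [
    ("SRV1", "T001", True),
    ("SRV1", "T002", False),
    ("SRV2", "T001", True),
    ("SRV2", "T002", False),
    ("SRV2", "T003", True),
    ("SRV1", "T003", True),
]

tape_rates, server_rates = failure_rates(log)
# Arbitrary cut-off for illustration: flag tapes failing half their runs or more.
suspects = sorted(t for t, rate in tape_rates.items() if rate >= 0.5)
print("suspect tapes:", suspects)
print("server failure rates:", server_rates)
```

A tape that fails everywhere points at the media; a server that fails with otherwise good tapes points at the drive, controller or bus on that box.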

Change the SCSI bus from Simplex to Duplex.  On our Exchange server in particular, the load was just too much for the server to handle regular backups and typical use.  One or the other, but not both.  To solve this issue, we slid the whole drive array (4 drives, RAID 5) down two slots so the drive array was now living on positions 3, 4, 5 & 6 on its OWN SCSI port, and the tape drive was on its own as well.   Keep in mind that this does require a proprietary SCSI terminator from HP.

I am highly stressing organization for this problem.  Initially, I was under a lot of pressure to get these issues fixed and was moving too fast to really try to get this out of the way.  Each server of ours was presenting unique issues with backups; however, all the symptoms were the same.

I am stressing again that there is an issue with the media.  It doesn't make sense, but this is the majority of your problem.   On servers with heavy load this is more prominent for some reason.  Also, in one case of ours, a bad drive was causing our tapes to go bad.   I don't really know the details, but I am assuming that some type of header on the tape was getting overwritten, preventing the tape from being read anywhere else.  Case in point, it was yet another source of our problems.

BTW, the un-ejectable tape issues can be a pain.   Sometimes the BE services will hang when you try to stop them (then again, I am in an NT environment!)  So the quickest way is just to pull the drive, then stop the services.  Reinsert and restart.


Thanks for the advice Matt. The problems with Veritas loading media and un-ejectable tapes actually seem to have gone away (for now) since installing the up to date Proliant support pack, and I have closed the call with HP tech support. The only issue that remains is with HP SW L&TT generating errors when I attempt to create a support ticket, although this isn't affecting anything else as far as I can see. I will however be keeping your suggestions for use if the problem comes back.

derenm (Author) Commented:
Hmm...  maybe I went a little deep on that, eh?  :)

Out of curiosity, were you getting event 9's with the ID 11's?

No Event 9's!
