DayHelper

BrightStor ARCserve Backup - logical volumes on 2000 server array becoming 'un-backup-able'.

I have a bias toward what I think is relevant or not, but I'll try to be thorough with my information.

Backup Server:  Windows 2003 w/BrightStor ARCserve Backup version 11.5 SP2 (newest)

Remote System:  Windows 2000 server with a SCSI disk array and many logical volumes.

Explanation:  The backup server handles about 35 server backups.  We have a 2000 Exchange cluster, many Windows 2003 boxes of various varieties, and a few Unix boxes........all using proper 'agents' from BrightStor.  Over the course of 4 months we've gotten all software and agents up to date, got all the servers using open file agents and tackled all those 'backup job' errors that have lingered for years:

                  EXCEPT ONE!  - The server mentioned above (call it "remote") is a Windows 2000 server, as I mentioned, running the Windows and Open File agents.  We have other 2000 servers also being backed up without error.  I think the "remote" box is a 'Dell PowerVault 775N?' and is attached to a SCSI disk array.  Logical volumes are like F:, G:, H:, I:, h:, O:, P:,  - you get the idea - about 12 or so.

Problem:  Over the years, and through different versions of agents.......certain volumes have "fallen out of favor" with the BrightStor software and the technician set up an "XCOPY" routine to move those volumes to another server to take the place of the normal tape backup job.  Also those volumes were removed from the job itself.  I'll explain why.

              "remote" is our data server for the entire company.  During the backup, or the night after, in past years the technician would get a call that nobody could access data and the 'remote' server was locked up.  Meaning.......no screen.......no nothing.......and only a power down would fix it.  Whatever volume (H:, O:, Y:, etc) was in the log file on the backup server when the thing locked was removed from the backup job and the job completed the next night.  For example, over the years (N:, O:, Q:, U:, Y:) have been removed from the backup job and are being backed up using an XCOPY cmd file that runs daily.  Of course, this also means that data is not on our normal tapes, which is a poor setup.

Previous theories:  Previous technicians have assumed it was a problem with MAC files due to complex, unorthodox naming conventions and another explanation I don't understand regarding a "short name" the MAC files have attached.  Anyway, the only "common" factor the tech could find was that those volumes had MAC files on them.

   ****  I was the first one to contact CA tech support this week.  They claim "no unusual file naming conventions or issues with MAC files exist that would cause the backup of any volume to crash the remote server".  This week another volume (I:) went south on us, and started killing the backup.  Sure enough, once I removed the volume (I:) from the backup job, the job worked fine.

    C.A. said it cannot be file names or MAC related and gave a few ideas:

1)  Try running without the Open File Agent - **Did this but I: volume still locked 'remote' server and caused reboot.
2)  Try running without Symantec services running on backup server & remote server. - ** Same effect, 'remote' still crashed.

   Neither of these makes sense to me, because why would a 'service' be the issue?  ALL the other volumes are backed up successfully on the same server and this particular volume is a good example of one that just suddenly stopped working.

   I did turn ON detail logging on the test jobs for I: and found it stopped at a folder that contains an 'image' (Norton ghost) file, but CA claims there are no issues with ANY file types.

   ANY experience with this??   Any ideas??

TIA - DH
DayHelper

ASKER

Captains log:

  About 1:30am this morning, running my 3rd test backup on "I:" volume on the remote server, I left out one of the folders on "I:" (changed the 'day after' my backup failed) that was one of the last things in the last log before the server locked up.  The folder is called "image" and is 46.2gb.  It has two other folders in it, one is the current image of an important PC (28.9gb) and the other folder is a previous image of that same PC (17.2gb).

   CA says 'size' of folder/file should not be an issue.  Could it be Windows 2000 trying to deal with the files AS the backup runs??
Calling CA again.  Could this be an issue with the Windows 2000 server itself dealing with such large files with the agent?

  Under the folder those files are only 2.1gb each (.GHS files).
OFA is a natural because it has a file system driver and so has the potential for crashing a system.

In my time I have three times seen cases where backing up a specific file caused a system to crash. So it "is" possible that it is the data itself causing the problem. Oh, and in each case the data was perfectly normal in every other way and worked just fine in the original application.

The volume size is certainly not an issue; I know people that have volumes in the 12 terabyte range.
A Ghost image is also not an issue, because I have experience backing up Ghost images as well.
So for now let's forget about this being a general limitation or problem.

Instead let's consider that there is something about this specific data. As a test, "move" the data to a different volume, then run the utils on it to check and defrag it. Then move the data back and run another test backup.
Dovid-

  Last night I did more testing and agree with you that the 'size' is less suspect.  I did searches on the previous 5 volumes that have been removed due to locking up the backup.  None of them, but one......had files over 1gb.  Also, one of the volumes NOT giving issues (G:) is, guess what?, our "image" volume and has tons and tons of those 2.1gb ghost files which are not causing problems.
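For anyone repeating the file-size search described above, here is a minimal Python sketch of one way to do it. It is an illustration only, not what was actually run on 'remote'; the function name `find_large_files` and the 1 GB default threshold are assumptions chosen for this example.

```python
import os

def find_large_files(root, min_bytes=1024**3):
    """Yield (path, size) for every file under root of at least min_bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                # Skip files that vanish or cannot be read during the scan.
                continue
            if size >= min_bytes:
                yield path, size
```

Calling `list(find_large_files("I:\\"))` would list every file of 1 GB or more on the I: volume; running it once per removed volume gives a quick comparison of which volumes hold large files at all.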

  From my post above, you see that I removed the "image" folder from 'I:' volume and forgot to mention the backup then was successful on I: volume.
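The "leave one folder out and retest" approach can be generalized into a halving search when the log does not point at a single folder. The sketch below is hypothetical: `backup_succeeds` stands in for "run a test job over just these folders and report whether it finished", which in practice is a manual step (and a failed test here means a locked server and a reboot). It also assumes exactly one folder is at fault.

```python
def find_failing_folder(folders, backup_succeeds):
    """Return the single folder whose presence makes the test backup fail.

    Assumes exactly one folder in `folders` is at fault, and that
    backup_succeeds(subset) returns True when a test job limited to
    `subset` completes without locking the server.
    """
    candidates = list(folders)
    while len(candidates) > 1:
        middle = len(candidates) // 2
        first_half = candidates[:middle]
        if backup_succeeds(first_half):
            # First half is clean, so the culprit is in the second half.
            candidates = candidates[middle:]
        else:
            candidates = first_half
    return candidates[0] if candidates else None
```

With about a dozen top-level folders this needs roughly four test jobs instead of twelve, which matters when each failure costs a power cycle.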

  To prove this was actually not a fluke, I tried to back up just the "image" folder under I: volume last night, and the server locked up again.  So, now I must seriously consider a 'data' problem as you have mentioned.  The previous technician mentioned trying to "unmount volumes and run chkdsk" in previous years to try and fix the issue, but this may not be what you are talking about.  Can you give me a more detailed idea of how you would 'check' this data?

  Please ask if I can give you any information to help you understand my environment better.

  Many thanks,
   DH
In the previous cases where this came up, all the normal testing and diagnostics did nothing to correct the problem. Also, the only sign of a problem was when backing up this specific data. In the end, only moving the data to a different volume (and that is moving, not copying), running the ck and defrag, and then moving the data back got it working.
Are you saying to run 'ck and defrag' on the suspect data's volume, while it is moved elsewhere?......or are you saying to move the suspect data to another volume and actually work on the data itself?

  DH
ASKER CERTIFIED SOLUTION
dovidmichel
Dovid.....
   thanks.  One more question.  Only one folder is failing the backup.  Would you recommend just moving THAT folder off the volume before running defrag.....or ALL data off the volume??

  DH
Just that folder.
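Putting the thread's advice in order (move only the failing folder off the volume, check and defrag the volume, move the folder back, then rerun the test backup), a rough Python sketch of the sequence might look like this. Everything here is an assumption for illustration: the scratch location `T:\scratch` is hypothetical, `chkdsk I: /f` normally needs the volume locked or dismounted, and the `defrag` command line is a placeholder, since as far as I know Windows 2000 only ships the Disk Defragmenter GUI and the command-line tool came with later Windows versions. The plan is only printed unless `dry_run=False` is passed.

```python
import ntpath
import shutil
import subprocess

def build_plan(folder, scratch_dir):
    """Return the ordered steps for parking one folder while its volume is serviced."""
    drive, _ = ntpath.splitdrive(folder)
    parked = ntpath.join(scratch_dir, ntpath.basename(folder.rstrip("\\")))
    return [
        ("move", folder, parked),          # a real move, not a copy
        ("run", ["chkdsk", drive, "/f"]),  # check the source volume
        ("run", ["defrag", drive]),        # placeholder; GUI tool on Windows 2000
        ("move", parked, folder),          # put the data back
    ]

def execute(plan, dry_run=True):
    """Print each step; perform it only when dry_run is False."""
    for step in plan:
        print(step)
        if dry_run:
            continue
        if step[0] == "move":
            shutil.move(step[1], step[2])
        else:
            subprocess.run(step[1], check=True)
```

For example, `execute(build_plan("I:\\image", "T:\\scratch"))` prints the four steps for the "image" folder without touching anything, which is a safe way to review the order before doing the work by hand.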
Dovid.....

      Thanks for your help.  Wouldn't you know the very weekend I plan to invest time to troubleshoot a 'remote' server that is having problems related to backup, my 'backup server' fails.  (AMLI: ACPI BIOS is attempting to write to an illegal IO port address (0x70), which lies in the 0x70 - 0x71)  Wonderful huh?  Anyway, twice the parenthesized error occurred and restarted the server, which trashed my full backups.  This is in no way related to the other problem, but will delay my troubleshooting.

     I asked for 'experience and ideas' and got exactly that.  You have helped me to the next steps and resolution at this point is a process of elimination.  I'm going to accept your answer at this time and award points.

    Thanks again.

    DH
Thanks. If you should happen to need more help on this, open another question with 0 points that points back to this one and we will continue.
Very kind.  Thanks.