joshvazquez

DPM server loses connection to Drobo FS during backup

I have a virtualized environment with a Windows Server 2008 R2 Enterprise (Core) physical server running Hyper-V, containing a Windows Server 2008 R2 Standard virtual machine on which I have installed Microsoft Data Protection Manager 2010. It backs up other VMs to a physical RAID-1 array that is implemented using hardware RAID and made available to the DPM server via Hyper-V's pass-through-disks feature. "Tape" backups are done via Cristalink Firestreamer, which provides a virtual tape library to DPM. Data gets backed up to virtual tape files that reside on the Drobo FS.  
 
I use two sets of two WD RE4 enterprise drives in the Drobo, which mirrors each pair in its proprietary equivalent of RAID-1. Every week, the Drobo is shut down, and the current disk set is moved to offsite storage. The previous week's disk set is retrieved from offsite storage and used for the next full backup (weekly, alternating between disk sets).  
 
This setup worked fine for almost 2 months before backups started failing every week because the DPM server lost the connection to the Drobo some hours into the backup. The next day, I would have to retry the backup (good thing DPM lets you resume) multiple times to get it to complete. If I try to back up to virtual tapes using another VM host as the target instead of the Drobo, the backups always succeed. The messages I get from FsHelperSvc (Firestreamer, event 30012) in the Windows Application event log when it fails are:  
 
   10/17/2011 21:58:39 | Error | E002 | L1T002 | The media drive reported the following error: The specified network name is no longer available [C000020C] | file://\\<path_to_drobo>\<tape_name>.fsrm*12288 | <tape_barcode>
    10/17/2011 21:58:39 | Error | E012 | L1T002 | Unable to write data to the medium. | file://\\<path_to_drobo>\<tape_name>.fsrm*12288 | <tape_barcode>
 
And in the System log:

Source: clfs3mtp
Event ID: 7
Date and Time: 10/17/2011 9:58:39 PM
Description: The device, \Device\TapeDrive1, has a bad block.

"The specified network name is no longer available" is an error returned to Firestreamer by Windows when the computer is unable to access the network share, so Firestreamer is likely not the problem.  

Drobo Support has not been able to resolve the problem. They've had me update Drobo Dashboard, update the Drobo firmware, switch Ethernet cables, copy files to the Drobo's file share using Windows Explorer, check for a static IP, uninstall and reinstall Drobo Dashboard, and downgrade the Drobo firmware. None of this has helped. Drobo diagnostic files are encrypted, so there's no way for me to see what the Drobo experiences. Support has told me that its network connection is up at the time of the failure, but I still suspect that it's dropping the connection or that maybe there is file system corruption.  
 
I'll stress that I can access the Drobo from Windows Explorer on the server, so it's not like the server can't contact the Drobo at all under normal circumstances.

The most recent thing I tried was turning off disk spindown in the Drobo's settings. I also installed this hotfix from Microsoft.
Gerwin Jansen

Hello joshvazquez, nice setup you have. I like the part where you move the disk set offsite.

The event log messages you show include a tape-drive-related error as well. Do you know when these tape error messages started? What happens when streaming to the tape drive is running and you get this 'bad block' message? Does the backup process stop at that moment? The timestamps of all the messages are the same, so it's hard to tell what happens first.
joshvazquez

ASKER

Fortunately the System event log still has older events, so I see that the tape drive event started occurring at the same time backups started failing (7/16/2011). I can't tell if that event is the cause or the effect. Keep in mind that these are not physical tapes.

A few seconds after these events, I get an email alert from DPM telling me that the backup has failed ("DPM encountered a critical error while performing an I/O operation on the tape xxxx").

Possibilities:

a) Firestreamer detects a bad block on a virtual tape, possibly caused by file system corruption on the Drobo. It logs the "bad block" event. Something happens now to drop the connection, and the other two events (specified network name, unable to write data) are logged because the connection has been lost. The backup then fails.

b) Something causes a connection drop, and the top two events are logged. Firestreamer logs the "bad block" event because the connection has been lost while the virtual tapes are still mounted, so it has a problem trying to read or write to a tape that it thinks is still there (similar to pulling out an external hard drive without unmounting it). The backup then fails.
I agree that it is quite difficult to determine what happens first (cause and effect).

Is there a way to make sure the connection stays open? On a Windows 7 workstation that kept losing Samba connections, I created a simple script triggered by the Task Scheduler every 5 minutes. The script just performs a 'net view \\<smb server>\<smb share>', which kept the connection open. It worked perfectly for that workstation, so maybe you can give it a try.
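
For anyone who wants to try the same keep-alive idea, here is a minimal sketch of it in Python rather than a batch file. It is only an illustration: the function names `touch_share` and `keep_alive` are made up for this post, the target `\\drobo` is a placeholder, and it assumes the `net` command is on the PATH of a Windows machine. It records a timestamp for each attempt so that a failed `net view` can be lined up against the event log afterwards.

```python
import subprocess
import time
from datetime import datetime


def touch_share(target, runner=subprocess.run):
    """Run `net view <target>` once; return True if the command exited with 0."""
    completed = runner(["net", "view", target], capture_output=True, text=True)
    return completed.returncode == 0


def keep_alive(target, rounds, interval_seconds=300,
               runner=subprocess.run, sleep=time.sleep):
    """Touch the target `rounds` times, pausing `interval_seconds` between tries.

    Returns a list of (timestamp, ok) tuples, one per attempt.
    `runner` and `sleep` are injectable only so the loop can be tested
    without a real share or real waiting.
    """
    results = []
    for i in range(rounds):
        results.append((datetime.now(), touch_share(target, runner)))
        if i < rounds - 1:
            sleep(interval_seconds)
    return results


# Hypothetical usage: 12 attempts, 5 minutes apart (about one hour).
# for stamp, ok in keep_alive(r"\\drobo", rounds=12):
#     print(stamp.isoformat(), "ok" if ok else "FAILED")
```

The 300-second default mirrors the 5-minute schedule described above. If the script is started from the Task Scheduler instead, a single call to `touch_share` per run would be enough.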

About the date the backups started failing: what other things happened on that day? Were there any updates installed, hardware added/removed, anything other that you remember or can gather from the logs?
I can do "net view \\drobo", but not "net view \\drobo\share1". The latter gives me this:

System error 53 has occurred.
The network path was not found.

I'll try a script with the first command and see how it goes.

No updates were installed on the DPM server on or around that day. No hardware changes were made, as best as I can remember.

There are some possible solutions in this question, so I'll try those out as well.
Ok, let me know if you make any progress. Note that the "System error 53" is an error we experienced with the W7 / Samba combination as well. We had to upgrade our Samba server but there was some (client) registry setting as well. I can look this up for you tomorrow.
When I retry the backup, occasionally I'll see it fail while I'm working. Should I start monitoring something and wait for it to fail to see if I can get more information? For example: ping -t drobo, or some performance counters. This is a good opportunity because normally it will fail overnight while I'm not on.
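
A rough sketch of what such monitoring could look like, as an alternative to `ping -t` (which does not timestamp its output by default). It assumes the NAS serves its share over SMB on TCP port 445 and that the name `drobo` resolves; `smb_port_reachable` and `watch` are hypothetical helper names, not part of any tool mentioned in this thread. It only logs the moments when reachability changes, which is the part that matters when comparing against the event log.

```python
import socket
import time
from datetime import datetime


def smb_port_reachable(host, port=445, timeout=3.0,
                       connect=socket.create_connection):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        conn = connect((host, port), timeout)
    except OSError:  # covers refused connections, timeouts, and name lookup failures
        return False
    conn.close()
    return True


def watch(host, checks, interval_seconds=5,
          probe=smb_port_reachable, sleep=time.sleep, now=datetime.now):
    """Probe `checks` times and return only the state changes.

    Each entry is (timestamp, reachable). The first probe always
    produces an entry because there is no previous state yet.
    """
    transitions = []
    previous = None
    for i in range(checks):
        state = probe(host)
        if state != previous:
            transitions.append((now(), state))
            previous = state
        if i < checks - 1:
            sleep(interval_seconds)
    return transitions


# Hypothetical usage: probe every 5 seconds for one hour.
# for stamp, up in watch("drobo", checks=720):
#     print(stamp.isoformat(), "reachable" if up else "UNREACHABLE")
```

Note that a successful TCP connect only shows the port is answering. It says nothing about whether the existing SMB session used by the backup is still healthy, so treat it as one data point next to the event log.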
You can try ping -t - but this will not keep the connection open to your drobo. I used the "net view" command to do this as I noticed that Windows will disconnect 'unused' connections after 15 minutes. That's why I scheduled the script every 5 minutes. Maybe you can just open some document on your Drobo with Word but I don't know if you have it installed on your server.
Created a batch file to execute "net view \\drobo" and scheduled it to run every 5 minutes. Backup ran for a few hours before failing as before.
Too bad. I've been reading about the Drobo diagnostics logfiles and their encryption. Here's a procedure that should allow you to read them, just by uploading the logfiles to your desktop. Does that work for you?
That works for older Drobos, but unfortunately the newer Drobos encrypt the files. Haven't been successful decrypting them using a script.
Can you post a few lines of such a logfile (or complete if not too big) so we can have a look at it?
I'd rather not post it, as it may contain information about my network or servers that I don't want public. Sorry.
Ok, no problem, I can agree on that :) I'm thinking we are running out of options here with your Drobo device.

Are you able to do the backup to a different device for testing purposes? I mean another storage device like a NAS or a 2 TB USB device. That way you could determine whether the problem is in the Drobo device or on your server side.

If you don't have the problem with another device, I'd start complaining with Drobo.
Come to think of it, a few previous backups have failed the same way, with the same errors, on a D-Link DNS-323 NAS. The NAS had been giving us other problems that I can't quite remember, so I thought the failures had something to do with those. We still have it, so I'll try backing up to that. All attempts to back up to a different server have been successful though.

Could it be a bad switch port? I'll change that around also.
>> Could it be a bad switch port? I'll change that around also.

Well, I guess this is possible of course. If you're going that way, you may have to check more components. But changing a switch or port can be done relatively easily.
Different port did not give any improvement. Also disabled IPv6 on the DPM server as suggested in another thread, no improvement. I can't try backing up to the D-Link yet because I am retrying the regular backups until they complete, so that I can take the disks offsite. After all backups eventually complete I'll try the D-Link.

I'm going to try using Network Monitor on the DPM server while this backup is going to see if I can catch some packets at the time of failure and maybe see what's going on. The problem is that there are way too many packets (200,000+ in 5 minutes, and this is only packets to or from this server) and I don't know what to look for. Any suggestions on filtering the capture or a tool for analyzing the log?

Thanks.
If you're going to capture, I'd set a filter on error or retransmit packets only, what tool are you going to use, Wireshark?
I'm using Microsoft Network Monitor, but I can try Wireshark as well.
Now I can't even get it to fail while Network Monitor is on...
Backup ran for 8 hours straight yesterday (to Drobo) with Network Monitor capturing packets. I stopped NM so it wouldn't cause the system to crash (too many packets) and went home, and after 3.5 more hours the backup failed.

Today, it has run for almost 7 hours with NM on, and still no failure during capture. I have to cancel this now so I can move the disk pack offsite, and I'll try backing up to the D-Link NAS later. This is bizarre...
Strange, but maybe your NM is keeping your network card from 'falling asleep'. Can you check the power management settings on all NICs?
100% success backing up to D-Link NAS.

I don't see any power management settings on the DPM server's NIC. The Drobo's I can't configure.

Another idea is that excessive heat is affecting the Drobo or the drives, causing a loss of connection or some sort of read error. Too bad there's no way to check the temperature in Drobo Dashboard...
>> 100% success backing up to D-Link NAS.
This is great news :) This means no server or network issue but strongly points to the Drobo device IMHO.

>> Too bad there's no way to check the temperature in Drobo Dashboard...
You think it's running hot? You could set up a couple of fans beside it and test. From the Drobo support site:

Temperature: Every Drobo storage device has an internal temperature sensor. This controls operation of its cooling fan. In extreme cases the internal temperature may approach an unsafe level. If any of the drives reach their maximum operating temperature of 60°C, the Drobo device will shut itself down to protect the disks and its electronics.

This would mean that you don't have a temperature problem, you would notice when the Drobo shuts down, right?

I'm starting to think that your Drobo device has a few 'unwanted' features and that support is not helping you. I'd draw my conclusion at this point.
>> This would mean that you don't have a temperature problem, you would notice when the Drobo shuts down, right?

It could be shutting down and causing the loss of connectivity, but unless it turns on again by itself, that theory doesn't work. The Drobo is always on when I come back.

I'll try contacting Drobo Support again and see how far I get (because they've stopped responding to me on the existing ticket). Thanks for all of your help.
You're welcome, please post any answers you may get, it's valuable info for this case and further queries. Thanks.
They told me the Drobo FS is not certified for use with virtual machines and that this could be why I'm having problems. Is this even relevant? Is there any reason a virtual machine would have difficulty accessing some network devices at some times?

According to them, everything is fine with the unit, temps are good, fan is good, no errors, no NIC drops. They're sending me another unit, but not guaranteeing that this will fix it. I'm stumped.
>> They're sending me another unit, but not guaranteeing that this will fix it. I'm stumped.

Well at least they are supporting you with some hardware. Then again your D-Link is working so the issue must be with the Drobo device.

Thanks for your feedback :)
(Preventing auto-close of question)

New Drobo has arrived and will be tested out on Monday.
Still failing, even though everything pointed to the Drobo as the culprit...

Another backup last week to the D-Link also failed. I guess it must be the network or DPM server now.
SOLUTION
Gerwin Jansen
The DPM server is virtualized, so I can't directly connect anything to it.
Then only (virtual) network components remain? I don't know how to test those...
ASKER CERTIFIED SOLUTION
Thanks for your response. It's unfortunate if that is the case, but recently I've been thinking that the issue is not being caused by the Drobo, because I've had the same failures and the same errors with my D-Link NAS. It could be completely coincidental, because I've seen purple lights on the D-Link more than once (indicating some sort of failure, even though the drives were good), with no documentation or explanation from D-Link except for them asking me to restart it.

I'll look into some other NAS products. Thanks to both of you for your assistance. Points awarded for effort.
Thanks for the points and your feedback :) Good luck in finding another NAS device.
I'm confident that I have now discovered the solution for this. What I did was lower the volume usage on the Drobo significantly by first reducing my tape retention, erasing now-expired tapes, and removing some tapes from the media map so that that space wouldn't ever be used. I let scheduled backups occur and apart from a couple of disconnections likely due to a scheduled job starting while tapes were still being erased, backups have succeeded! I did this 4 weeks ago. The only failures I've had are due to running out of tapes, so I just need to tweak my backup strategy to provide the appropriate number of tapes.

Note that at no time did the Drobo become full. The virtual tape sets were limited to 1800 GB out of 1860 GB, and even then they were not using the entire 1800 GB. Right now it is at 72% of capacity; previously it was at about 90%.
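
To put rough numbers on those percentages, here is a small worked calculation. It assumes "capacity" means the 1860 GB figure above and takes the "about 90%" as exactly 90, so the results are approximate; `used_gb` is just a helper name for this example.

```python
def used_gb(capacity_gb, percent_used):
    """Convert a percent-of-capacity figure into gigabytes."""
    return capacity_gb * percent_used / 100


CAPACITY_GB = 1860    # usable capacity quoted above
TAPE_LIMIT_GB = 1800  # ceiling configured for the virtual tape sets

# Share of the volume the tape sets were allowed to fill.
limit_percent = round(TAPE_LIMIT_GB / CAPACITY_GB * 100, 1)

before_gb = used_gb(CAPACITY_GB, 90)  # usage while backups were failing (approximate)
after_gb = used_gb(CAPACITY_GB, 72)   # usage after expiring and erasing tapes
freed_gb = round(before_gb - after_gb, 1)

print(f"Tape limit: {limit_percent}% of capacity")
print(f"Before: about {before_gb:.0f} GB used, after: about {after_gb:.0f} GB used")
print(f"Freed: about {freed_gb} GB")
```

Under those assumptions the tape sets were allowed to reach roughly 97% of the volume, usage at the time of the failures was around 1674 GB, and the cleanup brought it down to around 1339 GB, freeing roughly 335 GB.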

Given that this occurred with both the Drobo and the D-Link NAS, it could be a problem with Windows Server itself. I'll see about contacting Microsoft Support about this.