Solved

DPM server loses connection to Drobo FS during backup

Posted on 2011-10-18
37
Medium Priority
2,406 Views
Last Modified: 2013-11-21
I have a virtualized environment with a Windows Server 2008 R2 Enterprise (Core) physical server running Hyper-V, containing a Windows Server 2008 R2 Standard virtual machine on which I have installed Microsoft Data Protection Manager 2010. It backs up other VMs to a physical RAID-1 array that is implemented using hardware RAID and made available to the DPM server via Hyper-V's pass-through-disks feature. "Tape" backups are done via Cristalink Firestreamer, which provides a virtual tape library to DPM. Data gets backed up to virtual tape files that reside on the Drobo FS.  
 
I use two sets of 2xWD RE4 enterprise drives in the Drobo, which gives them a proprietary RAID-1. Every week, the Drobo is shut down, and the current disk set is moved to offsite storage. The previous week's disk set is retrieved from offsite storage and used for the next full backup (weekly, alternating between disk sets).  
 
This setup worked fine for almost 2 months before backups started failing every week because the DPM server lost the connection to the Drobo some hours into the backup. The next day, I would have to retry the backup (good thing DPM lets you resume) multiple times to get it to complete. If I try to back up to virtual tapes using another VM host as the target instead of the Drobo, the backups always succeed. The messages I get from FsHelperSvc (Firestreamer, event 30012) in the Windows Application event log when it fails are:  
 
   10/17/2011 21:58:39 | Error | E002 | L1T002 | The media drive reported the following error: The specified network name is no longer available [C000020C] | file://\\<path_to_drobo>\<tape_name>.fsrm*12288 | <tape_barcode>
    10/17/2011 21:58:39 | Error | E012 | L1T002 | Unable to write data to the medium. | file://\\<path_to_drobo>\<tape_name>.fsrm*12288 | <tape_barcode>
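
The two Firestreamer lines above are pipe-delimited, so they can be split into fields and the bracketed status code pulled out for searching or correlating with other logs. The following is only a rough sketch: the field names (`timestamp`, `severity`, `code`, `device`, `message`, `medium`, `barcode`) are guesses based on the two sample lines, not names documented by Firestreamer.

```python
import re
from datetime import datetime

# Matches an 8-digit hex status code in square brackets, e.g. [C000020C].
STATUS_CODE = re.compile(r"\[([0-9A-Fa-f]{8})\]")

# One of the lines quoted in the question, placeholders left as posted.
SAMPLE = (r"10/17/2011 21:58:39 | Error | E002 | L1T002 | The media drive "
          r"reported the following error: The specified network name is no "
          r"longer available [C000020C] | "
          r"file://\\<path_to_drobo>\<tape_name>.fsrm*12288 | <tape_barcode>")

def parse_line(line):
    """Split one log line into named fields (field names are assumed)."""
    parts = [p.strip() for p in line.split(" | ")]
    if len(parts) != 7:
        raise ValueError("expected 7 pipe-delimited fields, got %d" % len(parts))
    timestamp, severity, code, device, message, medium, barcode = parts
    match = STATUS_CODE.search(message)
    return {
        "timestamp": datetime.strptime(timestamp, "%m/%d/%Y %H:%M:%S"),
        "severity": severity,
        "code": code,
        "device": device,
        "message": message,
        "medium": medium,
        "barcode": barcode,
        # None when the message carries no bracketed status code.
        "status": match.group(1) if match else None,
    }
```

Because the timestamps only have one-second resolution, parsing them does not settle which event came first; it mainly makes it easier to line these entries up against the System log.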
 
And in the System log:

Source: clfs3mtp
Event ID: 7
Date and Time: 10/17/2011 9:58:39 PM
Description: The device, \Device\TapeDrive1, has a bad block.

"The specified network name is no longer available" is an error returned to Firestreamer by Windows when the computer is unable to access the network share, so Firestreamer is likely not the problem.  

Drobo Support has not been able to resolve the problem. They've had me update Drobo Dashboard, update the Drobo firmware, switch Ethernet cables, copy files to the Drobo's file share using Windows Explorer, check for static IP, uninstall and reinstall Drobo Dashboard, and downgrade the Drobo firmware. None of this has helped. Drobo diagnostic files are encrypted, so there's no way for me to see what the Drobo experiences. Support has told me that its network connection is up at the time of the failure, but I still suspect that it's dropping the connection or that maybe there is file system corruption.  
 
I'll stress that I can access the Drobo from Windows Explorer on the server, so it's not like the server can't contact the Drobo at all under normal circumstances.

The most recent thing I tried was turning off disk spindown in the Drobo's settings. I also installed this hotfix from Microsoft.
Question by:joshvazquez
  • 18
  • 16
35 Comments
 
LVL 38

Expert Comment

by:Gerwin Jansen, EE MVE
ID: 36991588
Hello joshvazquez, nice setup you have. I like the part where you move the disk set offsite.

The event log messages you show include a tape-drive-related error message as well. Do you know when these tape error messages started? What happens when streaming to the tape drive is running and you get this 'bad block' message? Does the backup process stop at that moment? The timestamps of all the messages are the same, so it's hard to tell what happens first.
0
 

Author Comment

by:joshvazquez
ID: 36993953
Fortunately the System event log still has older events, so I see that the tape drive event started occurring at the same time backups started failing (7/16/2011). I can't tell if that event is the cause or the effect. Keep in mind that these are not physical tapes.

A few seconds after these events, I get an email alert from DPM telling me that the backup has failed ("DPM encountered a critical error while performing an I/O operation on the tape xxxx").

Possibilities:

a) Firestreamer detects a bad block on a virtual tape, possibly caused by file system corruption on the Drobo. It logs the "bad block" event. Something happens now to drop the connection, and the other two events (specified network name, unable to write data) are logged because the connection has been lost. The backup then fails.

b) Something causes a connection drop, and the top two events are logged. Firestreamer logs the "bad block" event because the connection has been lost while the virtual tapes are still mounted, so it has a problem trying to read or write to a tape that it thinks is still there (similar to pulling out an external hard drive without unmounting it). The backup then fails.
0
 
LVL 38

Expert Comment

by:Gerwin Jansen, EE MVE
ID: 36995832
I agree that it is quite difficult to determine what happens first (cause and effect).

Is there a way to make sure the connection stays open? On a workstation (W7) that kept on losing Samba connections I created a simple script triggered by the Task Scheduler every 5 minutes. The script just performs a 'net view \\<smb server>\<smb share>', which kept the connection open. Worked perfectly for that workstation, maybe you can give it a try.

About the date the backups started failing: what other things happened on that day? Were there any updates installed, hardware added/removed, anything else that you remember or can gather from the logs?
0
 

Author Comment

by:joshvazquez
ID: 36996150
I can do "net view \\drobo", but not "net view \\drobo\share1". The latter gives me this:

System error 53 has occurred.
The network path was not found.

I'll try a script with the first command and see how it goes.

No updates were installed on the DPM server on or around that day. No hardware changes were made, as best as I can remember.

There are some possible solutions in this question, so I'll try those out as well.
0
 
LVL 38

Expert Comment

by:Gerwin Jansen, EE MVE
ID: 36996260
Ok, let me know if you make any progress. Note that the "System error 53" is an error we experienced with the W7 / Samba combination as well. We had to upgrade our Samba server but there was some (client) registry setting as well. I can look this up for you tomorrow.
0
 

Author Comment

by:joshvazquez
ID: 36996288
When I retry the backup, occasionally I'll see it fail while I'm working. Should I start monitoring something and wait for it to fail to see if I can get more information? For example: ping -t drobo, or some performance counters. This is a good opportunity because normally it will fail overnight while I'm not on.
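
One way to monitor this while waiting for a failure is to record timestamped reachability changes rather than watching `ping -t` scroll by. The sketch below is a minimal example under some assumptions: it probes TCP port 445 (the standard SMB-over-TCP port, which the NAS is assumed to answer on), and `"drobo"` is just the host name used in this thread. A successful TCP connect only shows the NAS is accepting connections; it does not prove the share itself is healthy.

```python
import socket
import time
from datetime import datetime

def smb_reachable(host, port=445, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def watch(host, interval=5.0, checks=None, port=445):
    """Print and collect a timestamped entry whenever reachability changes.

    checks=None loops until interrupted; an integer limits the probe count.
    """
    last = None
    done = 0
    changes = []
    while checks is None or done < checks:
        state = smb_reachable(host, port)
        if state != last:
            stamp = datetime.now().isoformat(timespec="seconds")
            print(f"{stamp} {host}:{port} {'up' if state else 'DOWN'}")
            changes.append((stamp, state))
            last = state
        done += 1
        if checks is None or done < checks:
            time.sleep(interval)
    return changes

# Example (runs until interrupted with Ctrl+C):
#   watch("drobo")
```

If the probe logs DOWN at the same second the backup fails, that points at the network path or the NAS; if it stays up throughout, the drop is more likely at the SMB session level.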
0
 
LVL 38

Expert Comment

by:Gerwin Jansen, EE MVE
ID: 36996338
You can try ping -t, but this will not keep the connection to your Drobo open. I used the "net view" command to do this as I noticed that Windows will disconnect 'unused' connections after 15 minutes. That's why I scheduled the script every 5 minutes. Maybe you can just open some document on your Drobo with Word, but I don't know if you have it installed on your server.
0
 

Author Comment

by:joshvazquez
ID: 37003206
Created a batch file to execute "net view \\drobo" and scheduled it to run every 5 minutes. Backup ran for a few hours before failing as before.
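
For anyone who wants the keep-alive to leave a trace, the same idea can be sketched as a small loop that runs `net view` and records the exit code each time, so the first failed check gets a timestamp. This is a hypothetical variant, not the batch file used here; the 300-second interval mirrors the 5-minute schedule, and `"drobo"` is the host name from the thread.

```python
import subprocess
import time
from datetime import datetime

def net_view_command(server):
    """Build the argument list for `net view` against a UNC server name."""
    return ["net", "view", "\\\\" + server.lstrip("\\")]

def keep_alive(server, interval_seconds=300, rounds=None, run=subprocess.run):
    """Run `net view` repeatedly; return a list of (timestamp, exit code).

    rounds=None loops until interrupted. `run` can be swapped for testing.
    """
    results = []
    done = 0
    while rounds is None or done < rounds:
        completed = run(net_view_command(server), capture_output=True, text=True)
        stamp = datetime.now().isoformat(timespec="seconds")
        results.append((stamp, completed.returncode))
        if completed.returncode != 0:
            print(f"{stamp} net view failed with exit code {completed.returncode}")
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval_seconds)
    return results

# Example (Windows only, runs until interrupted):
#   keep_alive("drobo")
```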
0
 
LVL 38

Expert Comment

by:Gerwin Jansen, EE MVE
ID: 37008716
Too bad. I've been reading about the Drobo diagnostics logfiles and their encryption. Here's a procedure that should allow you to read them, just by uploading the logfiles to your desktop. Does that work for you?
0
 

Author Comment

by:joshvazquez
ID: 37009959
That works for older Drobos, but unfortunately the newer Drobos encrypt the files. Haven't been successful decrypting them using a script.
0
 
LVL 38

Expert Comment

by:Gerwin Jansen, EE MVE
ID: 37010435
Can you post a few lines of such a logfile (or complete if not too big) so we can have a look at it?
0
 

Author Comment

by:joshvazquez
ID: 37026475
I'd rather not post it, as it may contain information about my network or servers that I don't want public. Sorry.
0
 
LVL 38

Expert Comment

by:Gerwin Jansen, EE MVE
ID: 37026709
Ok, no problem, I can agree on that :) I'm thinking we are running out of options here with your Drobo device.

Are you able to do the backup to a different device for testing purposes? I mean another storage device like a NAS or a 2 TB USB device? That way you could determine whether the problem is in the Drobo device or on your server side.

If you don't have the problem with another device, I'd start complaining with Drobo.
0
 

Author Comment

by:joshvazquez
ID: 37028173
Come to think of it, a few previous backups have failed the same way, with the same errors, on a D-Link DNS-323 NAS. The NAS had been giving us other problems that I can't quite remember, so I thought the failures had something to do with those. We still have it, so I'll try backing up to that. All attempts to back up to a different server have been successful though.

Could it be a bad switch port? I'll change that around also.
0
 
LVL 38

Expert Comment

by:Gerwin Jansen, EE MVE
ID: 37030066
>> Could it be a bad switch port? I'll change that around also.

Well, I guess this is possible of course. If you're going that way, you may have to check more components. But changing a switch or port can be done relatively easily.
0
 

Author Comment

by:joshvazquez
ID: 37038891
Different port did not give any improvement. Also disabled IPv6 on the DPM server as suggested in another thread, no improvement. I can't try backing up to the D-Link yet because I am retrying the regular backups until they complete, so that I can take the disks offsite. After all backups eventually complete I'll try the D-Link.

I'm going to try using Network Monitor on the DPM server while this backup is going to see if I can catch some packets at the time of failure and maybe see what's going on. The problem is that there are way too many packets (200,000+ in 5 minutes, and this is only packets to or from this server) and I don't know what to look for. Any suggestions on filtering the capture or a tool for analyzing the log?

Thanks.
0
 
LVL 38

Expert Comment

by:Gerwin Jansen, EE MVE
ID: 37040104
If you're going to capture, I'd set a filter on error or retransmit packets only, what tool are you going to use, Wireshark?
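
If the capture is opened in Wireshark, a display filter along these lines would narrow 200,000+ packets down to retransmissions and resets involving the NAS. This is a sketch only: the address `192.0.2.10` in the usage line is a placeholder, and the filter uses Wireshark display-filter syntax, which is not the same as Microsoft Network Monitor's filter language.

```python
import ipaddress

def retransmit_filter(nas_ip):
    """Build a Wireshark display filter for TCP trouble to or from one host."""
    # Validate the address; raises ValueError for anything that is not IPv4.
    addr = ipaddress.IPv4Address(nas_ip)
    return (
        f"ip.addr == {addr} && "
        "(tcp.analysis.retransmission || "
        "tcp.analysis.fast_retransmission || "
        "tcp.flags.reset == 1)"
    )

# Example with a placeholder address:
#   print(retransmit_filter("192.0.2.10"))
```

A burst of retransmissions followed by a reset just before the failure timestamp would suggest packets are being lost on the way to or from the NAS; a reset with no preceding retransmissions would suggest one side closed the session deliberately.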
0
 

Author Comment

by:joshvazquez
ID: 37041256
I'm using Microsoft Network Monitor, but I can try Wireshark as well.
0
 

Author Comment

by:joshvazquez
ID: 37048097
Now I can't even get it to fail while Network Monitor is on...
Backup ran for 8 hours straight yesterday (to Drobo) with Network Monitor capturing packets. I stopped NM so it wouldn't cause the system to crash (too many packets) and went home, and after 3.5 more hours the backup failed.

Today, it has run for almost 7 hours with NM on, and still no failure during capture. I have to cancel this now so I can move the disk pack offsite, and I'll try backing up to the D-Link NAS later. This is bizarre...
0
 
LVL 38

Expert Comment

by:Gerwin Jansen, EE MVE
ID: 37050916
Strange, but maybe your NM is keeping your network card from 'falling asleep'. Can you check the power management settings on all NICs?
0
 

Author Comment

by:joshvazquez
ID: 37057654
100% success backing up to D-Link NAS.

I don't see any power management settings on the DPM server's NIC. The Drobo's I can't configure.

Another idea is that excessive heat is affecting the Drobo or the drives, causing a loss of connection or some sort of read error. Too bad there's no way to check the temperature in Drobo Dashboard...
0
 
LVL 38

Expert Comment

by:Gerwin Jansen, EE MVE
ID: 37059270
>> 100% success backing up to D-Link NAS.
This is great news :) This means no server or network issue but strongly points to the Drobo device IMHO.

>> Too bad there's no way to check the temperature in Drobo Dashboard...
You think it's running hot? You could set up a couple of fans beside it and test. From the Drobo support site:

Temperature: Every Drobo storage device has an internal temperature sensor. This controls operation of its cooling fan. In extreme cases the internal temperature may approach an unsafe level. If any of the drives reach their maximum operating temperature of 60°C, the Drobo device will shut itself down to protect the disks and its electronics.

This would mean that you don't have a temperature problem, you would notice when the Drobo shuts down, right?

I'm starting to think that your Drobo device has a few 'unwanted' features and that support is not helping you. I'd draw my conclusion at this point.
0
 

Author Comment

by:joshvazquez
ID: 37060029
>> This would mean that you don't have a temperature problem, you would notice when the Drobo shuts down, right?

It could be shutting down and causing the loss of connectivity, but unless it turns on again by itself, that theory doesn't work. The Drobo is always on when I come back.

I'll try contacting Drobo Support again and see how far I get (because they've stopped responding to me on the existing ticket). Thanks for all of your help.
0
 
LVL 38

Expert Comment

by:Gerwin Jansen, EE MVE
ID: 37061712
You're welcome, please post any answers you may get, it's valuable info for this case and further queries. Thanks.
0
 

Author Comment

by:joshvazquez
ID: 37065500
They told me the Drobo FS is not certified for use with virtual machines and that this could be why I'm having problems. Is this even relevant? Is there any reason a virtual machine would have difficulty accessing some network devices at some times?

According to them, everything is fine with the unit, temps are good, fan is good, no errors, no NIC drops. They're sending me another unit, but not guaranteeing that this will fix it. I'm stumped.
0
 
LVL 38

Expert Comment

by:Gerwin Jansen, EE MVE
ID: 37066468
>> They're sending me another unit, but not guaranteeing that this will fix it. I'm stumped.

Well at least they are supporting you with some hardware. Then again your D-Link is working so the issue must be with the Drobo device.

Thanks for your feedback :)
0
 

Author Comment

by:joshvazquez
ID: 37089397
(Preventing auto-close of question)

New Drobo has arrived and will be tested out on Monday.
0
 

Author Comment

by:joshvazquez
ID: 37116487
Still failing, even though everything pointed to the Drobo as the culprit...

Another backup last week to the D-Link also failed. I guess it must be the network or DPM server now.
0
 
LVL 38

Assisted Solution

by:Gerwin Jansen, EE MVE
Gerwin Jansen, EE MVE earned 1200 total points
ID: 37117669
I agree that your backup device (Drobo or D-Link) is most likely not the issue. Getting a bit hard to debug this, any way you can connect your backup device directly to your server? You could rule out network problems that way. If it were the server I'm not sure if I can think of a way to test that.
0
 

Author Comment

by:joshvazquez
ID: 37134832
The DPM server is virtualized, so I can't directly connect anything to it.
0
 
LVL 38

Expert Comment

by:Gerwin Jansen, EE MVE
ID: 37154803
Then only (virtual) network components remain? I don't know how to test those...
0
 
LVL 18

Accepted Solution

by:LesterClayton
LesterClayton earned 300 total points
ID: 37248820
I've just been notified about this issue, and I'm sorry to say that through personal experience with the Drobo itself, I believe it to be the cause of the problem.  I have owned a Drobo FS in the past, and sold it on to a colleague because it's just not fast enough for my needs, and when you place huge files on the Drobo, the Drobo craps out and you lose your connection.  DPM will be placing lots of huge files on the Drobo, and the filesystem inside the Drobo will plainly and simply not cope with it.

I now use a Synology Diskstation instead of the Drobo FS.

Sorry I couldn't provide more positive feedback.
0
 

Author Closing Comment

by:joshvazquez
ID: 37251304
Thanks for your response. It's unfortunate if that is the case, but recently I've been thinking that the issue is not being caused by the Drobo, because I've had the same failures and the same errors with my D-Link NAS. It could be completely coincidental, because I've seen purple lights on the D-Link more than once (indicating some sort of failure, even though the drives were good), with no documentation or explanation from D-Link except for them asking me to restart it.

I'll look into some other NAS products. Thanks to both of you for your assistance. Points awarded for effort.
0
 
LVL 38

Expert Comment

by:Gerwin Jansen, EE MVE
ID: 37252691
Thanks for the points and your feedback :) Good luck in finding another NAS device.
0
 

Author Comment

by:joshvazquez
ID: 37481086
I'm confident that I have now discovered the solution for this. What I did was lower the volume usage on the Drobo significantly by first reducing my tape retention, erasing now-expired tapes, and removing some tapes from the media map so that that space wouldn't ever be used. I let scheduled backups occur and apart from a couple of disconnections likely due to a scheduled job starting while tapes were still being erased, backups have succeeded! I did this 4 weeks ago. The only failures I've had are due to running out of tapes, so I just need to tweak my backup strategy to provide the appropriate number of tapes.

Note that at no time did the Drobo become full. The virtual tape sets were limited to 1800 GB out of 1860 GB, and even then they were not using the entire 1800 GB. Right now it is at 72% of capacity; previously it was at about 90%.

Given that this occurred with both the Drobo and the D-Link NAS, it could be a problem with Windows Server itself. I'll see about contacting Microsoft Support about this.
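
To put the figures above in absolute terms, here is a quick back-of-the-envelope calculation. It assumes the 72% and 90% figures are percentages of the 1860 GB volume, which the post does not state explicitly.

```python
TOTAL_GB = 1860       # volume size quoted above
TAPE_LIMIT_GB = 1800  # cap on the virtual tape sets quoted above

def used_gb(percent, total=TOTAL_GB):
    """Convert a percent-of-capacity figure into gigabytes."""
    return total * percent / 100

# Share of the volume that the tape cap alone could have filled.
limit_percent = TAPE_LIMIT_GB / TOTAL_GB * 100

before = used_gb(90)    # approximate usage while backups were failing
after = used_gb(72)     # usage once backups started succeeding
freed = before - after  # space released by trimming tape retention

print(f"tape cap: {limit_percent:.1f}% of the volume")
print(f"before: {before:.0f} GB, after: {after:.1f} GB, freed: {freed:.1f} GB")
```

Under that assumption the tape cap allowed up to roughly 97% of the volume to be used, and the change freed about 335 GB, leaving around 520 GB of headroom instead of under 190 GB.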
0
