cmdown

Windows Server - Delayed Write Failed

Dear All
Firstly, my apologies for a prolonged absence from the EE site, and a Happy Christmas to all members.

We use a combination of a dedicated backup device and 2 HP servers running Server 2003 to back up our data.  The 2 HP backup servers have stopped working: if we try to copy a large file or run a backup (using NTBackup), we receive a Delayed Write Failed error at any point from 100KB to 100GB of data written to the target.

The source server is Windows 2003 x86 Std Ed SP2 - SCSI Raid 5
target server 1 is Windows 2003 x86 Std Ed SP2 - HP (NVidia) Sata Raid 5
target server 2 is Windows 2003 x64 Std Ed SP2 - HP (NVidia) Sata Raid 5
The switches are all HP Procurve, speed auto.  I can't determine at this point (offsite) which ports the servers are connected to, but most ports have flow control enabled, with just a few having it disabled in the switch config.
Source server is connected to an HP 4104GL switch
Both target servers are in other buildings, connected to HP 2610-48 switches
The primary link between the buildings is OM1 fibre, 100Mb copper to the switches
The source and target server NICs are set to 1000Mb Full Duplex, rather than being set to Auto

I have also tried applying the Opportunistic Locks fix referenced elsewhere on EE, to no avail.  The Microsoft hotfixes for SMB initialise errors only apply to SP1.

NTBackup used to be quite happy backing up large amounts of data to the 2003 servers.  Please can anyone help resolve this issue?

Many Thanks
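
For anyone trying to reproduce this kind of failure, here is a minimal sketch of a write probe, assuming a Python interpreter is available on the source machine (that is an assumption, not something from the setup above). It writes a file in chunks, forces each chunk out of the write cache, and reports how many bytes were confirmed before any error. The function name `probe_write` and the chunk size are hypothetical choices for illustration, not part of NTBackup or any Windows tool.

```python
import os


def probe_write(target_path, total_bytes, chunk_size=1024 * 1024):
    """Write total_bytes of zeros to target_path in chunks.

    Returns (bytes_confirmed, error). error is None on success, otherwise
    the OSError raised by the failing write, flush, sync or close.
    """
    written = 0
    chunk = b"\0" * chunk_size
    try:
        with open(target_path, "wb") as fh:
            while written < total_bytes:
                n = min(chunk_size, total_bytes - written)
                fh.write(chunk[:n])
                fh.flush()
                # Ask the OS to commit the data now, so a failure shows up
                # at this offset instead of later as a lazy-write error.
                os.fsync(fh.fileno())
                written += n
    except OSError as exc:
        return written, exc
    return written, None
```

Pointing `target_path` at a file on the backup share and running it several times would show whether the failure offset is consistent (which hints at a bad region on the target) or random (which hints at the link or the controller). It is only a probe and does not replace the backup itself.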
David

The problem is likely that your disks are timing out because they don't recover from errors quickly enough.   I bet you are using consumer-class instead of enterprise SATA.   The enterprise sata will recover in just a few seconds, while consumer class SATA disks can take 30+ seconds to recover from read/write errors.
Avatar of cmdown
cmdown

ASKER

Hi dlethe

Thanks for the rapid reply.  However, both target servers are using enterprise-class Seagates (ES.2) - 4 x 500GB in a Raid 5 arrangement.

I have also tried writing directly to a consumer-grade 2TB drive connected via USB2, and I have been able to write some data this way.
I have seen this problem on a specific server that I use for redundant backups of other servers using NTBACKUP.  I pulled out more than a few hairs trying to figure this out.  It turned out that some of these backups had grown in size to the point where they were overlapping (like 2 backups occurring simultaneously). As long as I manage the scheduling so the backups never overlap, this problem does not occur.  I tried all kinds of BIOS updating etc. on the target and nothing changed until I took steps to prevent overlapping backups.  It works fine with that "fix".
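
A rough sketch of how the overlap check described above could be automated, assuming you can list each backup job's start time and typical run time by hand. The job names and times in the usage note are made up for illustration, and `find_overlaps` is a hypothetical helper, not an NTBackup feature.

```python
from datetime import datetime, timedelta


def find_overlaps(jobs):
    """Return pairs of job names whose run windows overlap.

    jobs is a list of (name, start, duration) tuples, where start is a
    datetime and duration is a timedelta.
    """
    ordered = sorted(jobs, key=lambda job: job[1])
    overlaps = []
    for i, (name_a, start_a, dur_a) in enumerate(ordered):
        end_a = start_a + dur_a
        for name_b, start_b, _ in ordered[i + 1:]:
            if start_b >= end_a:
                break  # sorted by start time, so no later job can overlap job a
            overlaps.append((name_a, name_b))
    return overlaps
```

For example, a job starting at 22:00 that runs for three hours overlaps one starting at 00:30 the next morning, so the pair would be reported. Durations need revisiting as the backups grow, since that growth is what caused the overlap in the first place.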
Avatar of cmdown

ASKER

I meant to say above that we have been happily backing up to the target servers for over 18 months with no problems until now.
Avatar of cmdown

ASKER

Hi IT Monkey
I'm getting this error with just a single backup hitting the target server.  Network traffic is not an issue at the moment, as all users are off this week for Christmas.
Have you had the controller run data consistency check/repairs yet?  You have a crappy low-end controller, and it does NOT automatically do this.  All you need are a few bad blocks in a row and you get the same problem.

Kick off the parity check/rebuild, or whatever the controller calls it, and check the event log on the RAID as well. (Not the event log in Windows, but the one within the controller BIOS, if it has such a thing.)
When this problem first cropped up I thought for sure it had to be drive-related but it was happening on multiple drives (these are just single desktop-grade SATA drives, with no RAID).  I was never able to find any evidence of a drive issue though and as long as I prevent the overlapping backups, all seems fine.  I dunno.  This was a head-scratcher.
Avatar of cmdown

ASKER

Hi Both
Re drive failures - yes, we had a couple of the drives fail. (We've also had a brand new Constellation fail - replacement series for ES.2). The drives were replaced and, in the case of 1 server, it was fully rebuilt from the ground up - array wiped, recreated, new OS install, fully patched.  dlethe - I'm not sure that facility even exists, but I'll be able to have a look when I am back on site next week.  As for updates, both servers are fully patched, so it is possible that a Win update has introduced something.  I'll try checking for any other driver patches as well, but I think they are all in place.  From both of your comments, I take it that you are both happy that there is not a network / SMB issue at work here?
Seagate has a firmware update for the ES.2 500GB drives that's supposed to improve their reliability.  My desktop tech knows more about it but he's off this week.  I'm sure you can find the update fairly easily on Seagate's support site.  Might be worth looking at esp. if the update changes how the drives deal with error handling.  Sorry I don't have more details on that.
Bad blocks can happen 24x7.  That is why you need a decent controller that does parity rebuilds 24x7 as well.   Better to let the controller repair a bad block from parity BEFORE you need the data.  
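
To illustrate what "repair a bad block from parity" means, here is a minimal sketch of the XOR arithmetic behind RAID 5, assuming single parity and equal-sized blocks. It is a toy model of the idea only, not how any particular controller or the Windows software RAID is implemented, and the function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the one missing data block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])
```

With three data blocks and their parity, any one unreadable block can be recomputed from the other two plus the parity block. A background consistency check does this kind of read-and-verify across the whole array, which is why it finds bad blocks before a backup trips over them.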

Also, since these are crap controllers that don't even have ECC memory, all it takes is a couple of consecutive bad blocks to experience the timeout.

Actually, these controllers add nothing to the performance equation. You are much better off booting to a 2-disk RAID1 in the controller, then using windows software-based RAID5.   Then windows will manage things, and you'll be able to see problems in event manager, and rebuild / recalc parity as needed.

Those controllers are "OK" for RAID1, but still, they leave 50% of the read performance on the table.  Windows RAID-1 balances reads so that each disk handles half the I/O requests.  (No advantage on writes.)  Also, since the Windows O/S already does the caching, you'll have fewer I/Os going to the disks in the first place.

So bottom line, best cure is to get rid of the hardware RAID5, go to windows software RAID5.  Those NVIDIA controllers are simply unacceptable for enterprise use.  They belong on grandma's PC that she uses once a month to balance her checkbook :)
Avatar of cmdown

ASKER

HI IT
Thanks - I've got the update for a 1TB but it looked to be data destructive so I didn't proceed (Dell used a lot of the Seagate ES.2 drives in some of their T series servers).  I'll have another look next week.  As Server 1 was rebuilt it has no data on it so it will be worth trying - but as you've already said this is a real head scratcher, esp as it has worked for the last 18 months !
Avatar of cmdown

ASKER

Hi dlethe
The HP servers only support a max of 4 drives (ML115 G5) so I'm stuck - your solution would need a minimum of 5.
Can you suggest a reasonable 4 port sata raid controller that might be up to the job?  
Given my drive limit, is it worth booting to Disc 1, then building discs 2,3 & 4 as a software raid 5?
No, I would not recommend doing that.   Why not just get an external enclosure?  Plenty of enclosures to choose from on ebay.  LSI makes the vast majority of SAS/SATA controllers that HP, IBM, Dell, Supermicro and others slap their names on.  

Since I am partial to data integrity, especially when it comes to servers, and like to future proof, I would go with one that has a BBU option and supports SAS-2.   You could then attach external enclosures, move it to another system if necessary, whatever.   You get what you pay for when it comes to controllers.  A good controller can compensate for flaky disks.   I don't know your budget, but when you think about it, a server is a life-support system for data.  Most people put the money in the CPU & RAM, and save a buck on the HDD and controller.   Those people tend to be my customers at one time or another :)

Read specs and buy what you can afford.
Avatar of cmdown

ASKER

All
I've relocated one of the suspect servers and am currently running some further tests but those run so far are leading me to think this may be a network comms issue.  I'm away from tomorrow until 17th and will report back on any progress.
Delayed write due to network?  Not very bloody likely unless the target was a network-based share.  
Avatar of cmdown

ASKER

Hi dlethe

Share on the target was a simple folder share \\TargetServer\Sharename

The relocated HP server has had some success.  With both the source and target servers on the same switch I can now run NTBackup 3 times out of 5, but RichCopy still doesn't work, nor does a simple Windows file copy.  In one of my other open questions it was suggested I try using SyncBackPro rather than RichCopy.  This has enabled me to archive off the main data area to an external hdd, and now I am back I'll also try using it to back up to one of the HP servers.  I've got no staff at the moment so it might be next week before I can report back on any progress.
Avatar of cmdown

ASKER

Hi dlethe

I have just returned from a nasty bout of flu.  The situation is unchanged from my comments of 18th Jan.  Over the next few months we will be replacing some of our key switches - are you happy to leave the question open, or would you like me to close the question and post back once I have some more information?
I stubbornly maintain that those Seagate 500GB ES.2 drives need to be considered as a possible cause, especially if they're early production run with old firmware.   They've been nothing but grief for us.
The ES.2s are very poor drives, so could very well be the culprit, but according to author the delayed write is on a network share.  Now if that particular server was hosting the shared directory on those disk drives, then there may be something in the windows event log at the same timestamp that may be of interest.

So cmdown ... did you examine the event log on the system hosting that share to see if it logged a problem (Like a timeout on a READ??)
After all this effort and traffic, and considering the last response was asking for details which author never supplied, then it should either stay open, or at least charge the points instead of abandoning.  
Avatar of cmdown

ASKER

Hi dlethe

Nothing showed in the logs and I am still waiting for a response from the manufacturer.  As things stand, I agree with you that these supposedly enterprise-grade drives are anything but - we've had two more fail - 1TB capacity this time.  Despite that, the fact remains that we can back up to the 'problem' box now that it is on the same physical switch as the backup source.

Given that it is taking some time to elicit more detail, can I propose that I close and award points to both dlethe & it monkey for input as comments made by both are valid and have been useful.  Should we be successful in finding out further information and/or getting a response from the manufacturer I will post details here.
Arguably there is no such thing as an "enterprise" class SATA drive.  Those older models are, at best, nearline.  If you want 24x7x365, go Fibre Channel or SAS.
Avatar of cmdown

ASKER

hi dlethe

certainly as regards the ES.2 !!

Are you happy with my proposal for closing and awarding points ?
sounds good to me.
Avatar of cmdown

ASKER

Thanks both.

As and when I have more information I will update this question.