Solved

Very slow data throughput on all drives (25-30MB/s)

Posted on 2008-10-17
Last Modified: 2011-10-19
We have 2 identical servers running Windows Server 2003 Standard Edition with Terminal Services. One is also acting as a Domain Controller. Both servers have a mirrored pair for each logical drive (C:\ on one mirror and D:\ on the other). Pagefile usage on both servers is the same: system managed on the D:\ drive, with performance optimised for background services rather than programs.

The problem is that the DC server is getting a drive read/write throughput of around 25-30 MB/s, whereas the other server is getting about 270 MB/s. We see this on all drives, including external SCSI drives and even pen drives. We get these readings using PerformanceTest V6.1.

We have swapped the drives between the servers (DC drives into the Terminal Server and vice versa). The problem followed the drives: the DC drives in the Terminal Server chassis now show the throughput slowdown, while the Terminal Server drives in the DC chassis get the high throughput.

I am attaching a HijackThis log file for a little more insight into the slow-performing DC.

Thank you in advance for your help on this issue.


hijackthis.log
Question by:janglesea
 
LVL 22

Expert Comment

by:Paka
Since the performance issue followed the drives, it probably isn't a hardware issue.  Have you checked to see if all configuration items are the same between the two servers?  How about drivers and firmware, are they the same between the two servers?  How much free space do you have on the drives?  Have you checked fragmentation?
 

Author Comment

by:janglesea
To eliminate the hardware, we moved the RAID drive pairs (mirrored pair for C: and mirrored pair for D:) to another server with identical hardware, and drive I/O was still down at 29MB/sec. On our other servers we get 300MB/sec or more.

All our servers are Dell PowerEdge 2850 with dual Xeon 3GHz, 4GB RAM and mirrored (RAID1) 73GB HDD for C:, same for D: and one global hot spare. All drives are SCSI 320 10K rpm. In the spare server the problem moved with the C: RAID1 pair, so we are almost sure this is a Windows issue, not hardware. Imaging the Server 2003 installation onto another drive/drive pair will confirm this. The use of the spare server eliminated everything else.

The O/S is Server 2003 Standard. The GC is slow; our PDC is fast (disk transfers 300MB/sec). The four servers we have are all Dell 2850. The first two we bought have PERC 4/DC RAID controllers, the second two have PERC 4e/DC. Other than that they are identical.

We have exhaustively scanned the slow server and run Performance Monitor and Process Explorer. The disk speed is consistently slow on any drive (RAID logical drives benchmark at 29MB/sec). While the disk read and write speeds are slow, CPU utilization is negligible (3%) and disk idle time is 95%. Network utilization is less than 1%.
Unloading all antivirus software has no effect on the drive speed observed.
We have run Trend and Computer Associates on-line and off-line virus scanners and have found nothing.
We have defragmented the drives (twice). There is 25% free space on the C: drive, more on D:.
 
LVL 22

Expert Comment

by:Paka
Have you tried breaking the mirror to see if it might be related to a sync issue?  Have you checked the SMART status of the drives?  Are there an abnormal number of read or write CRCs?
 

Author Comment

by:janglesea
As Dell seems to have limited diagnostics for the PERC RAID card, we did some wider searching and determined that the PERC 4e/DC is an OEM version of the LSI MegaRAID 320-2E. LSI provides both a newer driver for this card and a management utility with performance monitoring and diagnostics. I loaded this on both the slow server and the PDC, which we are using as a benchmark (same spec; it is the PDC while the slow server is the GC).

On the slow server, when I run the Passmark hard drive benchmark over 60 secs, there is a surge of data as the test file is written and then constant drive activity during the remainder of the test as data is read from and written to the test file. On the fast server the initial surge is the same and of a similar order of magnitude, but for the remainder of the test period drive activity as shown by the RAID performance monitor is almost zero. I first thought this was very odd, but having thought about it I suspect there is some caching going on on the fast server which is missing on the slow one. The fast server will read/write 10GB in a minute, whereas the slow one will only manage 880MB: more than ten times slower.

I am going to try taking an image off the slow RAID onto a simple SCSI drive of the same spec and see what happens. I am not sure how to check CRC errors to/from the RAID.
 
LVL 30

Expert Comment

by:Duncan Meyers
>read/write throughput of around 25 - 30 MB/s where the other server is getting throughput of about 270MB/s
Cow cookies!

270MB/s sustained performance is simply not possible on a pair of mirrored drives. Even the most optimistic of hard disc drive manufacturers don't claim more than about 90-99MB/sec for a 15K rpm SCSI disc (and that's an optimised never-seen-in-the-real-world benchmark number), so 270MB/sec must be in error. Are you comparing apples with apples? Is one server reporting results in MB/s (bytes) and one in Mb/s (bits)? Are you sure that PerformanceTest is configured identically on both servers? Is what you're seeing simply small block writes to write cache on the "faster" server? Are you sure that both servers are configured identically? Take PerformanceTest out of the equation altogether: how long does each server take to copy a 4GB file from a network source to the C: and D: drives?

Incidentally, around 30MB/sec is what I'd expect as good performance from the hardware config you've described for a typical random server workload with 50% reads, 50% writes, so you may be chasing shadows.
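
The file-copy timing suggested above could be scripted so that both servers run exactly the same measurement. The following is a minimal sketch in Python, not something anyone in the thread actually ran; the function name `timed_copy` and the command-line usage are assumptions made for illustration. For a meaningful number the source file should be large (such as the 4GB file suggested above), otherwise the result mostly reflects caching rather than disk throughput.

```python
import os
import shutil
import sys
import time


def timed_copy(src, dst):
    """Copy src to dst; return (bytes copied, elapsed seconds, MB per second)."""
    size = os.path.getsize(src)
    start = time.perf_counter()
    shutil.copyfile(src, dst)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very small files and coarse timers.
    mb_per_sec = (size / (1024 * 1024)) / elapsed if elapsed > 0 else float("inf")
    return size, elapsed, mb_per_sec


if __name__ == "__main__":
    # Hypothetical usage: python timed_copy.py <source file> <destination file>
    if len(sys.argv) == 3:
        size, elapsed, rate = timed_copy(sys.argv[1], sys.argv[2])
        print(f"{size} bytes in {elapsed:.1f} s = {rate:.1f} MB/sec")
    else:
        print("usage: timed_copy.py <source file> <destination file>")
```

Running it with the same source file against the C: and D: drives of each server gives figures in one unit (MB/sec, bytes rather than bits), which sidesteps the bits-versus-bytes question.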
 

Author Comment

by:janglesea
Alas, we only started this hunt when we observed that the slow server in question was taking far too long to back up. We use Arcserve 11.5 over our gigabit LAN. The agent on our other, fast servers reports 250-300 MB/min during backup to our Arcserve server, whereas the slow server reports 75-100 MB/min. 300MB/min = 5MB/sec, so even if the performance benchmark is inaccurate (and for comparison I do not need it to give a calibrated figure for disk transfer) we are comparing apples with apples and have a real discrepancy. I accept that the performance benchmark may be amplifying the real state, but it is a genuine difference. We have done local and across-the-LAN transfers of both large files (500MB) and sets of small files. The slow server is still 2-3 times slower than any other box on the network running real transfers. We have compared the BIOS, RAID, LAN and Server 2003 settings on the identical servers and still can't see anything different except this persistent slow drive throughput.

As suggested by Paka, I checked the drive SMART status and the drives are happy. I also did a resync of the RAID: no problems. The RAID card has internal error logging in firmware and it shows no errors.
 
LVL 22

Expert Comment

by:Paka
Have you run your backup tests using identical files?
 

Author Comment

by:janglesea
Hi
Yes, we have a test dataset of 10GB which is a mix of Office files and Sage backup files (accounting records in compressed format); some of the latter are several hundred MB, most about 50MB. This is a good mix to test. Cut and paste between the slow server and the backup server is slow; between a normal server and the backup server it is fast. The same is true of backing up. We have tried backing up with and without the Arcserve client but the difference remains the same.

The bottleneck is disk I/O on the slow server. It is not just reading, writing is also slow.

There is an antivirus program on all servers but we have this disengaged from scanning and realtime operation during the backup.
 
LVL 22

Expert Comment

by:Paka
Are read and write caching configured the same between the servers?  How about from Windows?
 

Author Comment

by:janglesea
The raid controller driver does not provide for user configuration of the caching in Windows. (Device Manager, Disk drives, diskx, Policies; shows caching on but is greyed out).  I cannot find any other tuning parameters in Windows to change. Caching is configured on the RAID card using either the LSI management utility or the card BIOS. I have set up the card so that the disk i/o is cached, this is the same on all servers (and we have tested the card by substitution as described in the above comment).

 
LVL 30

Expert Comment

by:Duncan Meyers
That helps clarify things. Download a trial version of Diskeeper from http://www.diskeeper.com/defrag.asp, defrag the discs and see how you go. I've seen backup times cut in half by defragging.
 

Author Comment

by:janglesea
As stated in the above comment, we have defragged the hell out of these drives. We use PerfectDisk Server 2008 to position files on the faster area of the hard drive, make files contiguous and defrag free space. It isn't a fragmentation issue, sadly.
 
LVL 22

Assisted Solution

by:Paka
Paka earned 50 total points
How many users are you supporting with the DC?  Which PerformanceTest Disk statistic are you quoting for the 25MB/S and 270MB/S rates?  (Disk - Sequential Read, Disk Sequential Write, Disk Random Seek + RW, Disk Mark, or PassMark Rating)

How does the CD Read Test look on both machines?  Have you run PerformanceTest on any other server class machines?

Sorry about the number of questions...
 

Author Comment

by:janglesea
Questions are good, thank you so much for your time.

We have up to 40 users on the DC, but we have done the tests when the systems are off-line to users.

The Passmark tests we are running are sequential read and sequential write, setting up a test job to do both simultaneously over a period of 1 minute using a 1MB test file and 4096-byte block size. We are running like-for-like tests, i.e. with no LAN activity, no users logged in, no T/S sessions active, and no antivirus or background tasks running. As mentioned above, we are making sure that there is no CPU utilization or drive activity present when we run the tests (and we have also run local and across-the-LAN real-world tests using a mixed set of 10GB of data files: 102,077 files, from Word docs to zipped archives). The Passmark results may not be strictly accurate, i.e. what it says is 270MB/sec may be bits/sec instead of bytes or anything you like, but the apples-for-apples Passmark tests show about a 10x difference between two hardware-identical DCs running Windows Server 2003 SP2 patched to the latest updates.

I will run the CD test and report back.

A Passmark test of a USB memory stick showed 25MB/sec in the slow server, 200MB/sec on the normal server.
 

Author Comment

by:janglesea
OK, CD test results from Passmark (30 sec test, 32KB block size):
Normal server, uncached Win32 API: 0.87MB/sec (= 6x CD)
Slow server, uncached Win32 API: 0.88MB/sec (= 6x CD)
So no difference in the CD reading test.
 

Author Comment

by:janglesea
Further caching information...

I just found that if I go into the disk manager, select any hard drive and check its Properties -> Hardware, then select a Dell logical SCSI drive and look at Policies, there is a checkbox to enable write caching. This is not shown where I normally look under Device Manager -> Disk drives -> Properties -> Hardware etc. (see above comment). However, as both the slow and normal servers have the write cache disabled, I do not think that this is relevant. The RAID controller and physical drives will be caching whether this checkbox is set or not, and it would only affect write speed.
 
LVL 22

Expert Comment

by:Paka
The odd part is that the performance hit is affecting the USB device too. We'll need to look at something that USB and SCSI I/O have in common. This also indicates it is likely a Windows problem and not hardware (at least not the entire PERC to hard drive subsystem). Have you checked for shared interrupts?
 
LVL 30

Expert Comment

by:Duncan Meyers
200MB/sec is still nonsense - what you're seeing is write cache effects. My guess is that you have write-thru set on one PERC and write-back on the other.

Going back to your post about backup performance - that reads to me to be a performance differential of 3x. I suspect that the benchmarks are leading you up the garden path. Backup performance is all about reading from the server being backed up whereas you're chasing a write performance issue.

Can you test read performance and post the results please?

 

Author Comment

by:janglesea
OK, 200MB/sec is probably nonsense. If I could fix the 3x speed difference we would be happy. But it definitely isn't a PERC setting: moving the RAID drives to a normally operating server would have picked that up, and the problem moved with the RAID drives. When we put the RAID pair in an identical server and booted the slow Server 2003 installation we still had the problem.

I am going to test the read performance by copying my 10GB mixed test dataset locally and across the LAN to and from the slow and normal servers. I'll post the results later.
 
LVL 30

Expert Comment

by:Duncan Meyers
PERC configuration is carried on the drives themselves and will have been transferred to the spare server along with the drives (unless you recreated the config), so you may want to review the PERC configuration. Take a look at write-thru vs write-back and block size. Also, the state of the battery on the PERC will determine whether or not you can turn write-back on.
 
LVL 22

Expert Comment

by:Paka
Is the cluster option enabled on the PERC on the slow system?  I would try resetting the configuration on the PERC to see if it has any effect.  Have you checked to see if the SCSI bus is terminated?  

Can you temporarily pull the old drives and temporarily connect a new drive to the slow system and install a fresh copy of Windows to see if it is a Windows related issue?

As mentioned before, if your SCSI and USB busses are both affected by the slow performance, it's not likely to be a PERC infrastructure issue.
 

Author Comment

by:janglesea
I have taken a backup of the slow server with V2i Server Protect and restored it to an external SCSI HDD. I then connected that HDD on a normal working server through an Adaptec SCSI card (not RAID), booted up Windows and the problem was still there.

So I doubt that it has anything to do with the PERC card. The last suggestion from Microsoft was to defrag the drives, funnily enough. I am waiting to hear back from them with more suggestions.
 

Author Comment

by:janglesea
As promised, here are the throughput stats for local and across-the-LAN copies of a 10.7GB file set of real data, copied and pasted between local RAID drives on the RAID controller or across the LAN. The slow and normal servers are set up with identical hardware configurations. As usual, antivirus scanning of any kind was disabled and the drives were defragged. This took a while to set up, but we are sure that the figures are accurate representations of the differences in server drive throughput:

Normal server, 10.7GB local RAID drive to local RAID drive: 16 min = 684MB/min, 11.4MB/sec
Normal server to normal server across the LAN, 10.7GB: 280.9MB/min, 4.68MB/sec

Slow server, 10.7GB local RAID drive to local RAID drive: 34 min = 322MB/min, 5.37MB/sec
Slow server to normal server across the LAN, 10.7GB: 93 min = 117.8MB/min, 1.96MB/sec

So our slow server is about as slow copying data locally as a transfer between two normal servers across the LAN, i.e. it runs at about 50% of the speed it should, whether copying locally or providing data for delivery across the LAN. The latter confirms that this is a read problem, so it cannot be write-caching related.
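
For anyone wanting to check the arithmetic, the figures above can be reproduced from the dataset size and the copy durations. This is a small sketch in Python, assuming 1GB = 1024MB; the function names `throughput` and `relative_speed` are made up for illustration. The normal-to-normal LAN run is entered as the posted 280.9MB/min because no duration was given for it.

```python
def throughput(size_gb, minutes):
    """Return (MB per minute, MB per second), taking 1 GB as 1024 MB."""
    per_min = size_gb * 1024 / minutes
    return per_min, per_min / 60


def relative_speed(slow_rate, normal_rate):
    """Return the slow rate as a fraction of the normal rate."""
    return slow_rate / normal_rate


# Durations as posted in the thread (10.7GB dataset each time).
normal_local = throughput(10.7, 16)
slow_local = throughput(10.7, 34)
slow_lan = throughput(10.7, 93)
normal_lan_per_min = 280.9  # posted rate; the duration was not given

for name, (per_min, per_sec) in [
    ("normal local", normal_local),
    ("slow local", slow_local),
    ("slow to normal over LAN", slow_lan),
]:
    print(f"{name}: {per_min:.1f} MB/min, {per_sec:.2f} MB/sec")

print(f"local: slow is {relative_speed(slow_local[0], normal_local[0]):.0%} of normal")
print(f"LAN:   slow is {relative_speed(slow_lan[0], normal_lan_per_min):.0%} of normal")
```

Both ratios come out somewhat under one half, consistent with the "about 50%" estimate and with the 2-3x difference seen in the backup rates.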

It might be helpful to recap: of four identical servers, we have one that is showing slow read and write to and from local hard drives - RAID, USB and SCSI. Despite extensive monitoring and testing, including moving the boot mirrored pair of drives from the slow server to an identical chassis and taking an image of the slow server we have not identified the cause. Using the image completely eliminated any hardware effects but the slow speed is still apparent. Using the replicated test installation disconnected from the network, we have progressively demoted the slow server to a member server from a DC, un-installed all applications and updates back to base Windows. We tried un-installing SP2 for Server 2003. These latter measures were by way of last resort attempts to identify the cause, since the servers operating normally are fully patched Server 2003 with similar application sets. The slow speed is still apparent.

We have a case in progress with Microsoft to try to resolve this, but as yet the Microsoft techs are as baffled as we are. It is, as they say, a toughie.

Many thanks to those who have contributed their thoughts - keep the ideas coming!
 

Accepted Solution

by:janglesea
janglesea earned 0 total points
IT'S FIXED - verifier.exe was the culprit - but no thanks to Microsoft
Here is the final explanation of the problem and what we did to fix it. And further thanks to all those who contributed their suggestions but no points deserved as far as I can see....

Some time ago (31st August 2006, Microsoft Case Id - SRQ060831601320, addressed by Milind Bhavsar) we needed to resolve an issue where our server was becoming unresponsive, and a good deal of debugging was required to trace the problem to Office 2003 under Terminal Services: specifically, that we had accepted the default settings to install Office 2003, including the handwriting recognition and multi-language components. These, it would appear, are incompatible with Terminal Services....
 
During the SIX WEEKS it took to find the problem in 2006 we allowed the Microsoft engineer free access to the system on several occasions. It would appear that at some point he used the verifier utility to turn on driver verification for a number of drivers. This was never reset. I diagnosed the issue by visually comparing the registry of the slow server with that of a normal server. When I removed the entries (values) VerifyDriverLevel and VerifyDrivers from HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management, the problem was solved. I now find that had the Microsoft engineer bothered to issue the command verifier /reset at a CMD prompt, our protracted search for the cause of the present problem would not have been necessary. Our backup is now taking SEVEN HOURS less than before and the drive throughput as measured by Passmark has changed from 25MB/sec to 265MB/sec. I don't care if the stats are accurate - the effect has been exactly as expected - much faster throughput and better application response times.
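
The visual registry comparison described above could be partly automated by exporting the same key from both servers with regedit and diffing the value names. The sketch below, in Python, is only an illustration of that idea and not the procedure actually used: the function names `parse_reg_export` and `extra_values` are hypothetical, the parser handles only simple one-line `"name"=data` entries, and the sample export text (including the data assigned to the verifier values) is invented for the demonstration rather than taken from the servers.

```python
MM_KEY = (r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control"
          r"\Session Manager\Memory Management")


def parse_reg_export(text):
    """Parse regedit export text into {key path: {value name: raw data}}."""
    keys = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            keys.setdefault(current, {})
        elif current is not None and line.startswith('"') and '"=' in line:
            # Split "name"=data at the first closing quote followed by '='.
            name, _, data = line[1:].partition('"=')
            keys[current][name] = data
    return keys


def extra_values(suspect, reference, key):
    """Return value names found under key in suspect but not in reference."""
    return sorted(set(suspect.get(key, {})) - set(reference.get(key, {})))


# Invented sample exports, for illustration only.
HEADER = "Windows Registry Editor Version 5.00\n\n"
normal_text = HEADER + f'[{MM_KEY}]\n"ClearPageFileAtShutdown"=dword:00000000\n'
slow_text = normal_text + '"VerifyDrivers"="example.sys"\n"VerifyDriverLevel"=dword:00000001\n'

leftovers = extra_values(parse_reg_export(slow_text),
                         parse_reg_export(normal_text), MM_KEY)
print(leftovers)
```

Real regedit exports are usually saved as UTF-16, so they would need to be opened with that encoding before being passed to the parser. On a live system, `verifier /querysettings` should show whether Driver Verifier is configured, and `verifier /reset` (as mentioned above) clears it.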

This to me shows very poor practice by the Microsoft engineer - had I not had the patience to work through the slow server configuration and compare it to the normal servers we have in use then this could have taken a very long time to fix. We may have had to reinstall and reload our Windows Server 2003  installation which would have taken a long time to refine to the current level with inconvenience to all users. Yes, we have backups and disaster recovery images but the time since this problem was created renders those of that age rather limited in their value. I had no knowledge of the verifier utility prior to fixing this problem.

So thanks again to those who gave their time to assist with this and maybe the tale will help some other unlucky person who allows Microsoft to access their live server.
