Intermittent Slow Performance Accessing File shares CGS/Windows Storage Server


We have recently installed an 8-node stretch-cluster CGS (HP OEM PolyServe) file-serving environment.

There are four "live" nodes, one backup node and three additional standby nodes.

We have used mount points for the SAN-attached disks to get around the limit on the number of disks available.

Users access their shares via login scripts and are mapped to their business unit share.

What we are seeing is intermittent but fairly regular. The client PCs run XP Home, and the hang can occur during any of the following:

Browsing using explorer
Browsing from within the APP
Opening the file
Saving the file

The client just hangs with the egg timer. It can sometimes take up to a minute to open a file, and simply opening a folder to the next level can take 20 seconds.

We have our network guys monitoring a connection from client to host using VitalSuite, which gives a breakdown of client/network/server activity during a transaction. What we are seeing when it hangs is that roughly 98% of the activity is on the server.

I have run Ethereal, and during a hang nothing seems to stand out in the log/trace.

The performance at all other times is extremely fast.

The hosts are all Windows Storage Server 2003 machines, dual fibre SAN attached, accessing an EVA8000 (synchronous CA).

This is proving to be an extremely annoying problem that is difficult to pin down and solve. If I had any hair I would be tearing it out!

The user base is 5000, with 53 mount points across those four servers, equating to around 8 TB.

Here's hoping someone can help.
CetusMOD Commented:
PAQed with points refunded (500)

Community Support Moderator
You should not be running close to 98% on any server; that is obviously a large part of the problem.  Have you installed switches to redirect users to the appropriate area of the network, or are they all bottlenecking through one authentication server?  If so, that is the problem.  Decentralizing is the answer -- and keep that server activity below 70% for anything longer than 1-5 seconds.
eysanadminAuthor Commented:
Hi, the server is not at 98% utilization. It is simply that when we use VitalSuite to view the transaction of opening a file from client to host, it shows that 98% of the transaction time is spent on the host/server side, not the client.

So if opening a file takes 45 seconds or so, most of that time is spent on the server, yet the server itself shows low utilisation when viewed through Task Manager.
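
To put a rough number on that split, here is a minimal Python sketch. It only uses the approximate figures quoted above (a 45-second open, 98% attributed to the server); the function name is made up for illustration and has nothing to do with VitalSuite itself.

```python
def split_transaction(total_seconds, server_share):
    """Split a transaction's elapsed time into server and non-server parts.

    server_share is the fraction of elapsed time attributed to the server
    (0.98 for the "98%" figure quoted in the thread).
    """
    server = total_seconds * server_share
    return server, total_seconds - server


# Approximate figures from the thread: a 45 s file open, 98% on the server.
server_s, other_s = split_transaction(45.0, 0.98)
print(f"server: {server_s:.1f} s, client + network: {other_s:.1f} s")
```

In other words, about 44 of the 45 seconds are spent waiting on the server side, which is a statement about elapsed time and not about CPU load.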


OK, that is more like it.  I think you have a server traffic bottleneck, as intimated by --
"Users access their shares via login scripts and are mapped to their business unit share"

All users are going through one authentication server, correct?  So even when they are farmed off to their business unit share, they are still routing through the AUTH server.  Can you compare the CPU utilization on this server with that of the server hosting the users' data on the SAN array at the same moment (i.e. when the slowdown occurs)?  If so, I think you will find the AUTH server is the bottleneck, not the SAN delivery to the host account.

If you switched segments of your population through different servers, you can avoid this issue.  Is that possible in your topology?
eysanadminAuthor Commented:
We apparently have 5 AD servers that users are directed to within our environment. My understanding is that once the user has logged on, the authentication is done.

The delays can occur at any time while users are connected. Here is a trace excerpt captured during one of these "hangs"; as you can see, all I am doing is opening a folder within a mapped share.

No.     Time        Source                Destination           Protocol Info
    343 14.824508          SMB      Trans2 Request, QUERY_PATH_INFO, Query File Basic Info, Path: \Data\GFIS\GL_Files
    344 14.907006          TCP      microsoft-ds > 3030 [ACK] Seq=41701 Ack=11625 Win=65417 Len=0
    386 24.544598          SMB      Trans2 Response, QUERY_PATH_INFO
    387 24.572290          SMB      Trans2 Request, FIND_FIRST2, Pattern: \Data\GFIS\GL_Files\*
    388 24.599779          SMB      Trans2 Response, FIND_FIRST2, Files: . .. GLTR_GLOBE_GB_02_01_20040826.DAT.gz GLTR_GLOBE_GB_03_01_20040826.DAT.gz GLTR_GLOBE_GB_01_01_20040825.DAT.gz GLTR_GLOBE_GB_02_01_20040825.DAT.gz GLTR_GLOBE_GB_03_01_20040825.DAT.gz GLTR_GLOBE_GB_01_01_20040826.DAT.gz

As you can see, after an immediate ACK from the server back to the client, there is a delay of roughly 10 seconds before the server sends its response back to the client.
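
For anyone wanting to measure this rather than eyeball it, here is a rough Python sketch that pairs each SMB Trans2 request with its response and prints the gap. The frame numbers and timestamps are copied from the excerpt above (the address columns are empty there, so they are omitted, and the long file list is left out). Pairing by command name is a simplification I am assuming for this short excerpt; a real capture would need to match on the SMB multiplex ID.

```python
import re

# Frames copied from the excerpt above (addresses were not in the excerpt).
TRACE = """
343 14.824508 SMB Trans2 Request, QUERY_PATH_INFO, Query File Basic Info, Path: \\Data\\GFIS\\GL_Files
344 14.907006 TCP microsoft-ds > 3030 [ACK] Seq=41701 Ack=11625 Win=65417 Len=0
386 24.544598 SMB Trans2 Response, QUERY_PATH_INFO
387 24.572290 SMB Trans2 Request, FIND_FIRST2, Pattern: \\Data\\GFIS\\GL_Files\\*
388 24.599779 SMB Trans2 Response, FIND_FIRST2
"""

# frame number, timestamp, Request/Response, Trans2 subcommand
LINE = re.compile(r"^\s*(\d+)\s+(\d+\.\d+)\s+SMB\s+Trans2 (Request|Response), (\w+)")


def smb_latencies(trace):
    """Return (command, request frame, response frame, seconds) tuples."""
    pending = {}  # command name -> (frame number, request time)
    results = []
    for line in trace.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # skip TCP ACKs and anything that is not a Trans2 line
        frame, t, kind, cmd = int(m[1]), float(m[2]), m[3], m[4]
        if kind == "Request":
            pending[cmd] = (frame, t)
        elif cmd in pending:
            req_frame, req_t = pending.pop(cmd)
            results.append((cmd, req_frame, frame, t - req_t))
    return results


for cmd, req, resp, secs in smb_latencies(TRACE):
    print(f"{cmd}: frame {req} -> {resp}, {secs:.3f} s")
```

On this excerpt the QUERY_PATH_INFO round trip comes out at about 9.7 seconds, while the FIND_FIRST2 that follows completes in under 30 milliseconds, which matches the pattern of a stall followed by normal speed.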

Yes, I see that.  From 14.9 to 24.5 is a HUGE delay.  This is just to find the user's path to his home dir or a shared path on (presumably) another server linked to the SAN.  So there are 2 possibilities here --

(1) The server you are going through is getting a 10-second delay from the other AD server where the host file path resides -- in which case you need to investigate the switches or the AD response between this server and the other one, and test them for response time or latency.  I assume your network cables are top of the line, so I suspect this is less likely to be the problem.

(2) The server through which the request is going has maxed out and is peaking at 100% usage for those 10 seconds.  In this regard, the assumption "once the user has logged on the authentication is done" is correct as far as authentication goes, but his traffic is still going through this server to get to the rest of the network, and if 5000 people's traffic is still going through that same server, it will create a bottleneck.

Let me ask another way: what system do you have in place, be it a switch or an AD server, to route each person's requests directly to the server where their account and files lie, after authentication?  You can test this with a traceroute.  Say that user 1 needs files from the SAN served by server 5.  If you go to user 1's computer and do a tracert to the IP address of server 5, where his files are accessed, do you see the IP of server 1 (the AUTH server) in the tracert?  If so, all packets are still going through the original authentication server, even though authentication is done.
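
As a rough illustration of that check, here is a small Python sketch that takes the text output of a Windows tracert run and reports whether a given IP shows up as an intermediate hop. The sample output and all addresses in it are hypothetical placeholders (documentation-range IPs), not values from this environment; paste in your own tracert output and the AUTH server's address.

```python
import re

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def hops_from_tracert(output):
    """Return the IPv4 address of each numbered hop line, in order."""
    hops = []
    for line in output.splitlines():
        if not re.match(r"^\s*\d+\s", line):
            continue  # header and footer lines do not start with a hop number
        found = IPV4.findall(line)
        if found:  # timed-out hops ("* * *") carry no address
            hops.append(found[-1])
    return hops


def passes_through(output, suspect_ip):
    """True if suspect_ip appears as a hop before the final destination."""
    return suspect_ip in hops_from_tracert(output)[:-1]


# Hypothetical tracert output; every address here is a placeholder.
SAMPLE = """
Tracing route to 192.0.2.50 over a maximum of 30 hops

  1    <1 ms    <1 ms    <1 ms  192.0.2.1
  2     1 ms     1 ms     1 ms  192.0.2.50

Trace complete.
"""

AUTH_SERVER = "192.0.2.10"  # hypothetical address of the AUTH server
print(hops_from_tracert(SAMPLE))
print("routes via AUTH server:", passes_through(SAMPLE, AUTH_SERVER))
```

Bear in mind that tracert only shows routed (layer 3) hops, so this tells you whether the server is acting as a router on the path, nothing more.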

I hope this makes sense and gives you an idea of what to look for.  In debugging bottlenecks, you need to get a grasp on the overall traffic flow on the network, and this is best done by someone who can take the time to debug where the traffic is going, and assess the volume load per server on the network.
I am surprised the questioner didn't close this.  It is worth keeping in the EE database, don't you think?
eysanadminAuthor Commented:
Hi, sorry for not coming back sooner.

It turned out to be the DLM (cluster communication) network. There were some startup parameters built in to throttle back performance for slower networks. Once we had those changed to allow maximum throughput, the issue with the cluster was fixed.

This was all based on a PolyServe product and was in the end resolved by them.

Thanks for all the help.
Question has a verified solution.
