Solved

Intermittent Slow Performance Accessing File shares CGS/Windows Storage Server

Posted on 2006-06-30
Last Modified: 2013-11-15
Hi,

We have recently installed an 8-node stretch-cluster CGS (HP OEM Polyserve) file-serving environment.

There are four "live" nodes, one backup node and three additional standby nodes.

We have used mount points for the SAN-attached disks to get around the limit on the number of disks available.

Users access their shares via login scripts and are mapped to their business unit share.

What we are seeing is intermittent but fairly regular, and it can occur during any of the following (the client PCs run XP Home):

Browsing using explorer
Browsing from within the APP
Opening the file
Saving the file

The client just hangs with the egg timer. It can sometimes take up to a minute to open a file, and simply opening a folder to the next level can take 20 seconds.

We have our network guys monitoring a connection from client to host using VitalSuite, which gives a breakdown of client/network/server activity during a transaction. When it hangs, roughly 98% of the transaction shows as server activity.

I have run Ethereal, and during a hang nothing seems to stand out in the log/trace.

The performance at all other times is extremely fast.
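
One way to catch the intermittent hangs with timestamps is to time a directory listing on a mapped share at regular intervals and log only the slow samples. The following is a minimal sketch, not something used in this thread: the function names, the thresholds and the share path in the usage note are all hypothetical.

```python
import os
import time

def time_listing(path):
    """Time a single directory listing; return (elapsed_seconds, entry_count)."""
    start = time.perf_counter()
    entries = os.listdir(path)
    return time.perf_counter() - start, len(entries)

def probe(path, samples=10, interval_s=1.0, slow_s=2.0):
    """Repeat the listing and collect only the samples that took slow_s or longer.

    Each collected item is (wall_clock_timestamp, elapsed_seconds, entry_count),
    so slow samples can be lined up against server-side logs afterwards.
    """
    slow = []
    for _ in range(samples):
        elapsed, count = time_listing(path)
        if elapsed >= slow_s:
            slow.append((time.strftime("%Y-%m-%d %H:%M:%S"), elapsed, count))
        time.sleep(interval_s)
    return slow
```

For example, `probe(r"\\server\share\folder", samples=600, interval_s=5)` (placeholder path) would sample a folder every 5 seconds for about 50 minutes and return only the listings that took 2 seconds or more.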

The hosts are all dual-fiber SAN-attached Windows Storage Server 2003 machines accessing an EVA8000 (synchronous CA).

This is proving to be an extremely annoying and difficult problem to pin down and solve. If I had any hair I would be tearing it out!

The user base is 5000, with 53 mount points across those four servers equating to around 8TB.

Here's hoping someone can help.
Question by:eysanadmin
 
LVL 44

Expert Comment

by:scrathcyboy
ID: 17029420
You should not be running close to 98% on any server; this is obviously a lot of the problem.  Have you installed switches to redirect users to the appropriate area of the network, or are they all bottlenecking through one authentication server?  If so, that is the problem.  Decentralizing is the answer: keep that server activity below 70% for anything more than 1-5 seconds.
 

Author Comment

by:eysanadmin
ID: 17030115
Hi, the server is not at 98% utilization. It is simply that when using VitalSuite to view the transaction of opening a file from client to host, it shows that 98% of the transaction time is spent on the host/server side, not the client.

So if opening a file takes 45 seconds or so, most of that is happening on the server, yet the server itself shows low utilisation when viewed through Task Manager.

 
LVL 44

Expert Comment

by:scrathcyboy
ID: 17033139
OK, that is more like it.  I think you have a server traffic bottleneck, as intimated by --
"Users access their shares via login scripts and are mapped to their business unit share"

All users are going through one authentication server, correct?  So even when they are farmed off to their business unit share, they are still routing through the AUTH server.  Can you compare the CPU utilization on this server with that of the server hosting the users' data on the SAN array at the same time (i.e. when the slowdown occurs)?  If so, I think you will find the AUTH server is the bottleneck, not the SAN delivery to the host account.

If you switched segments of your population through different servers, you could avoid this issue.  Is that possible in your topology?
 

Author Comment

by:eysanadmin
ID: 17048998
We apparently have 5 AD servers that users are directed to within our environment. My understanding is that once the user has logged on, the authentication is done.

The delays can happen at any time while users are connected. Here is a trace excerpt captured during one of these "hangs"; as you can see, all I am doing is opening a folder within a mapped share.



No.     Time        Source                Destination           Protocol Info
    343 14.824508   10.64.164.73          10.64.211.28          SMB      Trans2 Request, QUERY_PATH_INFO, Query File Basic Info, Path: \Data\GFIS\GL_Files
    344 14.907006   10.64.211.28          10.64.164.73          TCP      microsoft-ds > 3030 [ACK] Seq=41701 Ack=11625 Win=65417 Len=0
    386 24.544598   10.64.211.28          10.64.164.73          SMB      Trans2 Response, QUERY_PATH_INFO
    387 24.572290   10.64.164.73          10.64.211.28          SMB      Trans2 Request, FIND_FIRST2, Pattern: \Data\GFIS\GL_Files\*
    388 24.599779   10.64.211.28          10.64.164.73          SMB      Trans2 Response, FIND_FIRST2, Files: . .. GLTR_GLOBE_GB_02_01_20040826.DAT.gz GLTR_GLOBE_GB_03_01_20040826.DAT.gz GLTR_GLOBE_GB_01_01_20040825.DAT.gz GLTR_GLOBE_GB_02_01_20040825.DAT.gz GLTR_GLOBE_GB_03_01_20040825.DAT.gz GLTR_GLOBE_GB_01_01_20040826.DAT.gz


As you can see, after an immediate ACK from the server back to the client, there is a delay of almost 10 seconds before the server sends its response back to the client.
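
The stall in a trace like this can be picked out mechanically by diffing the Time column of consecutive packets. Below is a minimal sketch that assumes a plain-text packet summary export in the column layout shown above. The function names and the 1-second threshold are illustrative choices, and only the first four frames are reproduced to keep the sample short.

```python
import re

# Matches the leading columns (No., Time, Source, Destination, Protocol, Info)
# of a text-exported packet summary line. The header line does not match.
LINE_RE = re.compile(
    r"^\s*(\d+)\s+(\d+\.\d+)\s+(\d+\.\d+\.\d+\.\d+)\s+"
    r"(\d+\.\d+\.\d+\.\d+)\s+(\S+)\s+(.*)$"
)

def parse_summary(text):
    """Return a list of (frame_no, time_s, src, dst, protocol, info) tuples."""
    packets = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            no, t, src, dst, proto, info = m.groups()
            packets.append((int(no), float(t), src, dst, proto, info))
    return packets

def find_gaps(packets, threshold_s=1.0):
    """Return (prev_frame, next_frame, gap_s) for consecutive listed packets
    whose timestamps differ by more than threshold_s."""
    gaps = []
    for prev, cur in zip(packets, packets[1:]):
        gap = cur[1] - prev[1]
        if gap > threshold_s:
            gaps.append((prev[0], cur[0], gap))
    return gaps

# Frames 343-387 copied from the excerpt above.
TRACE = r"""
No.     Time        Source                Destination           Protocol Info
    343 14.824508   10.64.164.73          10.64.211.28          SMB      Trans2 Request, QUERY_PATH_INFO, Query File Basic Info, Path: \Data\GFIS\GL_Files
    344 14.907006   10.64.211.28          10.64.164.73          TCP      microsoft-ds > 3030 [ACK] Seq=41701 Ack=11625 Win=65417 Len=0
    386 24.544598   10.64.211.28          10.64.164.73          SMB      Trans2 Response, QUERY_PATH_INFO
    387 24.572290   10.64.164.73          10.64.211.28          SMB      Trans2 Request, FIND_FIRST2, Pattern: \Data\GFIS\GL_Files\*
"""

if __name__ == "__main__":
    for prev_no, next_no, gap in find_gaps(parse_summary(TRACE)):
        # prints: frames 344 -> 386: 9.638 s
        print(f"frames {prev_no} -> {next_no}: {gap:.3f} s")
```

Because the gap falls between the server's own ACK (frame 344) and its SMB response (frame 386), the wait is on the server side of the conversation, which agrees with the VitalSuite breakdown.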


 
LVL 44

Expert Comment

by:scrathcyboy
ID: 17049071
Yes, I see that.  From 14.9 to 24.5 is a HUGE delay.  This is just to find the user's path to his home dir or a shared path on (presumably) another server linked to the SAN.  So there are 2 possibilities here --

(1) the server you are going through is getting a 10 second delay from the other AD server where the host file path resides -- in which case, you need to investigate the switches or the AD response from this server to the other one, and test them for response time or latency.  I assume your network cables are top line, so I suspect this is less likely to be the problem.

(2) the server through which the request is going has maxed out, and is peaking at 100% usage for those 10 seconds.  In this regard, the assumption "once the user has logged on the authentication is done" is correct as far as authentication goes, but his traffic is still going through this server to get to the rest of the network, and if 5000 people's traffic is still going through that same server, it will create a bottleneck.

Let me ask another way: what system do you have in place, be it a switch or an AD server, to route each person's requests directly to the server where their account and files lie, after authentication?  You can test this with a traceroute.  Say that user 1 needs files from the SAN served by server 5.  If you go to user 1's computer and do a tracert to the IP address of server 5, where his files are accessed, do you see the IP of server 1 (the AUTH server) in the tracert?  If so, the requests for all packets are still going through the original authentication server, even though authentication is done.
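
The tracert check described above can be scripted so it is easy to repeat from several client PCs. This is a rough sketch only: the helper names are made up for illustration, and the only tool behaviour it relies on is that Windows `tracert -d` prints hop addresses without resolving them to host names.

```python
import re
import subprocess

IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def hop_addresses(tracert_output):
    """Extract every IPv4 address from traceroute text, in order of appearance."""
    return IPV4_RE.findall(tracert_output)

def passes_through(tracert_output, suspect_ip, target_ip):
    """True if suspect_ip appears as a hop on the way to target_ip.

    The target address is excluded because tracert also prints it in its
    banner line and as the final hop.
    """
    hops = [ip for ip in hop_addresses(tracert_output) if ip != target_ip]
    return suspect_ip in hops

def run_tracert(target_ip):
    """Run the Windows tracert command; -d skips DNS name resolution."""
    result = subprocess.run(
        ["tracert", "-d", target_ip], capture_output=True, text=True
    )
    return result.stdout
```

On a client PC, `passes_through(run_tracert(file_server_ip), auth_server_ip, file_server_ip)` would answer the question, with both addresses supplied by you. Keep in mind that tracert only lists devices that forward packets at the IP layer, so a server shows up only if it is actually routing the traffic.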

I hope this makes sense and gives you an idea of what to look for.  In debugging bottlenecks, you need to get a grasp on the overall traffic flow on the network, and this is best done by someone who can take the time to debug where the traffic is going, and assess the volume load per server on the network.
 
LVL 44

Expert Comment

by:scrathcyboy
ID: 17597171
I am surprised the questioner didn't close this.  It is worth keeping in the EE DB, don't you think?
 

Author Comment

by:eysanadmin
ID: 17599116
Hi, sorry for not coming back sooner.

It turned out to be the DLM (cluster communication) network. Some startup parameters were built in to throttle back performance for slower networks; once we had those changed to allow maximum throughput, this fixed the issue with the cluster.

This was all based on a Polyserve product and was in the end resolved by them.

Thanks for all the help.
 

Accepted Solution

by:
CetusMOD earned 0 total points
ID: 17682062
PAQed with points refunded (500)

CetusMOD
Community Support Moderator