How to troubleshoot a network for performance issues

Kenny537
Generally speaking, I want to come up with a high level checklist of things to check if there are issues on a network.  

What steps would I follow to troubleshoot network issues?

Thanks.
Top Expert 2014

Commented:
That is a BIG open question.

The first thing I do is try to isolate the area of the perceived network performance problem.

Is it a single server that people are using that seems to have the problem?

Is it a group of servers that may share the same switch?

Is it a single user?

Is it a group of users that share the same switch?

Can it be isolated to a specific link/port?

Once that is done, start looking at the area you have isolated the problem to. Verify that it is not a server/desktop performance issue.

Unless you have a small network, the "whole" network is typically not affected. If you have a fairly large network and it appears that everything is affected, then you start looking at the core of the network.

Author

Commented:
I will try to answer your questions, but first could you tell me if there is some utility that I can download to analyze network performance?
Top Expert 2014

Commented:
Are you having a problem right now that you are trying to figure out?

If your switches and routers have SNMP enabled, you can get MRTG (works best on Linux), which will poll them every 5 minutes and report the port/link utilization. Similar products, such as PRTG, may work better on Windows.
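
To make the port/link utilization figure concrete, here is a minimal sketch of the arithmetic a poller could do with two readings of an interface octet counter (for example the standard IF-MIB ifInOctets or ifOutOctets). It does not talk SNMP itself; the function name, the sample values, and the single-wrap assumption are all illustrative, not taken from MRTG.

```python
def link_utilization_percent(octets_start, octets_end, interval_seconds,
                             link_speed_bps, counter_bits=32):
    """Average utilization (%) of one direction of a link between two polls."""
    if interval_seconds <= 0 or link_speed_bps <= 0:
        raise ValueError("interval and link speed must be positive")
    delta = octets_end - octets_start
    if delta < 0:
        # The counter wrapped between polls (assumes at most one wrap).
        delta += 2 ** counter_bits
    bits_transferred = delta * 8
    return 100.0 * bits_transferred / (interval_seconds * link_speed_bps)


# Hypothetical samples taken 300 seconds apart on a 100 Mbit/s port.
print(link_utilization_percent(1_000_000, 376_000_000, 300, 100_000_000))
```

On fast links a 32-bit counter can wrap more than once in 5 minutes, which is why 64-bit counters are usually preferred where the device offers them.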

How many servers do you have?

How many switches?

Author

Commented:
Okay - I got some information for you.

This is supposed to be a very high level thing.  There are no issues at this point - it is more of a proactive approach for a plan of action.

It is not for any specific network, rather a general checklist.  It will be used as a troubleshooting guide as part of an operations document.

Whenever someone reports a network issue, this will be a checklist that we will run for troubleshooting .. as a starting point.

Does that help?
Top Expert 2014

Commented:
The generic questions I listed earlier are a good place to start.

Asking the user things like:

What are you doing?

What would you say your "normal" response time is?

What is your current response time?

Do you experience this slowness when you access "application X" or "server Y"?  (where application X and server Y are an application or a server different from the application/server they are reporting the problem against).

Is anybody else in your office/area experiencing the same symptoms doing the same thing?

Is anybody else in your office/area experiencing the same symptoms accessing a different application/server?

Then also track calls from other users who may be located in different areas/locations. By area I mean something like the south corner of the 4th floor vs. the north corner of the 3rd floor of the same building.

The questions above are meant to help isolate the problem, and they will also help point out non-network-related problems that show a common symptom.

We once had a problem where users were reporting performance issues spread across 8-10 servers (we have over 100 servers). The users were from all over our building, and the servers were spread across 4 different access switches. After getting our server group involved, we found that the servers all shared the same SAN. The problem was that one of the controllers on the SAN had died and somebody accidentally triggered a backup in the middle of the day.

By asking the above questions and knowing what was connected where and how, we were able to identify that the problem was not networking and turned it over to the server group.
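
The "knowing what was connected where" step can be sketched in code: given an inventory of what each server depends on, intersect the dependencies of the affected servers and see what they have in common. The inventory layout and every name below are hypothetical; a real inventory would come from your own documentation or CMDB.

```python
def common_dependencies(affected_servers, inventory):
    """Return the (kind, name) dependencies shared by every affected server."""
    common = None
    for server in affected_servers:
        deps = set(inventory[server].items())
        common = deps if common is None else common & deps
    return common or set()


# Hypothetical inventory: which access switch and storage each server uses.
INVENTORY = {
    "srv1": {"switch": "access-1", "storage": "san-a"},
    "srv2": {"switch": "access-2", "storage": "san-a"},
    "srv3": {"switch": "access-3", "storage": "san-a"},
    "srv4": {"switch": "access-1", "storage": "san-b"},
}

print(sorted(common_dependencies(["srv1", "srv2", "srv3"], INVENTORY)))
```

In this made-up example the affected servers sit on different switches but share one SAN, which points away from the network in the same way the story above did.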

Author

Commented:
Thanks for the helpful information.  
I think those questions are certainly relevant to ask the user, and they will help me in that respect.

I understand that you are saying you must first narrow down the problem by asking the user some questions in order to drill down.  But is there no generic overall checklist/tests/utilities to run to troubleshoot the network itself?
Top Expert 2014

Commented:
Define "the network".  

I'm not sure about your environment, but in our network we have two core L3 switches with over 100 ports each, 30 internal access switches (with 48 ports each) for our internal network, 10 blade chassis with two switches in each, 4 firewalls, 3 application load balancers, 2 edge switches that connect to 2 ISP-managed routers to the Internet, about 15 routers for customer-provided dedicated IP links into our network, two mainframes, and over 150 servers, each with at least two network connections. Overall, close to 2,000 network ports.

This is what I consider a small network.  There is no single test that you could do.

Now, there are products (we use SolarWinds Orion) that monitor various resources and send alerts when they notice something. It monitors ping times to all resources, link/port utilization on links/ports we consider critical, and CPU and memory utilization on resources we consider critical. However, all of this is done so we can isolate where to look first.

However, from a human point of view, the first thing is to isolate where you think the problem is. Proactive monitoring using something like SolarWinds and asking users the correct questions both help with this. If you don't do either, then you are stuck going to each device in your network and looking at its management interface to see if something looks wrong.
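
As a rough illustration of what threshold-based alerting boils down to, the sketch below compares collected samples against per-metric limits and lists the ones that exceed them. This is not how SolarWinds Orion works internally and it uses none of its API; the metric names, thresholds, and sample values are assumptions made up for the example.

```python
# Hypothetical limits; pick values that match your own baseline.
THRESHOLDS = {
    "ping_ms": 100.0,
    "link_util_pct": 80.0,
    "cpu_pct": 90.0,
    "mem_pct": 90.0,
}


def find_alerts(samples, thresholds=THRESHOLDS):
    """Return sorted (resource, metric, value, limit) tuples for exceeded limits."""
    alerts = []
    for resource, metrics in samples.items():
        for metric, value in metrics.items():
            limit = thresholds.get(metric)
            if limit is not None and value > limit:
                alerts.append((resource, metric, value, limit))
    return sorted(alerts)


# Hypothetical samples from one polling cycle.
samples = {
    "core-sw-1": {"ping_ms": 2.0, "cpu_pct": 35.0},
    "uplink-isp-1": {"link_util_pct": 93.5},
}
for alert in find_alerts(samples):
    print(alert)
```

The output only tells you where to look first, which is the point made above.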

Author

Commented:
Thanks - that helps a lot.

I received some feedback.  I don't think even the general network matters at all at this point.  This is still very preliminary, where I should be asking questions like the ones in your previous post: what the user is doing, response time, etc.

I think they want something very standardized - I believe they even mentioned something like what the Department of Defense does .. not really sure what that is all about.

It makes sense to start on the desktop side first, and then move on to the server side.  So along with those questions you mentioned, here are some other ones I can think of to start off:

"Is it a desktop, laptop, or a server?"
"Are the lights on?  Is the network powered?"
"Are the wireless radios on?"  or "Is the Ethernet cord plugged in securely?"
"Can you go to google.com?"  "Can you go to an intranet website?"  (internal vs external network problem)
"Can you do a speed/bandwidth test from CNET?"

Would these be classified as level 1 help scripts?
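
The internal vs. external question in the list above can also be turned into a small script for the help desk. This is only a sketch: it treats "can open a TCP connection on port 443" as "can go to the site", and the suggested next steps are rough guesses, not a diagnosis. The function names are made up for this example.

```python
import socket


def can_connect(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def classify_reachability(internal_ok, external_ok):
    """Map the two test results to a rough area to look at next."""
    if internal_ok and external_ok:
        return "both reachable: look at the specific application or server"
    if internal_ok:
        return "internal only: look at the Internet edge, firewall, or ISP"
    if external_ok:
        return "external only: look at the intranet server or internal DNS"
    return "neither reachable: look at the PC, cable/Wi-Fi, or access switch"
```

Usage would be something like classify_reachability(can_connect("intranet.example.com"), can_connect("www.google.com")), where intranet.example.com is a placeholder for one of your own internal sites.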

Can anyone think of any other questions I can ask?  Or link me to a good resource for this type of thing?

Thanks.
Steve Jennings, Sr Manager Cloud Networking Ops
Commented:
Yes, here's a generic checklist:

Verify physical
Verify data link
Verify network
Verify transport
Verify session
Verify presentation
Verify application

Start at the bottom of the protocol stack and work up!

Good luck,
Steve
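
For an operations document, the layer list can be written down as an ordered set of checks that stops at the first failure, reading the list from physical upward. The sketch below assumes you supply your own check functions (the lambdas are stand-ins), so it only shows the ordering, not how to test each layer.

```python
LAYERS = [
    "physical", "data link", "network", "transport",
    "session", "presentation", "application",
]


def first_failing_layer(checks):
    """Run the supplied checks bottom-up; return the first failing layer or None.

    checks maps a layer name to a zero-argument callable that returns True
    when that layer looks healthy. Layers without a check are skipped.
    """
    for layer in LAYERS:
        check = checks.get(layer)
        if check is None:
            continue
        if not check():
            return layer
    return None


# Hypothetical results: link is up, but the default gateway does not answer.
checks = {
    "physical": lambda: True,
    "data link": lambda: True,
    "network": lambda: False,
    "application": lambda: False,
}
print(first_failing_layer(checks))
```

Stopping at the lowest failing layer matters because everything above it will fail too, as the application check does in this example.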

Author

Commented:
So start at application?  You would not start at physical?

Thanks.

Author

Commented:
Also, does anyone have anything more specific?  I've been googling a lot but have failed to find anything.  Although I guess this probably isn't something I'd find on Google, and my best bet is to talk to network administrators/help desk people.
Top Expert 2014
Commented:
The problem is that you are asking for specific questions to troubleshoot a generic situation.  There are none.

You need to start with a generic description of the user's understanding/impression of what the problem is.

Typically this means:

1) What are you attempting to do?
2) What do you expect the results to be?
3) What are the results?

To go any further, you would need to understand what they are attempting to do and what it takes to perform that function.  The questions you ask about putting a complex formula into an Excel spreadsheet are different from the ones you ask about launching a .NET program.
Steve Jennings, Sr Manager Cloud Networking Ops

Commented:
Kenny537 . . . .

Sorry, unfortunate placement of the list and comment. "Start at the bottom . . . " not of my list but of the protocol stack, which, you are correct, is physical. Start at the top of my list . . . which is the bottom of the protocol stack.

Ha

Good luck,
SteveJ
noci, Software Engineer
Distinguished Expert 2018
Commented:
First, try to get a CLEAR description of the problem involved.
Then try to make a picture (mental, or a literal drawing on paper/whiteboard) of everything involved, yes, even the cables & their types.
The picture includes:
Applications (both the server and the client; if they are multiple processes, also their interconnections),
Protocol stacks (all layers),
Logical connections (routes through IP networks at Layer 3, spanning tree at Layer 2, etc.),
Physical connections (where each cable runs, interface types & models),
and even the state of the parts involved.
Besides this, collect a description of how the working parts function and in what ways they can fail.

Example:
Cable:
Function: conducts energy between a transmitter & receiver.
Ethernet application:
 * Coax, xBaseY (10Base5, 10Base2, etc.): there is one shared medium ==> half duplex.
 * UTP & STP have a separate conducting line from the transmitter at one end to the receiver at the other end.
     This allows full duplex: simultaneous transmission from both ends.
 * Optical fibre (conducts light, not current).
Failures: cables too long, too many connectors involved, broken cables; for xBaseY, distance between connections too short.
              Cables too short (optical fibre), which puts too much light on the receiver.
     - Auto-negotiation sometimes fails, as certain equipment assumes FD when not negotiated and other equipment assumes HD when not negotiated
       (especially on 10 & 100 Mbps connections).  This causes a communication breakdown under heavy traffic, while the link still works under light traffic conditions.
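
A quick way to catch the negotiation failure described above is to compare the operational speed and duplex reported by both ends of a link. The sketch below assumes you have already read those values from each device's management interface; the dictionary keys and port values are made up for the example.

```python
def link_setting_mismatches(port_a, port_b):
    """Compare the operational settings of both ends of one link.

    Each port is a dict with 'speed_mbps' and 'duplex' ('full' or 'half')
    as reported by the device, not as configured.
    """
    problems = []
    if port_a["speed_mbps"] != port_b["speed_mbps"]:
        problems.append("speed mismatch")
    if port_a["duplex"] != port_b["duplex"]:
        problems.append("duplex mismatch")
    return problems


# Hypothetical link: the switch port is forced to full duplex while the
# server NIC fell back to half duplex.
switch_port = {"speed_mbps": 100, "duplex": "full"}
server_nic = {"speed_mbps": 100, "duplex": "half"}
print(link_setting_mismatches(switch_port, server_nic))
```

Checking both ends matters because each side on its own will report a link that is up.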


Then you can eliminate from the picture everything that is definitely not involved (because of its state: disabled, turned off, blocked, bypassed).
Also be prepared for some messy details: your server has network performance trouble, only for you to find out that the disks are iSCSI and that during heavy traffic you actually provoke a lot of I/O. You may need to keep a high-level view available alongside all the gory details.

Then, by process of elimination, you can remove components & states that clearly work (because they are used by other parts that don't have problems),
or test that certain assumptions about a component are still valid (look at settings on ports...).

The place to actually start looking for the trouble should emerge from the picture (or what's left of it).
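
The elimination step can be sketched as plain set arithmetic: collect every component on the failing paths, then subtract everything that also sits on a path that works fine. All component names below are hypothetical, and the result is only a list of first suspects, since a shared component can still be partly faulty (one bad port on an otherwise healthy switch, for example).

```python
def remaining_suspects(failing_paths, working_paths):
    """Components seen on failing paths that no working path also uses."""
    suspects = set().union(*failing_paths)
    for path in working_paths:
        suspects -= set(path)
    return suspects


# Hypothetical picture: one failing path and one path that works fine.
failing = [["pc-17", "access-sw-3", "core-1", "app-server-2", "san-a"]]
working = [["pc-22", "access-sw-3", "core-1", "file-server-1"]]
print(sorted(remaining_suspects(failing, working)))
```

Here the access switch and the core drop out because a working path uses them, which leaves the PC, the application server, and the SAN as the places to start.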
