Solved

MTR packet loss analysis

Posted on 2013-02-03
7
1,227 Views
Last Modified: 2013-03-10
Hi,

After some insight into the following MTR trace (host names obscured to protected the guilty ISP)

HOST: servername                             Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. x.x.x.x                                  0.1%  1500    0.7   1.6   0.4 181.6  11.7
  2. some.router.provider.net                10.9%  1500    0.9   1.1   0.6 210.5   7.4
  3. target.server.com                        0.0%  1500    1.1   0.8   0.6   3.0   0.2

Open in new window


We have two servers with an ISP and occasionally have timeouts between them when requesting data from server-backend on server-frontend.

The application logs show all is fine with the backend server but the frontend logs show a timeout over the network connection within the timeframe of the mtr trace.

Anyway, we run mtr in continuous batches of 1500 packets to try and discover where the dropouts are occurring and managed to catch the above output.

To me, this indicates an issue with some.router.provider.net either with a fault or dropping mtr packets due to load.  Either way, its under load.

The ISP is saying this proves nothing because the last hop is showing no packet loss.

The question is, what is mtr actually showing here and is it useful or not in trying to determine why the end to end network timeout is happening?

Thanks
BT
0
Comment
Question by:brothertom
  • 4
  • 3
7 Comments
 
LVL 34

Expert Comment

by:Duncan Roe
ID: 38857034
It depends on network topology. Is some.router.provider.net within the ISP? Can you post a diagram? (Ascii art will do)
0
 

Author Comment

by:brothertom
ID: 38857078
yes, within ISP network, both servers being at same ISP but on different networks

server1 > isp router > server2
0
 
LVL 34

Expert Comment

by:Duncan Roe
ID: 38858436
So are you saying there is zero loss server1 <==> server2 but 10% loss server1 <==> router?
0
PRTG Network Monitor: Intuitive Network Monitoring

Network Monitoring is essential to ensure that computer systems and network devices are running. Use PRTG to monitor LANs, servers, websites, applications and devices, bandwidth, virtual environments, remote systems, IoT, and many more. PRTG is easy to set up & use.

 

Author Comment

by:brothertom
ID: 38858860
That is what it seems to be showing.
0
 
LVL 34

Assisted Solution

by:Duncan Roe
Duncan Roe earned 500 total points
ID: 38861502
I really don't think the tool is helping you a lot. End-to-end response is what counts, and MTR reports it is fine. But at the application level you are experiencing timeouts - that is what you said right?
You need to run tcpdump or your favorite tool to determine whether the timeouts are associated with tcp retries. If not, the problem is at a higher level
0
 

Author Comment

by:brothertom
ID: 38872517
We monitor (via Nginx logs) the time taken for each call to the backend.
Generally we're looking at 4-6ms but during the times when MTR is showing timeouts in the middle of the route, we either get 200-2000ms shown or complete failure.

This would appear to indicate that the slow/failure is due to congestion on the network and according to the MTR trace, this would also appear to be at this middle routing stage.

Although the tcpdump tool is a good idea, these timeouts only occur for a few minutes every 2/3 weeks, but tricky to capture, unless we are able to setup tcpdump to run continuously, but only save stuff that is taking a long time.  Sounds like this will load the machine up quite a bit.
0
 
LVL 34

Accepted Solution

by:
Duncan Roe earned 500 total points
ID: 38873726
How about running tcpdump -w output_file -C? (The file names are wrong way round for logrotate so you need to clean them up manually).
0

Featured Post

Control application downtime with dependency maps

Visualize the interdependencies between application components better with Applications Manager's automated application discovery and dependency mapping feature. Resolve performance issues faster by quickly isolating problematic components.

Join & Write a Comment

Transparency shows that a company is the kind of business that it wants people to think it is.
Quality of Service (QoS) options are nearly endless when it comes to networks today. This article is merely one example of how it can be handled in a hub-n-spoke design using a 3-tier configuration.
Get a first impression of how PRTG looks and learn how it works.   This video is a short introduction to PRTG, as an initial overview or as a quick start for new PRTG users.
This video gives you a great overview about bandwidth monitoring with SNMP and WMI with our network monitoring solution PRTG Network Monitor (https://www.paessler.com/prtg). If you're looking for how to monitor bandwidth using netflow or packet s…

743 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

13 Experts available now in Live!

Get 1:1 Help Now