Solved

MTR packet loss analysis

Posted on 2013-02-03
7
1,294 Views
Last Modified: 2013-03-10
Hi,

After some insight into the following MTR trace (host names obscured to protected the guilty ISP)

HOST: servername                             Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. x.x.x.x                                  0.1%  1500    0.7   1.6   0.4 181.6  11.7
  2. some.router.provider.net                10.9%  1500    0.9   1.1   0.6 210.5   7.4
  3. target.server.com                        0.0%  1500    1.1   0.8   0.6   3.0   0.2

Open in new window


We have two servers with an ISP and occasionally have timeouts between them when requesting data from server-backend on server-frontend.

The application logs show all is fine with the backend server but the frontend logs show a timeout over the network connection within the timeframe of the mtr trace.

Anyway, we run mtr in continuous batches of 1500 packets to try and discover where the dropouts are occurring and managed to catch the above output.

To me, this indicates an issue with some.router.provider.net either with a fault or dropping mtr packets due to load.  Either way, its under load.

The ISP is saying this proves nothing because the last hop is showing no packet loss.

The question is, what is mtr actually showing here and is it useful or not in trying to determine why the end to end network timeout is happening?

Thanks
BT
0
Comment
Question by:brothertom
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 4
  • 3
7 Comments
 
LVL 35

Expert Comment

by:Duncan Roe
ID: 38857034
It depends on network topology. Is some.router.provider.net within the ISP? Can you post a diagram? (Ascii art will do)
0
 

Author Comment

by:brothertom
ID: 38857078
yes, within ISP network, both servers being at same ISP but on different networks

server1 > isp router > server2
0
 
LVL 35

Expert Comment

by:Duncan Roe
ID: 38858436
So are you saying there is zero loss server1 <==> server2 but 10% loss server1 <==> router?
0
Ready to trade in that old firewall?

Whether you need to trade-up to a shiny new Firebox or just ready to upgrade from whatever appliance you're using now, WatchGuard has the right appliance for you! Find your perfect Firebox today with appliance sizing tool!

 

Author Comment

by:brothertom
ID: 38858860
That is what it seems to be showing.
0
 
LVL 35

Assisted Solution

by:Duncan Roe
Duncan Roe earned 500 total points
ID: 38861502
I really don't think the tool is helping you a lot. End-to-end response is what counts, and MTR reports it is fine. But at the application level you are experiencing timeouts - that is what you said right?
You need to run tcpdump or your favorite tool to determine whether the timeouts are associated with tcp retries. If not, the problem is at a higher level
0
 

Author Comment

by:brothertom
ID: 38872517
We monitor (via Nginx logs) the time taken for each call to the backend.
Generally we're looking at 4-6ms but during the times when MTR is showing timeouts in the middle of the route, we either get 200-2000ms shown or complete failure.

This would appear to indicate that the slow/failure is due to congestion on the network and according to the MTR trace, this would also appear to be at this middle routing stage.

Although the tcpdump tool is a good idea, these timeouts only occur for a few minutes every 2/3 weeks, but tricky to capture, unless we are able to setup tcpdump to run continuously, but only save stuff that is taking a long time.  Sounds like this will load the machine up quite a bit.
0
 
LVL 35

Accepted Solution

by:
Duncan Roe earned 500 total points
ID: 38873726
How about running tcpdump -w output_file -C? (The file names are wrong way round for logrotate so you need to clean them up manually).
0

Featured Post

Cloud Training Guides

FREE GUIDES: In-depth and hand-crafted Linux, AWS, OpenStack, DevOps, Azure, and Cloud training guides created by Linux Academy instructors and the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

This article is in response to a question (http://www.experts-exchange.com/Networking/Network_Management/Network_Analysis/Q_28230497.html) here at Experts Exchange. The Original Poster (OP) requires a utility that will accept a list of IP addresses …
PRTG Network Monitor lets you monitor your bandwidth usage, so you know who is using up your bandwidth, and what they're using it for.
Get a first impression of how PRTG looks and learn how it works.   This video is a short introduction to PRTG, as an initial overview or as a quick start for new PRTG users.
Michael from AdRem Software outlines event notifications and Automatic Corrective Actions in network monitoring. Automatic Corrective Actions are scripts, which can automatically run upon discovery of a certain undesirable condition in your network.…
Suggested Courses

630 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question