• Status: Solved
  • Priority: Medium
  • Security: Public

MTR packet loss analysis

Hi,

I'm after some insight into the following MTR trace (host names obscured to protect the guilty ISP):

HOST: servername                             Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. x.x.x.x                                  0.1%  1500    0.7   1.6   0.4 181.6  11.7
  2. some.router.provider.net                10.9%  1500    0.9   1.1   0.6 210.5   7.4
  3. target.server.com                        0.0%  1500    1.1   0.8   0.6   3.0   0.2


We have two servers hosted with the same ISP and occasionally see timeouts between them when server-frontend requests data from server-backend.

The application logs show all is fine with the backend server but the frontend logs show a timeout over the network connection within the timeframe of the mtr trace.

Anyway, we run mtr in continuous batches of 1500 packets to try to discover where the dropouts are occurring, and we managed to catch the above output.

To me, this indicates an issue with some.router.provider.net: either it has a fault or it is dropping mtr packets due to load. Either way, it's under load.

The ISP is saying this proves nothing because the last hop is showing no packet loss.

The question is, what is mtr actually showing here and is it useful or not in trying to determine why the end to end network timeout is happening?

Thanks
BT
Asked by: brothertom
 
Duncan RoeSoftware DeveloperCommented:
It depends on the network topology. Is some.router.provider.net within the ISP? Can you post a diagram? (ASCII art will do.)
 
brothertomAuthor Commented:
Yes, it is within the ISP's network. Both servers are at the same ISP but on different networks:

server1 > isp router > server2
 
Duncan RoeSoftware DeveloperCommented:
So are you saying there is zero loss server1 <==> server2 but 10% loss server1 <==> router?
 
brothertomAuthor Commented:
That is what it seems to be showing.
 
Duncan RoeSoftware DeveloperCommented:
I really don't think the tool is helping you a lot. End-to-end response is what counts, and MTR reports it is fine. But at the application level you are experiencing timeouts; that is what you said, right?
You need to run tcpdump or your favorite tool to determine whether the timeouts are associated with TCP retries. If not, the problem is at a higher level.
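
To make the "end-to-end is what counts" reading concrete, here is a minimal Python sketch that parses an mtr report like the one in the question and labels each hop by whether its loss also shows up at the target. The function names, the label strings, and the 0.5 percentage point tolerance are assumptions made for illustration, not part of mtr. Loss that appears at a middle hop but not at the target usually means that hop is dropping or rate-limiting its own replies to the probes rather than dropping forwarded traffic, so the label is a hint, not proof.

```python
import re

# One data row of an mtr report, e.g. "  2. some.router   10.9%  1500  0.9 ..."
HOP_RE = re.compile(r"^\s*(\d+)\.\s+(\S+)\s+([\d.]+)%\s+(\d+)")

REPORT = """\
HOST: servername                             Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. x.x.x.x                                  0.1%  1500    0.7   1.6   0.4 181.6  11.7
  2. some.router.provider.net                10.9%  1500    0.9   1.1   0.6 210.5   7.4
  3. target.server.com                        0.0%  1500    1.1   0.8   0.6   3.0   0.2
"""


def parse_mtr_report(text):
    """Return a list of (hop_number, host, loss_percent, sent) tuples."""
    hops = []
    for line in text.splitlines():
        match = HOP_RE.match(line)
        if match:
            hops.append((int(match.group(1)), match.group(2),
                         float(match.group(3)), int(match.group(4))))
    return hops


def classify_hops(hops, tolerance=0.5):
    """Label each hop by comparing its loss with the loss at the final hop.

    tolerance is a hypothetical cut-off in percentage points below which
    loss is treated as noise.
    """
    if not hops:
        return {}
    final_loss = hops[-1][2]
    labels = {}
    for _number, host, loss, _sent in hops:
        if loss <= tolerance:
            labels[host] = "ok"
        elif final_loss > tolerance:
            # The target also loses probes, so forwarded traffic may be affected.
            labels[host] = "loss persists to target"
        else:
            # Probes addressed to this hop go unanswered, but probes passing
            # through it to the target are answered.
            labels[host] = "reply loss only"
    return labels


if __name__ == "__main__":
    for host, label in classify_hops(parse_mtr_report(REPORT)).items():
        print(f"{host}: {label}")
```

This only summarizes ICMP probe results over the whole run, so it cannot rule out short bursts that affect the application's TCP connections, which is why a packet capture is still the next step.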
 
brothertomAuthor Commented:
We monitor (via Nginx logs) the time taken for each call to the backend.
Generally we're looking at 4-6ms but during the times when MTR is showing timeouts in the middle of the route, we either get 200-2000ms shown or complete failure.
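
A minimal Python sketch of how such slow or failed backend calls could be pulled out of the log, so their timestamps can be lined up with the mtr batches. It assumes a hypothetical log_format in which each line carries a field written as `urt=$upstream_response_time`; nginx reports that variable in seconds and writes `-` when no upstream response was received. The field name `urt`, the function name, and the sample lines are made up for illustration, and the 200 ms default comes from the figures above.

```python
import re

# Matches the hypothetical "urt=<seconds>" field, or "urt=-" for no response.
URT_RE = re.compile(r"\burt=(-|\d+(?:\.\d+)?)")


def slow_requests(lines, threshold_ms=200.0):
    """Yield (line_number, millis) for slow calls; millis is None on failure."""
    for number, line in enumerate(lines, start=1):
        match = URT_RE.search(line)
        if not match:
            continue  # line does not carry the assumed field
        value = match.group(1)
        if value == "-":
            yield number, None
            continue
        millis = float(value) * 1000.0
        if millis >= threshold_ms:
            yield number, millis


if __name__ == "__main__":
    sample = [
        "GET /api/items 200 urt=0.005",
        "GET /api/items 200 urt=1.25",
        "GET /api/items 504 urt=-",
    ]
    for number, millis in slow_requests(sample):
        print(number, "no upstream response" if millis is None else f"{millis:.0f} ms")
```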

This would appear to indicate that the slowness or failure is due to congestion on the network and, according to the MTR trace, that the congestion is at this middle routing stage.

Although the tcpdump tool is a good idea, these timeouts only occur for a few minutes every 2-3 weeks, so they are tricky to capture unless we can set up tcpdump to run continuously but only save traffic that is taking a long time. That sounds like it will load the machine up quite a bit.
 
Duncan RoeSoftware DeveloperCommented:
How about running tcpdump -w output_file -C file_size, where file_size is the rotation threshold in millions of bytes? (The rotated file names are the wrong way round for logrotate, so you need to clean them up manually.)
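
One way to keep a continuous capture cheap is to pair -C with -W, which caps the number of files and overwrites the oldest, and to use -s to keep only the first bytes of each packet. The Python sketch below just builds such a command line. The function name, the interface name, the sizes, and the host filter are placeholder assumptions to adjust for the real setup, and tcpdump normally needs root to run.

```python
import shlex


def ring_capture_command(peer_host, out_file="backend.pcap", interface="eth0",
                         file_size_mb=100, file_count=20, snaplen=128):
    """Build a tcpdump argument list for a bounded, rotating capture.

    All defaults are illustrative placeholders. Disk use is capped at
    roughly file_size_mb * file_count million bytes.
    """
    return [
        "tcpdump",
        "-n",                      # do not resolve addresses to names
        "-i", interface,           # capture interface
        "-s", str(snaplen),        # keep only the first bytes of each packet
        "-C", str(file_size_mb),   # start a new file after this many million bytes
        "-W", str(file_count),     # keep this many files, then overwrite the oldest
        "-w", out_file,            # write raw packets to file
        "host", peer_host, "and", "tcp",  # only TCP to or from the other server
    ]


if __name__ == "__main__":
    # Print the command rather than running it; start it with
    # subprocess.run(cmd, check=True) once the values are right.
    cmd = ring_capture_command("target.server.com")
    print(shlex.join(cmd))
```

When a timeout is noticed, the files covering that time window can be copied aside before the ring overwrites them and then inspected for TCP retransmissions.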