Re: [PLUG] Basic network monitoring and link quality software

Hey, this is very interesting!

Here's the output for the first run:

mtr --report -c3 www.google.com
Start: Sat Jul 11 11:31:25 2015
HOST: saturn                       Loss%  Snt   Last    Avg   Best   Wrst  StDev
  1.|-- 10.10.10.1                  0.0%    3    1.2    2.0    1.2    3.8    1.4
  2.|-- L100.PHLAPA-VFTTP-71.veri   0.0%    3    6.3   33.9    6.3   87.2   46.2
  3.|-- G102-0-0-16.PHLAPA-LCR-21   0.0%    3    9.5   28.5    9.5   55.7   24.2
  4.|-- ae5-0.PHIL-BB-RTR1.verizo   0.0%    3   57.6   28.0    9.2   57.6   25.9
  5.|-- 0.xe-7-0-2.XL1.IAD8.ALTER   0.0%    3   47.9   25.9   12.5   47.9   19.2
  6.|-- 0.xe-8-2-0.GW9.IAD8.ALTER   0.0%    3   10.7   55.5   10.7  143.9   76.6
  7.|-- google-gw.customer.alter.   0.0%    3   51.3   81.0   47.0  144.7   55.2
  8.|-- 209.85.252.46               0.0%    3   69.1  110.3   69.1  182.8   63.0
  9.|-- 209.85.143.112              0.0%    3   17.2   45.1   16.5  101.5   48.9
 10.|-- 216.239.40.209              0.0%    3   23.0   23.9   16.6   32.1    7.7
 11.|-- 72.14.236.227               0.0%    3   98.2   89.1   25.7  143.5   59.4
 12.|-- 209.85.250.7                0.0%    3   26.3   33.1   26.3   45.6   10.8
 13.|-- yyz08s14-in-f19.1e100.net  33.3%    3  115.2   71.7   28.2  115.2   61.5

about 3 minutes later...

Start: Sat Jul 11 11:34:30 2015
HOST: saturn                       Loss%  Snt   Last    Avg   Best   Wrst  StDev
  1.|-- 10.10.10.1                  0.0%    3    1.4    1.4    1.2    1.6    0.0
  2.|-- L100.PHLAPA-VFTTP-71.veri   0.0%    3    4.8    5.8    4.8    6.4    0.7
  3.|-- G0-9-3-2.PHLAPA-LCR-22.ve   0.0%    3   10.3   11.1   10.2   12.9    1.4
  4.|-- ae6-0.PHIL-BB-RTR2.verizo   0.0%    3    5.9   33.5    5.9   85.5   45.1
  5.|-- 0.xe-11-1-1.XL2.IAD8.ALTE   0.0%    3   13.4   12.7   11.4   13.4    0.7
  6.|-- 0.xe-9-1-0.GW9.IAD8.ALTER   0.0%    3   13.8   11.5   10.0   13.8    1.9
  7.|-- google-gw.customer.alter.   0.0%    3   19.0   19.4   15.3   23.8    4.2
  8.|-- 209.85.252.80               0.0%    3   14.6   15.5   12.5   19.4    3.5
  9.|-- 72.14.236.152               0.0%    3   14.1   13.3   12.8   14.1    0.7
 10.|-- 216.239.40.159              0.0%    3   16.2   16.0   15.3   16.4    0.0
 11.|-- 72.14.236.225               0.0%    3   28.1   30.0   27.2   34.9    4.1
 12.|-- 72.14.239.19                0.0%    3   28.4   27.9   27.2   28.4    0.0
 13.|-- yyz08s09-in-f18.1e100.net   0.0%    3   31.5   29.7   27.0   31.5    2.2

So, in analyzing my Verizon FiOS, should I only be concerned about rows #2, #3, and #4?
Row 4 in the second run seems suspicious (85.5 for worst?)

A couple of things to be cautious about when looking at mtr, ping, or traceroute numbers.

First, the mtr command you’re running only sends 3 packets, so it may not give you a great idea of what’s going on with a sustained flow of data. At work we like to beat the crap out of things using mtr by simulating something “like” a voice RTP flow, e.g.:

mtr -s 200 -i 0.020 -c 3000 --report $destination

That will send a ton of 200-byte test packets at 20ms intervals, which is close to what RTP traffic looks like in a voice application. 3000 test packets will take 60s to run (still a short sampling period).
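If you want to sanity-check those numbers before hammering a link, here is a rough Python sketch of the arithmetic. The function name and dictionary keys are just made up for illustration, and it only describes one stream of probes; as far as I know mtr probes every intermediate hop as well, so the total traffic leaving your host will be higher.

```python
def probe_plan(size_bytes, interval_s, count):
    """Nominal figures for one probe stream of `mtr -s SIZE -i INTERVAL -c COUNT`.

    Hypothetical helper for illustration; not part of mtr.
    """
    packets_per_second = 1.0 / interval_s
    return {
        # Total wall-clock time: one probe every interval_s seconds.
        "duration_s": count * interval_s,
        "packets_per_second": packets_per_second,
        # Bytes -> bits, then per second, then kilobits.
        "kbit_per_s": size_bytes * 8 * packets_per_second / 1000.0,
    }


# The values from the command above: -s 200 -i 0.020 -c 3000
plan = probe_plan(size_bytes=200, interval_s=0.020, count=3000)
for name, value in plan.items():
    print(f"{name}: {value:.1f}")
```

For the command above that should come out to roughly 60 seconds, 50 packets per second and 80 kbit/s, which is in the same ballpark as a single voice call.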

Second, the average latency reported for a single router along the way is hard to use for reasoning about problems. Many router vendors de-prioritize responding to ICMP packets. I’ve had a set of brand new, very expensive Junipers in a lab with no traffic running across them except for a small amount of mtr/ping traffic, and I was seeing packet loss and bad latency on the Juniper devices while my end-to-end numbers to the other test host on the same lab network were excellent.

Third, the main thing you can learn from mtr is where a problem starts being introduced. It looks like this: if your average latency on hop 4 jumped up to 150ms and you saw a similar increase on every hop after it, including the device you are testing to, then you can reason that there is an increase in latency which starts between hops 3 and 4.

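This same reasoning can be used for packet loss. If you see 50% packet loss on a hop in the middle of your mtr trace but no significant loss further along the chain, then there is no reason to worry about it. That hop is most likely just de-prioritizing its own replies while still forwarding your traffic fine.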
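The reasoning in the last two paragraphs can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name, the thresholds and the sample numbers are all made up, and real traces are noisier than this.

```python
def onset_hop(values, threshold):
    """Return the 1-based hop where a problem starts, or None.

    A problem only counts if the final hop (the host you are testing to)
    is above the threshold. We then walk backwards while the preceding
    hops are also above it; the hop we stop on is where it starts.
    """
    if not values or values[-1] <= threshold:
        return None
    hop = len(values)
    while hop > 1 and values[hop - 2] > threshold:
        hop -= 1
    return hop


# Made-up average latencies (ms): the jump at hop 4 carries through to the end.
latency = [1.4, 5.8, 11.1, 150.0, 152.3, 155.0, 160.1]
print(onset_hop(latency, threshold=100.0))

# Made-up loss (%): 50% at hop 3 only, nothing further along the chain.
loss = [0.0, 0.0, 50.0, 0.0, 0.0, 0.0]
print(onset_hop(loss, threshold=5.0))
```

The first call points at hop 4, so the latency is being introduced between hops 3 and 4. The second returns None because the loss never reaches the far end, which is the "don't worry about it" case.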

Fourth, keep in mind you are measuring round-trip time and packet loss on round trips. It is not uncommon for routes on the Internet to be asymmetrical. That means your packet loss and latency numbers are measuring not only the forward path you can see in mtr, but also the return path that you can’t see. The only way to get an idea of the return path is to get a trace from the other side back toward you.

Fifth, flow-based load balancing across redundant links can screw up your return path trace, so that it shows a different path back to you than your forward trace is seeing.

With all those caveats, I think mtr is an excellent tool and I use it daily, or at least whenever I get pulled into any sort of question about packet loss or network performance. I especially like running it in non-report (interactive) mode. It’s curses-based, so hit the question mark to see what other modes you can use to look at live-updating results while a test is being performed. I like hitting ‘j’ to see the jitter numbers and a drop counter.
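For anyone wondering what a jitter number is getting at, here is a minimal Python sketch using one common definition: the absolute difference between consecutive round-trip times. I am not claiming this is exactly how mtr computes its jitter columns, and the function name and sample RTTs are made up.

```python
def jitter_stats(rtts_ms):
    """Mean and max of |RTT[i+1] - RTT[i]| over a list of round-trip times.

    Returns None when there are fewer than two samples.
    """
    deltas = [abs(b - a) for a, b in zip(rtts_ms, rtts_ms[1:])]
    if not deltas:
        return None
    return {"mean_ms": sum(deltas) / len(deltas), "max_ms": max(deltas)}


# Made-up samples: steady around 10-12 ms with one spike to 30 ms.
samples = [10.0, 12.0, 11.0, 30.0]
print(jitter_stats(samples))
```

A single spike moves the max a lot and the mean a little, which is why it is worth looking at more than the average when you care about voice quality.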

So, my assessment of the 85ms bad ping response from hop 4: I’d say no problem at all. You’ll see hops 3 and 4 are different in your second trace, and so is your destination. Remember www.google.com isn’t a single host but a DNS name that can resolve to many possible hosts.
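If you’d rather not eyeball two reports to spot path changes, a rough Python sketch like the one below can do it. The function names are my own, it assumes the whitespace-separated layout that `mtr --report` printed above, and I only pasted in hops 1-4 and 13 from your two runs to keep it short.

```python
def parse_hops(report):
    """Map hop number -> host name from `mtr --report` text."""
    hops = {}
    for line in report.splitlines():
        fields = line.split()
        # Hop lines start with something like "4.|--" followed by the host.
        if len(fields) >= 2 and fields[0].endswith(".|--"):
            hops[int(fields[0].split(".")[0])] = fields[1]
    return hops


def changed_hops(first, second):
    """Hop numbers whose host differs (or is missing) between two reports."""
    a, b = parse_hops(first), parse_hops(second)
    return sorted(n for n in a.keys() | b.keys() if a.get(n) != b.get(n))


RUN1 = """\
1.|-- 10.10.10.1 0.0% 3 1.2 2.0 1.2 3.8 1.4
2.|-- L100.PHLAPA-VFTTP-71.veri 0.0% 3 6.3 33.9 6.3 87.2 46.2
3.|-- G102-0-0-16.PHLAPA-LCR-21 0.0% 3 9.5 28.5 9.5 55.7 24.2
4.|-- ae5-0.PHIL-BB-RTR1.verizo 0.0% 3 57.6 28.0 9.2 57.6 25.9
13.|-- yyz08s14-in-f19.1e100.net 33.3% 3 115.2 71.7 28.2 115.2 61.5
"""

RUN2 = """\
1.|-- 10.10.10.1 0.0% 3 1.4 1.4 1.2 1.6 0.0
2.|-- L100.PHLAPA-VFTTP-71.veri 0.0% 3 4.8 5.8 4.8 6.4 0.7
3.|-- G0-9-3-2.PHLAPA-LCR-22.ve 0.0% 3 10.3 11.1 10.2 12.9 1.4
4.|-- ae6-0.PHIL-BB-RTR2.verizo 0.0% 3 5.9 33.5 5.9 85.5 45.1
13.|-- yyz08s09-in-f18.1e100.net 0.0% 3 31.5 29.7 27.0 31.5 2.2
"""

print(changed_hops(RUN1, RUN2))  # -> [3, 4, 13]
```

Hops 3 and 4 and the final host all changed between the runs, so comparing hop 4’s numbers across the two traces is comparing two different routers.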

And even if you trace to the same IP address, you could end up on radically different paths and even in different data centers/hosts, especially when dealing with a company like Google, which is excellent at service availability and is probably re-routing stuff algorithmically.

Hope all this is helpful.