Re: [PLUG] packet loss weirdness

On Thu, Jun 13, 2013 at 12:45 PM, <bergman@merctech.com> wrote:

Recently I began seeing network problems on a particular server. Simple
tests consistently reported that the machine was 'up', but NFS
clients would fail to mount volumes, connections to existing mounts
were slow, and HA cluster fencing events caused by network
unreliability were becoming more frequent.

Here's the basic network information for srv1 (addresses have been
obfuscated to protect the innocent):

eth0: 172.108.123.19 "public" to campus network
eth1: 192.168.111.19 private admin network w/in racks

There was approximately 70% packet loss on 172.108.123.0/24. The packet
loss was symmetrical (when pinging srv1 from other devices on the same
switch, or when pinging other devices on the same switch from srv1). The
packet loss was consistent over extended periods (hours). The packet
loss was consistent when the server was basically idle.

There was no packet loss on eth1 (192.168.111.0/24).
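
For anyone wanting to repeat the per-interface loss measurement, here is
a minimal Python sketch. It assumes the Linux iputils ping, whose
summary line contains "N% packet loss", and uses its -c (count) and -I
(interface) options. The function names and the sample text in the
comments are illustrative, not taken from the original tests.

```python
import re
import subprocess

# Matches the loss figure in an iputils-style summary line, e.g.
# "10 packets transmitted, 3 received, 70% packet loss, time 9013ms"
LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def parse_packet_loss(ping_output):
    """Return the packet-loss percentage found in ping's summary text."""
    match = LOSS_RE.search(ping_output)
    if match is None:
        raise ValueError("no packet-loss summary found in ping output")
    return float(match.group(1))


def measure_loss(host, iface, count=20):
    """Ping host through iface and return the reported loss percentage.

    ping exits non-zero when replies are missing, so the exit status
    is deliberately not treated as an error here.
    """
    result = subprocess.run(
        ["ping", "-c", str(count), "-I", iface, host],
        capture_output=True,
        text=True,
    )
    return parse_packet_loss(result.stdout)
```

Calling measure_loss() once per interface, against a host on the
matching subnet, gives numbers that can be compared between eth0 and
eth1.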

Packet loss was persistent across reboots.

This was beginning to look like a cable problem...maybe the cable got
pinched while closing the doors on the rack....I swapped cables. The
packet loss was persistent after changing the network cable.

Hmmm....maybe the switch port is flaky....I moved the eth0 cable from
the server to a different port on the same switch...the packet loss
was persistent after changing the switch port.

There was nothing immediately obvious from entries in /var/log/messages
or dmesg.

The MTU on the eth0 interface was correct.

The interface speed, duplex, and flow-control settings were correct.
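
The MTU, speed, and duplex checks can be scripted by reading the files
Linux exposes under /sys/class/net/<iface>/. The sketch below is one way
to do that, not necessarily how the values above were checked. The
function name and the sysfs_root parameter are illustrative.
Flow-control (pause) settings are not in these files; they still need
something like "ethtool -a".

```python
from pathlib import Path


def read_link_settings(iface, sysfs_root="/sys/class/net"):
    """Return mtu, speed, duplex, and operstate for iface as strings.

    A value is None when its file is missing or unreadable; for
    example, reading 'speed' can fail while the link is down.
    """
    base = Path(sysfs_root) / iface
    settings = {}
    for name in ("mtu", "speed", "duplex", "operstate"):
        try:
            settings[name] = (base / name).read_text().strip()
        except OSError:
            settings[name] = None
    return settings


if __name__ == "__main__":
    for nic in ("eth0", "eth1"):
        print(nic, read_link_settings(nic))
```

Comparing the output for eth0 and eth1 side by side makes a mismatch in
MTU or duplex easy to spot.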

The hardware diagnostics (Dell's OpenManage utils) reported no faults.

Next, I logically changed IP settings for eth0 and eth1 (assigning the
192.168.111.19 address to eth0 and 172.108.123.19 to eth1) and swapped
cables so that:

eth0 => administrative switch
eth1 => "public" campus switch

I expected that the eth0 physical device would continue to show packet
loss, even though its traffic was now on the 192.168.111.0/24 network.

Now, sit back and get ready for the weirdness:

Packet loss persists at ~70% on 172.108.123.0/24 even though the
physical connection is now on eth1.
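
The reasoning behind the swap test can be written down as a tiny
decision function: note where the loss was seen before and after the
swap, then see which part stayed the same. This is only a sketch of the
logic; the function name is made up for illustration. A result of
"subnet" says only that the loss moved with the address and network
rather than with the port. It does not identify a cause.

```python
def fault_follows(before, after):
    """Say whether loss tracks the NIC or the subnet across a swap.

    before and after are (nic, subnet) pairs naming where the packet
    loss was observed before and after swapping addresses and cables.
    """
    nic_before, net_before = before
    nic_after, net_after = after
    same_nic = nic_before == nic_after
    same_net = net_before == net_after
    if same_nic and not same_net:
        return "nic"      # loss stayed on the port: suspect hardware
    if same_net and not same_nic:
        return "subnet"   # loss moved with the address/network
    return "unclear"


# Observations from this message: loss on 172.108.123.0/24 both times,
# first via eth0 and then via eth1.
print(fault_follows(("eth0", "172.108.123.0/24"),
                    ("eth1", "172.108.123.0/24")))  # prints "subnet"
```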

Next I disabled the onboard eth0 and eth1 interfaces, added a new
PCI NIC, and assigned the previous addresses to ports on the new NIC. Now
there is no packet loss on either the public or admin network.

Any guesses about what was happening?

Thanks,

Mark "perplexed in Philly" Bergman
___________________________________________________________________________
Philadelphia Linux Users Group -- http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug

--
John Kreno

"Those who would sacrifice essential liberties for a little temporary safety deserve neither liberty nor safety." - Ben Franklin