bergman on 13 Jun 2013 10:11:44 -0700
[PLUG] packet loss weirdness
Recently I began seeing network problems on a particular server. Simple tests consistently reported that the machine was 'up', but NFS clients would fail to mount volumes, connections to existing mounts were slow, and there were increasing instances of HA cluster fencing due to network unreliability.

Here's the basic network information for srv1 (addresses have been obfuscated to protect the innocent):

	eth0: 172.108.123.19	"public" to campus network
	eth1: 192.168.111.19	private admin network w/in racks

There was approximately 70% packet loss on 172.108.123.0/24.

The packet loss was symmetrical (when pinging srv1 from other devices on the same switch, or when pinging other devices on the same switch from srv1).

The packet loss was consistent over extended periods (hours).

The packet loss was consistent when the server was basically idle.

There was no packet loss on eth1 (192.168.111.0/24).

Packet loss was persistent across reboots.

This was beginning to look like a cable problem...maybe the cable got pinched while closing the doors on the rack....I swapped cables.

The packet loss was persistent after changing the network cable.

Hmmm....maybe the switch port is flaky....replug the eth0 interface from the server into a different port on the same switch...the packet loss was persistent after changing the switch port.

There was nothing immediately obvious from entries in /var/log/messages or dmesg.

The MTU on the eth0 interface was correct.

The interface speed, duplex, and flow-control settings were correct.

The hardware diagnostics (Dell's OpenManage utils) reported no faults.

Next, I logically changed the IP settings for eth0 and eth1 (assigning the 192.168.111.19 address to eth0 and 172.108.123.19 to eth1) and swapped cables so that:

	eth0 => administrative switch
	eth1 => "public" campus switch

I expected that the eth0 physical device would continue to show packet loss, even though its traffic was now on the 192.168.111.0 network.
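For anyone who wants to reproduce the per-interface loss measurements described above, here is a minimal Python sketch of one way to do it. It is not the method used for the numbers in this post. It assumes a Linux iputils-style `ping` whose summary line reads "N packets transmitted, M received"; the function names `loss_percent` and `ping_loss` and the target addresses in the usage section are hypothetical.

```python
import re
import subprocess

# Matches the ping summary line, e.g.
# "20 packets transmitted, 6 received, 70% packet loss, time 19012ms"
# (some ping variants print "6 packets received" instead).
_SUMMARY = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def loss_percent(ping_output: str) -> float:
    """Compute packet loss (in percent) from the text printed by ping."""
    match = _SUMMARY.search(ping_output)
    if match is None:
        raise ValueError("no ping summary line found")
    sent, received = int(match.group(1)), int(match.group(2))
    if sent == 0:
        raise ValueError("ping reported zero packets transmitted")
    return 100.0 * (sent - received) / sent


def ping_loss(host: str, iface: str, count: int = 20) -> float:
    """Ping `host` through interface `iface` and return the loss percentage.

    Assumes an iputils-style ping: -q (quiet), -c (count), -I (interface).
    """
    result = subprocess.run(
        ["ping", "-q", "-c", str(count), "-I", iface, host],
        capture_output=True,
        text=True,
    )
    return loss_percent(result.stdout)


if __name__ == "__main__":
    # Hypothetical targets: one neighbour on each network.
    for iface, host in [("eth0", "172.108.123.1"), ("eth1", "192.168.111.1")]:
        print(iface, host, f"{ping_loss(host, iface):.0f}% loss")
```

Running the same check before and after swapping addresses between interfaces shows whether the loss follows the physical port or the IP network.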
Now, sit back and get ready for the weirdness:

Packet loss persists at ~70% on 172.108.123.0/24 even though the physical connection is now on eth1.

I next disabled eth0/eth1 interfaces on the motherboard, added a new PCI NIC, and assigned the previous addresses to ports on the new NIC. Now there is no packet loss on either the public or admin network.

Any guesses about what was happening?

Thanks,

Mark "perplexed in Philly" Bergman
___________________________________________________________________________
Philadelphia Linux Users Group -- http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug