Tim Dodd on 14 Oct 2007 19:36:55 -0000



Re: [PLUG] network/server troubleshoot


Is this a production server?  If so, we could colo the server for you and manage the firewall in an n+1 redundant datacenter.  We can deploy within 24 hours and have you up and running.

Tim Dodd


----- Original Message -----
From: plug-bounces@lists.phillylinux.org <plug-bounces@lists.phillylinux.org>
To: Philadelphia Linux User's Group Discussion List <plug@lists.phillylinux.org>
Sent: Sun Oct 14 15:25:38 2007
Subject: [PLUG] network/server troubleshoot

I've been having an intermittent problem with my firewall server and/or Internet
connection.  Unfortunately, I don't have the time to spare to "tinker" with it
and I'm not a network expert either.  I'm hoping someone here has some insight
because my current favorite solution involves blasting caps and some mixtures
better left unmentioned :-) [that's a joke to express my frustration BTW]

Background:  Firewall is a SME server/CentOS based system with 2 nics.  eth0 is
the Internet and eth1 is the LAN.

The system is running djbdns tools (dnscache and tinydns) but they appear
blameless AFAIK.  I did set it up to use opendns.com rather than my ISP
(Cavalier DSL) but this changed nothing - the problem persisted.

Frequently the Internet connection just ceases to work properly.  It may fix
itself after some indeterminate time.  Here is what I observe:

( for all of the following I am logged in as root on the firewall )

1.  When it does not work (no traffic appears to go in or out) and I type
    ping www.google.com I get the message: ping: unknown host www.google.com

2.  Fetchmail complains like this:

      fetchmail: awakened at Sun Oct 14 09:32:39 2007
      fetchmail: Query status=2 (SOCKET)
      fetchmail: timeout after 300 seconds waiting to connect
                    to server pop.gmail.com.
      fetchmail: socket error while fetching from pop.gmail.com

3.  I can "fix" this situation by entering the following commands (which I have
    combined into a script called "toggle"):

     #!/bin/bash
     /sbin/ifdown eth0
     sleep 3
     /sbin/ifup eth0

4.  To log the problem and temporarily "deal" with it I created a script
    called doody and put it in the root cron to run every minute.
    (You can guess the reason for the name)

     #!/bin/bash
     /bin/ping -W 10 -c 1 www.google.com  >/dev/null
     if [ "$?" == "0" ]
     then
         echo -n '.'
     else
         echo ''
         echo -n 'trouble: '
         date
         /root/bin/toggle
     fi

     Okay, it's stupid but it works temporarily and the outages don't last
     more than a minute this way :-P

     DESPERATION, not necessity, is the mother of invention.

5.   There are no relevant messages in /var/log/messages when it fails.

6.   When I "toggle" the eth0 interface I sometimes see this in
     /var/log/messages:

        Oct 14 13:41:18 polaris kernel: eth0: Setting full-duplex
             based on MII#1 link partner capability of 01e1.

     Less frequently the above line is preceded by:

        Oct 14 15:03:12 polaris kernel:
              0000:01:01.0: tulip_stop_rxtx() failed

     A Google search for "tulip_stop_rxtx" and "failed" yields a bunch of
     useless comments from the kernel list.  Bad news IMHO but I don't
     know what to do about it other than swap out the tulip-based nics.

     Here, for example, is a few hours of doody.log, the output of the
     doody script (every period represents a minute without a problem).
     You can see the frequency of the interruptions:

        trouble: Sun Oct 14 09:59:11 EDT 2007
        .......................................................
        trouble: Sun Oct 14 10:55:11 EDT 2007
        ...................
        trouble: Sun Oct 14 11:15:11 EDT 2007
        ..............
        trouble: Sun Oct 14 11:30:11 EDT 2007
        .........
        trouble: Sun Oct 14 11:40:11 EDT 2007
        ............
        trouble: Sun Oct 14 11:53:11 EDT 2007
        .........................
        trouble: Sun Oct 14 12:19:11 EDT 2007
        .......................................................
        trouble: Sun Oct 14 13:15:11 EDT 2007
        .........................
        trouble: Sun Oct 14 13:41:11 EDT 2007
        ...
        trouble: Sun Oct 14 13:45:11 EDT 2007
        ....................
        trouble: Sun Oct 14 14:06:11 EDT 2007
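
The gaps between outages in a log like this one can be summarized with a few
lines of shell.  This is only a minimal sketch, assuming the log format shown
above (one line of periods per quiet stretch, one "trouble:" line per outage).
The function name outage_gaps is made up for illustration and is not part of
the original scripts.

```shell
#!/bin/bash
# outage_gaps: read a doody-style log on stdin and print, for each run of
# periods, how many minutes passed without a problem.
# (Hypothetical helper; the name is an assumption, not from the original.)
outage_gaps() {
    local line
    # The "|| [ -n ... ]" part handles a final line with no trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        line="${line//[[:space:]]/}"     # drop indentation and stray spaces
        case "$line" in
            "")     ;;                   # blank line: nothing to count
            *[!.]*) ;;                   # "trouble:" lines and anything else
            *)      echo "${#line}" ;;   # pure periods: length = quiet minutes
        esac
    done
}
```

Something like `outage_gaps < doody.log | sort -n | uniq -c` (the log location
is whatever the cron entry redirects to) would then show whether the quiet
stretches cluster around particular lengths.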


My biggest problem is that I don't know how or where to get more information for
troubleshooting this.  It's almost worth the trouble to just replace all the
nics and reconfigure the system.  If I knew that would fix it I would do that
ASAP.
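
One possible way to collect more information is to record, at the moment of
failure, whether a raw IP address is still reachable (as opposed to only name
lookup failing) and to save the interface state before toggling it.  The
following is a rough sketch under stated assumptions, not a tested fix:
208.67.222.222 is used as the probe address because it is one of the opendns.com
resolvers, the function names are hypothetical, and ethtool and mii-tool are
only called if they happen to be installed.

```shell
#!/bin/bash
# classify_outage IP_STATUS NAME_STATUS
# Takes the exit codes of two probes (ping by IP address, ping by hostname)
# and prints which layer appears to be failing.
classify_outage() {
    local ip_status=$1 name_status=$2
    if [ "$ip_status" -ne 0 ]; then
        echo "no-connectivity"   # even a raw IP address is unreachable
    elif [ "$name_status" -ne 0 ]; then
        echo "dns-only"          # packets flow, but name lookup fails
    else
        echo "ok"
    fi
}

# snapshot IFACE: dump interface state so failures can be compared later.
snapshot() {
    local iface=$1 tool
    date
    /sbin/ifconfig "$iface" 2>&1
    for tool in ethtool mii-tool; do
        # Skip tools that are not installed on this system.
        command -v "$tool" >/dev/null 2>&1 && "$tool" "$iface" 2>&1
    done
    dmesg | tail -n 20           # most recent kernel messages
}

# probe IFACE: run both pings, print the verdict, snapshot on any failure.
probe() {
    local iface=${1:-eth0} ip_status name_status verdict
    ping -W 10 -c 1 208.67.222.222 >/dev/null 2>&1; ip_status=$?
    ping -W 10 -c 1 www.google.com >/dev/null 2>&1; name_status=$?
    verdict=$(classify_outage "$ip_status" "$name_status")
    echo "verdict: $verdict"
    [ "$verdict" = "ok" ] || snapshot "$iface"
}
```

Calling `probe eth0` from the cron script just before /root/bin/toggle, with the
output appended to a log file, would show whether each outage is a total loss of
connectivity or only a name-resolution failure, and what the nic reported at
that moment.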

Advice appreciated!

Eric
--
#  Eric Lucas
#
#                "Oh, I have slipped the surly bond of earth
#                 And danced the skies on laughter-silvered wings...
#                                        -- John Gillespie Magee Jr
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug