Michel van der List on 31 Jan 2012 11:15:50 -0800



Re: [PLUG] Link Checker


I have used this python script in the past:

http://arthurdejong.org/webcheck/

I found it to work reasonably well for what I needed. YMMV.

Michel

On 01/31/2012 02:08 PM, Russ Thompson wrote:
I wrote a script several years ago that does just this using curl.  My issue with using wget for this task is that I wouldn't consider plain wget production caliber, and it's slow.

There is very little event or error handling, etc.

I'll have to see if I can locate the script; if not, it should be a quick rewrite.  It essentially goes through a site recursively and pulls the HTTP status code of every URL using curl.  Any abnormal return codes are logged.  It also forked, so it could check many URLs at the same time, making it very fast.

-Russ
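
Not Russ's original script, but a minimal Python sketch of the approach he describes: crawl one site recursively, record the HTTP status code of every URL, and log anything abnormal, using a thread pool where his version forked.  The site root and worker count are placeholder assumptions, and it speaks HTTP via the standard library rather than shelling out to curl.

# Minimal sketch of a concurrent link checker: crawl one site,
# record HTTP status codes, and log abnormal ones.  Standard
# library only; START_URL and WORKERS are placeholder values.
import concurrent.futures
import urllib.error
import urllib.parse
import urllib.request
from html.parser import HTMLParser

START_URL = "http://www.example.com/"  # placeholder site root
WORKERS = 8                            # placeholder thread count

class LinkParser(HTMLParser):
    """Collect href attributes from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def check(url):
    """Fetch url; return (status, same-site links found on the page)."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            status = resp.status
            if "html" not in resp.headers.get("Content-Type", ""):
                return status, []
            parser = LinkParser()
            parser.feed(resp.read().decode("utf-8", "replace"))
    except urllib.error.HTTPError as e:
        return e.code, []
    except Exception:
        return None, []  # network failure, not an HTTP status
    site = urllib.parse.urlparse(START_URL).netloc
    links = {urllib.parse.urldefrag(urllib.parse.urljoin(url, h)).url
             for h in parser.links}
    return status, [l for l in links
                    if urllib.parse.urlparse(l).netloc == site]

seen = {START_URL}
pending = [START_URL]
with concurrent.futures.ThreadPoolExecutor(WORKERS) as pool:
    while pending:
        batch, pending = pending, []
        for url, (status, links) in zip(batch, pool.map(check, batch)):
            if status != 200:
                print(status, url)  # log abnormal return codes
            for link in links:
                if link not in seen:
                    seen.add(link)
                    pending.append(link)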

On Tue, Jan 31, 2012 at 1:15 AM, Fred Stluka <fred@bristle.com> wrote:
Keep in mind that many sites now send the user to a custom 404
page, and may not return a 404 status when they do.  I wrote a
Java applet 10 or more years ago to check the links on my links
page, but gave it up when I hit that snag.  Didn't want to write
code to analyze the returned page and decide if it was a 404 page
or a real page.

--Fred
------------------------------------------------------------------------
Fred Stluka -- mailto:fred@bristle.com -- http://bristle.com/~fred/
Bristle Software, Inc -- http://bristle.com -- Glad to be of service!
Open Source: Without walls and fences, we need no Windows or Gates.
------------------------------------------------------------------------
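
For what it's worth, one heuristic for Fred's snag that avoids hand-analyzing page content: fetch a deliberately bogus URL on the same host, and if the server answers 200 for that too, compare its body to the page being checked; a near-identical body suggests a custom 404 page.  A rough Python sketch (not Fred's applet), with an arbitrary similarity threshold:

# Sketch of a "soft 404" heuristic: probe a bogus URL on the same
# host and compare its body to the page under test.  The 0.9
# similarity threshold is an arbitrary assumption.
import urllib.error
import urllib.parse
import urllib.request
import uuid
from difflib import SequenceMatcher

def fetch(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.status, resp.read().decode("utf-8", "replace")

def looks_like_soft_404(url):
    """True if url's body closely matches the page served for a bogus path."""
    status, body = fetch(url)
    if status != 200:
        return False  # got a real status code; trust it
    parts = urllib.parse.urlparse(url)
    bogus = "%s://%s/%s" % (parts.scheme, parts.netloc, uuid.uuid4().hex)
    try:
        _, bogus_body = fetch(bogus)  # a well-behaved server 404s here...
    except urllib.error.HTTPError:
        return False  # ...and if it does, the site has no soft 404s
    return SequenceMatcher(None, body, bogus_body).ratio() > 0.9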


On 1/30/12 1:55 PM, brainbuz wrote:

I've been looking at link checkers, and while http://validator.w3.org/checklink is pretty thorough, it is slow and its output doesn't lend itself readily to spreadsheet analysis.

I have a large website (thousands of documents) that I'm trying to generate a broken-links report for.

    Links outside the site being checked should be verified but not followed any further.

    Links inside the site should be followed all the way.

    Since some links are in JavaScript, it would be nice to follow links within JavaScript.

    There are a lot of vanity URLs in place, which means most redirects are not errors, but we should have a report of redirects so we can review them.

    There is a lot of removed content; a separate report of this would be nice, but we're not going to clean thousands of dead links out of pages.

Beyond what we need, a product under an OSI-compliant license would be preferred, but I wouldn't rule out paying for a tool.
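
One way to read those requirements (a sketch, not a recommendation of any particular tool): classify each URL the crawler finds and route it to the right report, catching redirects instead of following them so the vanity URLs show up for review.  The site root and report filename are placeholder assumptions, and JavaScript-embedded links would still need a separate extraction pass:

# Sketch of the report split described above: redirects and removed
# content are classified separately, external links are verified but
# not crawled.  SITE_ROOT and the CSV filename are placeholders.
import csv
import urllib.error
import urllib.parse
import urllib.request

SITE_ROOT = "http://www.example.com/"  # placeholder

class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Report redirects instead of silently following them."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # makes opener.open raise HTTPError for 3xx

def classify(url):
    """Return (category, status) for one URL without following redirects."""
    opener = urllib.request.build_opener(NoRedirect())
    try:
        with opener.open(urllib.request.Request(url, method="HEAD"),
                         timeout=10) as resp:
            status = resp.status
    except urllib.error.HTTPError as e:
        status = e.code
    if status in (301, 302, 303, 307, 308):
        return "redirect", status  # vanity URLs land here, for review
    if status in (404, 410):
        return "removed", status   # the separate removed-content report
    if status >= 400:
        return "broken", status
    internal = (urllib.parse.urlparse(url).netloc
                == urllib.parse.urlparse(SITE_ROOT).netloc)
    return ("internal" if internal else "external-ok"), status

# One CSV row per URL so the output opens cleanly in a spreadsheet.
with open("links-report.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["category", "status", "url"])
    for url in [SITE_ROOT]:  # feed the crawler's URL list here
        category, status = classify(url)
        writer.writerow([category, status, url])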



___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug