Russ Thompson on 30 Jan 2012 11:33:55 -0800



Re: [PLUG] Link Checker


Why use wget for this task?

Curl is the appropriate tool....
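For the actual checking, something along these lines works, assuming you already have a flat list of URLs to test (urls.txt here is a made-up name; curl won't spider a site for you, so something else still has to build that list):

# HEAD each URL and print its HTTP status; keep only the non-2xx lines.
# Some servers mishandle HEAD requests; drop -I to do a full GET instead.
while read -r url; do
  code=$(curl -s -o /dev/null -w '%{http_code}' -I "$url")
  echo "$code $url"
done < urls.txt | grep -v '^2'

Without -L, curl reports redirects as 30x codes, which also gives you the redirect review list asked for below.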

On Mon, Jan 30, 2012 at 2:20 PM, Amir Tahvildaran <amir@mathforum.org> wrote:
I've used wget for some simple spidering tasks like this before. It leaves a little to be desired, so I'm also interested in alternatives.

AFAIK no version of wget executes JavaScript, so that might be a dealbreaker for you.  Are you able to generate a listing of all of the documents that do exist on your site?  (You would be able to do this, for example, if your site were 100% static/served from the filesystem.)
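If it is static, a plain find over the docroot gives you that listing (the path and the extensions below are just placeholders):

# One path per line for every document that should exist on the site.
find /var/www/site -type f \( -name '*.html' -o -name '*.pdf' \) | sort > inventory.txt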

You might need to do two passes: one fully recursive but restricted to your domain. Take the output from that, build a listing of links, and then spider that listing (recursive, any domain, but only one level deep).
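A rough way to build that second-pass listing from whatever the first pass downloaded (site/ and example.com are placeholders, and the regex is crude, so it will miss single-quoted or JavaScript-generated links):

# Pull absolute http(s) links out of the mirrored HTML, drop our own
# domain, and dedupe; the result feeds --input-file=urls.wget below.
grep -rhoE 'href="https?://[^"]+"' site/ \
  | sed 's/^href="//; s/"$//' \
  | grep -v '://[^/]*example\.com' \
  | sort -u > urls.wget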

Here are some wget recipes:

The following will get all pages one level away from the URLs listed in the file urls.wget:

wget --page-requisites --recursive --level=1 --directory-prefix=directory_to_contain_files --convert-links --html-extension --input-file=urls.wget

The following will get all pages under the directory /goodwin/, starting at the main page:

wget --page-requisites --recursive --no-parent --convert-links --html-extension -o log --directory-prefix=goodwin http://drexel.edu/goodwin/

If you use the -o log option, you can then grep the output for 40x, 50x errors.

grep -B2 -A1 ' [45]0[0-9] ' log
I'm using wget v1.12
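Since you wanted something spreadsheet-friendly, a rough awk pass over the same log can pair each error status with the URL requested just before it. This assumes wget 1.12's English log wording ("HTTP request sent, awaiting response... 404 Not Found"), so treat it as a starting point:

# Emit "status,url" lines for every 4xx/5xx response in the wget log.
awk '/^--/ { url = $NF }
     /awaiting response/ && match($0, / [45][0-9][0-9] /) {
         print substr($0, RSTART + 1, 3) "," url
     }' log > broken.csv

The wider [45][0-9][0-9] pattern also picks up codes like 410 Gone, which may be relevant for the removed-content report.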

-Amir

On Mon, Jan 30, 2012 at 1:55 PM, brainbuz <brainbuz@brainbuz.org> wrote:

I've been looking at link checkers, and while the W3C checker at http://validator.w3.org/checklink is pretty thorough, it is slow and the output doesn't lend itself readily to spreadsheet analysis.

I have a large (thousands of documents) website that I'm trying to generate a broken links report for.

     Links outside the site being checked should be verified but not followed any further.

     Links inside the site should be followed all the way.

     Since some links are in JavaScript, it would be nice to follow links within JavaScript.

     There are a lot of vanity URLs in place, which means most redirects are not errors, but we should have a report of redirects so we can review them.

     There is a lot of removed content; a separate report of this would be nice, but we're not going to clean thousands of dead links out of pages.

Beyond those requirements, a product under an OSI-compliant license would be preferred, but I wouldn't rule out paying for a tool.

 

___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug


