How did you extract links?  That seems like the hard part.  For example, you might have an href that says "bar" but the link is really "/foo/bar" if there is a base tag.
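
For what it's worth, here is a minimal sketch of the base-tag handling using only Python's standard library. It collects every href, remembers the first `<base href>`, and resolves each link with `urljoin`. The names `LinkExtractor` and `extract_links` are made up for illustration, and it only looks at `<a>` and `<area>` tags, so treat it as a starting point rather than a complete extractor.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects raw href values and the first <base href> in a page."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if not href:
            return
        if tag == "base":
            # Only the first <base href> in the document is honored.
            if self.base_href is None:
                self.base_href = href
        elif tag in ("a", "area"):
            self.hrefs.append(href)


def extract_links(page_url, html):
    """Return absolute, fragment-free URLs for every link in the page."""
    parser = LinkExtractor()
    parser.feed(html)
    # The base href may itself be relative to the page URL.
    base = urljoin(page_url, parser.base_href) if parser.base_href else page_url
    links = []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(base, href))
        links.append(absolute)
    return links
```

Note that `urljoin` only applies the base path to relative hrefs such as "bar". A root-relative href like "/bar" still resolves against the host root, base tag or not.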

I would use CURL for embedding in a program (as opposed to calling out to command-line wget), but again, parsing HTML to get the right links to follow seems like the challenge.

Another option might be Selenium, although it would be slower (but who cares if you only run it once in a while).  Selenium automates your browser (typically Firefox), so you can easily write a test script and have it verify functionality.  Pseudo-code example:
go to this page
click this link
verify that it's not an error code and that the text "Welcome" appears somewhere on the page

The advantage here would be that you can handle JavaScript-created links (since it's running a full browser).  What isn't quite natural for Selenium IDE is the idea of following all of the links on a page.  You would probably want to use the more powerful Selenium WebDriver so you can write your program in Java, Python, Ruby, etc.


I wrote a script several years ago which does just this using CURL.  My issue with using wget for this task is that I wouldn't consider plain 'wget' to be production caliber, and it is slow.

There is very little event handling, error handling, etc.

I'll have to see if I can locate the script; if not, it should be a quick rewrite.  It essentially goes through a site recursively and pulls HTTP status codes from every URL using CURL.  Any abnormal return codes are logged.  It also forked, so it could check many URLs at the same time, very fast.
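
Until the original turns up, here is a rough sketch of the status-checking part. It is not the script described above: it swaps in Python's `urllib` for CURL and a thread pool for forking, and the names `fetch_status` and `check_urls` are invented for the example. It assumes anything other than a 200 counts as "abnormal".

```python
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_status(url, timeout=10):
    """Return the HTTP status for url, or None if the request failed outright."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx and 5xx responses arrive as exceptions
    except (urllib.error.URLError, OSError, ValueError):
        return None  # DNS failure, refused connection, timeout, bad URL


def check_urls(urls, fetch=fetch_status, workers=20):
    """Check urls in parallel; return {url: status} for the abnormal ones."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        statuses = list(pool.map(fetch, urls))
    # Assumption: only a plain 200 is "normal"; everything else gets logged.
    return {url: status for url, status in zip(urls, statuses) if status != 200}
```

Two caveats: `urlopen` follows redirects on its own, so this reports the status of the final page rather than the redirect itself, and some servers reject HEAD requests, in which case a GET fallback would be needed.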


Keep in mind that many sites now send the user to a custom 404
page, and may not return a 404 status when they do.  I wrote a
Java applet 10 or more years ago to check the links on my links
page, but gave it up when I hit that snag.  Didn't want to write
code to analyze the returned page and decide if it was a 404 page
or a real page.
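
One common workaround, sketched here under some assumptions: request a URL on the site that is known not to exist (a long random path, say), keep the page that comes back, and flag any later page that looks nearly identical to it. The function name `looks_like_soft_404` and the 0.8 threshold are arbitrary choices for illustration, and the threshold would need tuning per site.

```python
from difflib import SequenceMatcher


def looks_like_soft_404(body, not_found_body, threshold=0.8):
    """Guess whether body is the site's custom 'not found' page.

    not_found_body is the page the site returned for a URL that is known
    not to exist. Pages are compared line by line, which is much cheaper
    than comparing character by character on large pages.
    """
    matcher = SequenceMatcher(
        None, body.splitlines(), not_found_body.splitlines(), autojunk=False
    )
    return matcher.ratio() >= threshold
```

This is only a heuristic. A site whose pages are mostly shared template with very little unique content could trigger false positives.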

I've been looking at link checkers, and while the is pretty thorough, it is slow and the output doesn't lend itself readily to spreadsheet analysis.

I have a large (thousands of documents) website that I'm trying to generate a broken links report for.

    Links outside the site being checked should be verified but not followed any further.

    Links inside the site should be followed all the way.

    Since some links are in JavaScript, it would be nice to follow links within JavaScript.

    There are a lot of vanity URLs in place, which means most redirects are not errors, but we should have a report of redirects to review them.

    There is a lot of removed content; a separate report of this would be nice, but we're not going to clean thousands of dead links out of pages.
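
To make the requirements above concrete, here is a rough sketch of the crawl policy, not a finished tool. `fetch` is a hypothetical callable supplied by the caller that returns `(status, location, links)` for a URL, where `location` is the redirect target (or None) and `links` are absolute URLs found on the page. Treating 404 and 410 as "removed content" and every other non-200 status as an error is an assumption made for the example. JavaScript links are not handled.

```python
from collections import deque
from urllib.parse import urlparse

REDIRECTS = (301, 302, 303, 307, 308)


def crawl(start_url, fetch):
    """Breadth-first crawl that sorts results into three reports.

    fetch is a hypothetical callable: fetch(url) -> (status, location, links).
    Each report row is (referring page, url, detail).
    """
    site = urlparse(start_url).netloc
    reports = {"redirects": [], "removed": [], "errors": []}
    seen = {start_url}
    queue = deque([(start_url, None)])  # (url, page that linked to it)

    while queue:
        url, referrer = queue.popleft()
        status, location, links = fetch(url)

        if status in REDIRECTS:
            # Not an error, but recorded for review; the target is checked too.
            reports["redirects"].append((referrer, url, location))
            next_urls = [location] if location else []
        elif status in (404, 410):
            # Assumption: these statuses mean the content was removed.
            reports["removed"].append((referrer, url, status))
            next_urls = []
        elif status != 200:
            reports["errors"].append((referrer, url, status))
            next_urls = []
        elif urlparse(url).netloc == site:
            next_urls = links  # internal page: follow all the way
        else:
            next_urls = []  # external page: verified, not followed further

        for link in next_urls:
            if link not in seen:
                seen.add(link)
                queue.append((link, url))
    return reports
```

Each report is a list of plain tuples, so it can be written out with `csv.writer(...).writerows(...)` for spreadsheet analysis.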

Beyond what we need, a product under an OSI-approved license would be preferred, but I wouldn't rule out paying for a tool.

