sean finney on 4 Dec 2009 13:19:07 -0800
(speaking of html parsing... your email was html-only)

On Fri, Dec 04, 2009 at 04:01:05PM -0500, Fred Stluka wrote:
> Keep in mind that most DOM parsers require syntactically correct
> incoming HTML. That is "XHTML", not plain old sloppy HTML like most
> Web pages still use. For example, tags must be properly nested, as:
>   <b><i>text</i></b>
> not improperly as:
>   <b><i>text</b></i>
> and no end tags can be missing, etc. May not work for most of
> today's Web pages.

beautifulsoup ftw.  i've been happily using it for a number of
scraping-related tasks where the input pages were reliably unreliable.

there's only one corner case i've seen where beautifulsoup wouldn't
work, having something to do with embedded html inside of <script>
blocks.  i think i was able to work around that with... a regexp.
AND WE'VE COME FULL CIRCLE.

    import re
    import BeautifulSoup

    def get_soup(self, fh):
        # there is some really nasty shit in these javascript blocks, and
        # it will break beautiful soup unless we clean it out ourselves first
        script_re = re.compile("<script[^>]*>.*?</script>", re.I)
        # strip and join the lines so the regexp can span what were
        # multi-line script blocks without needing re.DOTALL
        data = "".join(map(str.strip, fh.readlines()))
        data = re.sub(script_re, "<script><!-- removed --></script>", data)
        return BeautifulSoup.BeautifulSoup(data)

sean
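As a quick illustration of fred's point, here is a minimal sketch (assuming
the same python 2 / BeautifulSoup 3 vintage as the snippet above; the
variable names are made up): a strict DOM parser rejects the mis-nested
markup outright, while beautifulsoup quietly repairs it.

    import xml.dom.minidom
    import BeautifulSoup

    sloppy = "<b><i>text</b></i>"   # the improperly-nested example above

    try:
        # strict dom parser: expat bails out with a "mismatched tag" error
        xml.dom.minidom.parseString(sloppy)
    except Exception as e:
        print "strict dom parser choked: %s" % e

    # beautifulsoup repairs the nesting instead of giving up
    print BeautifulSoup.BeautifulSoup(sloppy).prettify()

Note also that get_soup() above replaces each script body with a placeholder
rather than deleting the <script> tags outright, so the parsed tree still
has a <script> node wherever the page had one.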