sean finney on 4 Dec 2009 13:19:07 -0800



Re: [PLUG] sed newbie question


(speaking of html parsing... your email was html-only)

On Fri, Dec 04, 2009 at 04:01:05PM -0500, Fred Stluka wrote:
>    Keep in mind that most DOM parsers require syntactically correct
>    incoming HTML.  That is "XHTML", not plain old sloppy HTML like most
>    Web pages still use.  For example, tags must be properly nested, as:
>        <b><i>text</i></b>
>    not improperly as:
>        <b><i>text</b></i>
>    and no end tags can be missing, etc.  May not work for most of
>    today's Web pages.

beautifulsoup ftw.  i've been happily using it for a number of
scraping-related tasks where the input pages were reliably unreliable.
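
a quick way to see the difference Fred describes, as a minimal sketch using
only the python 3 standard library.  xml.dom.minidom stands in for a strict
DOM parser and html.parser for a forgiving one; neither is beautifulsoup
itself, and the helper names (strict_parse_ok, TextCollector, lenient_text)
are made up for illustration:

```python
from html.parser import HTMLParser
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# the mis-nested example from the quoted mail
BAD = "<b><i>text</b></i>"


def strict_parse_ok(markup):
    """Return True if a strict XML DOM parser accepts the markup."""
    try:
        minidom.parseString(markup)
        return True
    except ExpatError:
        # expat rejects mismatched or unclosed tags outright
        return False


class TextCollector(HTMLParser):
    """Forgiving parser: collects text and ignores how tags are nested."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def lenient_text(markup):
    """Feed markup to the forgiving parser and return the text it saw."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)
```

the strict parser refuses BAD, while the forgiving one still hands back the
text inside it, which is roughly the property that makes a tolerant parser
usable on sloppy pages.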

there's only one corner case i've seen where beautifulsoup wouldn't
work, having something to do with html embedded inside of <script> blocks.
i think i was able to work around that with... a regexp.  AND WE'VE COME
FULL CIRCLE.

    # assumes "import re" and "import BeautifulSoup" at the top of the module
    def get_soup(self, fh):
        # there is some really nasty shit in these javascript blocks, and
        # it will break beautiful soup unless we clean it out ourselves first
        script_re = re.compile(r"<script[^>]*>.*?</script>", re.I)
        # strip and join the lines so "." never has to match a newline
        data = "".join(map(str.strip, fh.readlines()))
        data = script_re.sub("<script><!-- removed --></script>", data)
        return BeautifulSoup.BeautifulSoup(data)
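
a variant of the same idea as a standalone python 3 sketch.  stripping and
joining lines can glue together tokens that were only separated by a line
break, so this version leaves the text alone and uses re.S to let "." cross
newlines instead.  the names SCRIPT_RE and strip_scripts and the sample page
are hypothetical, not from the original code:

```python
import re

# re.I matches <SCRIPT> as well as <script>; re.S lets "." cross newlines,
# so the input does not need its lines joined first
SCRIPT_RE = re.compile(r"<script[^>]*>.*?</script>", re.I | re.S)


def strip_scripts(html):
    """Replace every <script> block with an empty placeholder block."""
    return SCRIPT_RE.sub("<script><!-- removed --></script>", html)


# made-up sample page with broken markup hidden inside a script block
page = """<html><body>
<SCRIPT type="text/javascript">
document.write("<b>oops</i>");
</SCRIPT>
<p>real content</p>
</body></html>"""

cleaned = strip_scripts(page)
```

the cleaned string can then go to whatever parser is in use, for example
bs4.BeautifulSoup(cleaned, "html.parser") if the newer bs4 package is
installed (an assumption; the code above used the older 3.x module).  note
that the non-greedy match still stops at the first "</script>", even one
sitting inside a javascript string.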



	sean


___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug