Re: [PLUG] sed newbie question

sean finney on 4 Dec 2009 13:19:07 -0800


From: sean finney <seanius@seanius.net>

To: Philadelphia Linux User's Group Discussion List <plug@lists.phillylinux.org>

Subject: Re: [PLUG] sed newbie question

Date: Fri, 4 Dec 2009 22:18:57 +0100

Reply-to: Philadelphia Linux User's Group Discussion List <plug@lists.phillylinux.org>

Sender: plug-bounces@lists.phillylinux.org

User-agent: Mutt/1.5.20 (2009-06-14)

(speaking of html parsing... your email was html-only)

On Fri, Dec 04, 2009 at 04:01:05PM -0500, Fred Stluka wrote:
> Keep in mind that most DOM parsers require syntactically correct
> incoming HTML.  That is "XHTML", not plain old sloppy HTML like most
> Web pages still use.  For example, tags must be properly nested, as:
>     <b><i>text</i></b>
> not improperly as:
>     <b><i>text</b></i>
> and no end tags can be missing, etc.  May not work for most of
> today's Web pages.

beautifulsoup ftw.  i've been happily using it for a number of
scraping-related tasks where the input pages were reliably unreliable.

there's only one corner case i've seen where beautifulsoup wouldn't work,
having something to do with embedded html inside of <script> blocks.  i
think i was able to work around that with... a regexp.

AND WE'VE COME FULL CIRCLE.

    # assumes "import re" and "import BeautifulSoup" (BeautifulSoup 3)
    def get_soup(self, fh):
        # there is some really nasty shit in these javascript blocks, and
        # it will break beautiful soup unless we clean it out ourselves first
        script_re = re.compile("<script[^>]*>.*?</script>", re.I)
        # strip and join the lines so that ".*?" (which does not match
        # newlines without re.S) can span a whole multi-line script block
        data = "".join(map(str.strip, fh.readlines()))
        data = re.sub(script_re, "<script></script>", data)
        return BeautifulSoup.BeautifulSoup(data)

	sean
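
For anyone who wants to try the regexp step from get_soup() on its own, here is a minimal standalone sketch. It uses only the stdlib re module and leaves the soup construction out. The helper name scrub_scripts and the sample page are made up for illustration, and re.S is swapped in for the strip-and-join trick so that "." can cross newlines directly. Treat it as one possible way to write that step, not as the code from the original scraper.

```python
import re

# Hypothetical helper; mirrors the regexp step in get_soup().
# re.I matches <SCRIPT> as well as <script>; re.S lets "." span newlines,
# so the input does not have to be collapsed onto one line first.
SCRIPT_RE = re.compile(r"<script[^>]*>.*?</script>", re.I | re.S)


def scrub_scripts(html):
    """Replace every <script>...</script> block with an empty one."""
    return SCRIPT_RE.sub("<script></script>", html)


# Made-up sample page: the script body contains badly nested html,
# the kind of thing that can confuse a lenient parser.
page = """<html><body>
<SCRIPT type="text/javascript">
document.write("<b><i>oops</b></i>");
</SCRIPT>
<p>real content</p>
</body></html>"""

cleaned = scrub_scripts(page)
print(cleaned)
```

The cleaned string can then be handed to whatever parser is in use. Keeping an empty <script></script> pair in place, as the original does, preserves the position of the element in the document while dropping its contents.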
