Re: Interesting problem...

Meng Weng Wong on Fri, 3 Nov 2000 15:23:41 -0500 (EST)



From: Meng Weng Wong <mengwong@dumbo.pobox.com>

To: phl@lists.pm.org

Subject: Re: Interesting problem...

Date: Fri, 3 Nov 2000 15:23:24 -0500

Reply-to: phl@lists.pm.org

Sender: owner-phl@lists.pm.org

User-agent: Mutt/1.2i

On Thu, Nov 02, 2000 at 04:29:18PM -0500, Michael Grabenstein wrote:
|
| I have written a subroutine that uses regular expressions to
| parse XML.

i know this isn't the answer you're looking for, but i'm on
kneejerk duty, so here's the kneejerk response: "you can't use
regular expressions to write a parser!"

How do I remove HTML from a string?

The most correct way (albeit not the fastest) is to use HTML::Parse
from CPAN (part of the HTML-Tree package on CPAN). Many folks attempt
a simple-minded regular expression approach, like s/<.*?>//g, but
that fails in many cases because the tags may continue over line
breaks, they may contain quoted angle-brackets, or HTML comments may
be present. Plus folks forget to convert entities, like &lt; for
example.

Here's one "simple-minded" approach, that works for most files:

    #!/usr/bin/perl -p0777
    s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

If you want a more complete solution, see the 3-stage striphtml
program in
http://www.perl.com/CPAN/authors/Tom_Christiansen/scripts/striphtml.gz .

Here are some tricky cases that you should think about when picking
a solution:

    <IMG SRC = "foo.gif" ALT = "A > B">

    <IMG SRC = "foo.gif"
         ALT = "A > B">

    <!-- <A comment> -->

    <script>if (a<b && a>c)</script>

    <# Just data #>

    <![INCLUDE CDATA [ >>>>>>>>>>>> ]]>

If HTML comments include other tags, those solutions would also
break on text like this:

    <!-- This section commented out.
        <B>You can't see me!</B>
    -->
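To make the failure modes concrete, here is a small sketch that runs the two substitutions quoted above against a few of the tricky cases. The sub names strip_naive and strip_quote_aware, and the short comment sample, are made up for illustration; only the two regular expressions come from the text. It is meant as a demonstration of why the regex route is fragile, not as a recommended way to strip markup.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the naive pattern quoted above.
sub strip_naive {
    my ($html) = @_;
    $html =~ s/<.*?>//g;
    return $html;
}

# Hypothetical wrapper around the quote-aware pattern quoted above.
sub strip_quote_aware {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

# A quoted '>' inside an attribute value.
my $quoted = qq{<IMG SRC = "foo.gif" ALT = "A > B">after};
print strip_naive($quoted), "\n";        # prints ' B">after' (tag cut short)
print strip_quote_aware($quoted), "\n";  # prints 'after'

# A tag that continues over a line break.
my $multiline = qq{<IMG SRC = "foo.gif"\n     ALT = "A > B">after};
print strip_naive($multiline) eq $multiline
    ? "naive: tag left untouched\n"
    : "naive: tag changed\n";            # prints 'naive: tag left untouched'
print strip_quote_aware($multiline), "\n";  # prints 'after'

# Script content that is not markup at all.
my $script = q{<script>if (a<b && a>c)</script>};
print strip_quote_aware($script), "\n";  # prints 'if (ac)' (code mangled)

# A comment that contains a tag (illustrative sample).
my $comment = q{<!-- <B>hidden</B> -->shown};
print strip_quote_aware($comment), "\n"; # prints 'hidden -->shown'
```

The quote-aware pattern handles the first two cases, but it still eats part of the script body and leaks the commented-out text, which is the argument for a real parser.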

Follow-Ups:

Re: Interesting problem...
From: Michael Grabenstein <mgrabens@popd.isinet.com>

References:

Interesting problem...
From: Michael Grabenstein <mgrabens@popd.isinet.com>
