Meng Weng Wong on Fri, 3 Nov 2000 15:23:41 -0500 (EST)



Re: Interesting problem...


On Thu, Nov 02, 2000 at 04:29:18PM -0500, Michael Grabenstein wrote:
| 
|     I have written a subroutine that uses regular expressions to
| parse XML.

i know this isn't the answer you're looking for, but i'm on
kneejerk duty, so here's the kneejerk response: "you can't
use regular expressions to write a parser!"

   How do I remove HTML from a string?

   The most correct way (albeit not the fastest) is to use
   HTML::Parse from CPAN (part of the HTML-Tree package).

   Many folks attempt a simple-minded regular expression
   approach, like s/<.*?>//g, but that fails in many cases
   because the tags may continue over line breaks, they may
   contain quoted angle-brackets, or HTML comments may be
   present.  Plus folks forget to convert entities, such as
   &lt;.
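
To make those failure modes concrete, here is a minimal sketch that runs
the naive substitution on three inputs. The helper name naive_strip and
the sample strings are made up for illustration; only the s/<.*?>//g
pattern comes from the text above.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the naive pattern quoted above.
sub naive_strip {
    my ($html) = @_;
    $html =~ s/<.*?>//g;
    return $html;
}

# A quoted '>' ends the match early and leaves part of the tag behind.
my $quoted = naive_strip('<IMG SRC="foo.gif" ALT="A > B">');

# Without /s, '.' does not match a newline, so a tag that spans
# two lines is not matched at all.
my $multiline = naive_strip(qq{<IMG SRC="foo.gif"\n     ALT="pic">});

# Entities are left untouched; decoding them is a separate step.
my $entity = naive_strip('<b>1 &lt; 2</b>');

print "quoted:    [$quoted]\n";
print "multiline: [$multiline]\n";
print "entity:    [$entity]\n";
```

The first call leaves the tail of the tag in the output, the second
returns its input unchanged, and the third still contains &lt; rather
than a literal less-than sign.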

   Here's one "simple-minded" approach that works for most files:

       #!/usr/bin/perl -p0777
       s/<(?:[^>'"]*|(['"]).*?\1)*>//gs
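
The one-liner is dense, so it may help to see the same pattern spelled
out with the /x modifier. This is only a sketch: the names $tag and
strip_tags_x are invented here, and wrapping the substitution in a
subroutine instead of using -p0777 is an assumption made to keep the
example self-contained.

```perl
use strict;
use warnings;

# The same pattern as the one-liner, written out with /x.
my $tag = qr{
    <                   # opening angle bracket
    (?:
        [^>'"]*         # run of characters that are not '>' or a quote
      |
        (['"]) .*? \1   # a quoted string; \1 is the quote that opened it
    )*
    >                   # the '>' that closes the tag
}xs;

# Hypothetical wrapper: strip every match of $tag from a string.
sub strip_tags_x {
    my ($html) = @_;
    $html =~ s/$tag//g;
    return $html;
}

# A tag that spans two lines and has a quoted '>' in an attribute.
my $out = strip_tags_x(qq{<IMG SRC = "foo.gif"\n     ALT = "A > B">caption});
print "[$out]\n";
```

Because the quoted-string branch consumes "A > B" as a unit, the '>'
inside the attribute value no longer ends the tag, and the negated
character class matches across the line break.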

   If you want a more complete solution, see the 3-stage
   striphtml program at
   http://www.perl.com/CPAN/authors/Tom_Christiansen/scripts/striphtml.gz

   Here are some tricky cases that you should think about
   when picking a solution:

       <IMG SRC = "foo.gif" ALT = "A > B">

       <IMG SRC = "foo.gif"
            ALT = "A > B">

       <!-- <A comment> -->

       <script>if (a<b && a>c)</script>

       <# Just data #>

       <![INCLUDE CDATA [ >>>>>>>>>>>> ]]>
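
One way to evaluate a candidate solution is to feed it the cases above
and look at what is left. The sketch below does that for the
"simple-minded" pattern; the helper name strip_tags and the harness
around it are assumptions for illustration, not part of the FAQ text.

```perl
use strict;
use warnings;

# Hypothetical helper applying the one-liner's pattern to a string.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

# The tricky cases listed above, in the same order.
my @cases = (
    qq{<IMG SRC = "foo.gif" ALT = "A > B">},
    qq{<IMG SRC = "foo.gif"\n     ALT = "A > B">},
    q{<!-- <A comment> -->},
    q{<script>if (a<b && a>c)</script>},
    q{<# Just data #>},
    q{<![INCLUDE CDATA [ >>>>>>>>>>>> ]]>},
);

my @results = map { strip_tags($_) } @cases;

# Print each input next to whatever survived the substitution.
for my $i (0 .. $#cases) {
    printf "%-40s => [%s]\n", $cases[$i], $results[$i];
}
```

The two IMG cases come out empty, as intended. The comment leaves a
stray " -->" behind, the script body is mangled to "if (ac)" because
"<b && a>" looks like a tag, and the marked section is cut at its first
'>' so the rest of it leaks through.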

   If HTML comments include other tags, those solutions
   would also break on text like this:

       <!-- This section commented out.
           <B>You can't see me!</B>
       -->
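
The sketch below shows that breakage, followed by one possible
workaround: remove comments in a first pass, then strip tags. The
two-pass idea and the pattern <!--.*?--> are assumptions made here for
illustration. Real comment syntax has more corner cases than that
pattern covers, and this is not a description of what striphtml does.

```perl
use strict;
use warnings;

my $html = qq{before<!-- This section commented out.\n    <B>You can't see me!</B>\n-->after};

# Tag pattern alone: the match ends at the first '>' inside the
# comment, so the commented-out text leaks into the output.
(my $tags_only = $html) =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;

# Assumed workaround: drop whole comments first, then strip tags.
(my $two_pass = $html) =~ s/<!--.*?-->//gs;
$two_pass =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;

print "tags only: [$tags_only]\n";
print "two pass:  [$two_pass]\n";
```

With the tag pattern alone, "You can't see me!" and the closing "-->"
both end up in the result. With the comment pass first, only
"beforeafter" remains. Script bodies would need similar special
handling, which is one reason a real parser is the safer choice.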

