Michael Grabenstein on Thu, 9 Nov 2000 11:04:57 -0500 (EST)



Re: Interesting problem...


    I know the pitfalls you speak of... The patterns I came up with
effectively skirt around those issues.

    Remember, when run on the command line the performance is great; only
when run as a CGI does the performance lag...

Later,
    Mike

Meng Weng Wong wrote:

> On Thu, Nov 02, 2000 at 04:29:18PM -0500, Michael Grabenstein wrote:
> |
> |     I have written a subroutine that uses regular expressions to
> | parse XML.
>
> i know this isn't the answer you're looking for, but i'm on
> kneejerk duty, so here's the kneejerk response: "you can't
> use regular expressions to write a parser!"
>
>    How do I remove HTML from a string?
>
>    The most correct way (albeit not the fastest) is to use
>    HTML::Parse from CPAN (part of the HTML-Tree package on
>    CPAN).
>
>    Many folks attempt a simple-minded regular expression
>    approach, like s/<.*?>//g, but that fails in many cases
>    because the tags may continue over line breaks, they may
>    contain quoted angle-brackets, or HTML comments may be
>    present.  Plus folks forget to convert entities, like
>    &lt; for example.
>
>    Here's one "simple-minded" approach, that works for most files:
>
>        #!/usr/bin/perl -p0777
>        s/<(?:[^>'"]*|(['"]).*?\1)*>//gs
>
>    If you want a more complete solution, see the 3-stage
>    striphtml program in
>    http://www.perl.com/CPAN/authors/Tom_Christiansen/scripts/striphtml.gz
>    .
>
>    Here are some tricky cases that you should think about
>    when picking a solution:
>
>        <IMG SRC = "foo.gif" ALT = "A > B">
>
>        <IMG SRC = "foo.gif"
>             ALT = "A > B">
>
>        <!-- <A comment> -->
>
>        <script>if (a<b && a>c)</script>
>
>        <# Just data #>
>
>        <![INCLUDE CDATA [ >>>>>>>>>>>> ]]>
>
>    If HTML comments include other tags, those solutions
>    would also break on text like this:
>
>        <!-- This section commented out.
>            <B>You can't see me!</B>
>        -->
>
> **Majordomo list services provided by PANIX <URL:http://www.panix.com>**
> **To Unsubscribe, send "unsubscribe phl" to majordomo@lists.pm.org**
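
For anyone following along, here is a rough sketch of how the quoted
one-liner behaves on a few of the tricky cases listed above. The
substitution is copied from the quoted text; the strip_tags wrapper,
the labels, and the choice of four cases are only illustrative and are
not part of the quoted material or of my own parsing subroutine.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: applies the substitution from the quoted
# one-liner to a copy of the input and returns the result.
sub strip_tags {
    my ($text) = @_;
    $text =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $text;
}

# A few of the tricky cases from the quoted list.
my @cases = (
    [ 'quoted > in an attribute', q{<IMG SRC = "foo.gif" ALT = "A > B">} ],
    [ 'tag split across lines',   qq{<IMG SRC = "foo.gif"\n     ALT = "A > B">} ],
    [ 'comment containing a tag', q{<!-- <A comment> -->} ],
    [ 'script with < and >',      q{<script>if (a<b && a>c)</script>} ],
);

# Print each result in brackets so leftover whitespace is visible.
for my $case (@cases) {
    my ($label, $input) = @$case;
    printf "%-26s => [%s]\n", $label, strip_tags($input);
}
```

The two IMG cases come back empty, so quoted angle-brackets and line
breaks are handled. The comment case leaves " -->" behind, and the
script case leaves "if (ac)", so the pattern still treats comment and
script content as ordinary tags.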

--
kill -9'em All and let Root sort'em out
         --From Slashdot
Opinions, Flames, Irritations are solely mine.
Useful, Productive information... the company claims.

