Fred Stluka on 4 Dec 2009 13:01:12 -0800



Re: [PLUG] sed newbie question


Keep in mind that most DOM parsers require syntactically correct
incoming HTML, that is, "XHTML" rather than the plain old sloppy HTML
that most Web pages still use.  For example, tags must be properly
nested, as:
    <b><i>text</i></b>
not improperly as:
    <b><i>text</b></i>
and no end tags can be missing, etc.  So a DOM parser may not work on
most of today's Web pages.

--Fred
---------------------------------------------------------------------
Fred Stluka -- mailto:fred@bristle.com -- http://bristle.com/~fred/
Bristle Software, Inc -- http://bristle.com -- Glad to be of service!
---------------------------------------------------------------------


Eric wrote:
Michael:

There was a recent thread somewhere (I don't recall where) that concluded
"Don't use regular expressions to parse HTML!"  REs are very powerful, but HTML
can be quite complex and even irregular, and REs are not the right tool for
building a parser.

While writing this I noticed that Sean suggested DOM manipulation with Python.
Excellent idea.  sed, Bash, etc. just don't have what you're going to need to
create a reliable, effective solution.

Good luck.

Eric

Michael Lazin wrote:
  
Yeah, it's just a proof of concept; obviously this is gonna take some
work.  Out of curiosity, is there a way to insert with sed, so you could
do something like inserting <!-- --> around the <iframe></iframe> tags?
This might be better than removing a whole line of code.
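
One possible way to do the insertion asked about above, as a rough sketch
only.  It assumes each iframe opens and closes on the same line, and the
function name wrap_iframes is made up for illustration.  It relies on "&" in
the sed replacement, which stands for whatever the pattern matched.

```shell
#!/bin/sh
# wrap_iframes is a hypothetical helper name, not something from the thread.
# It reads HTML on stdin and writes the result to stdout.
wrap_iframes() {
    # '&' in the replacement is the whole matched text, so the iframe is
    # kept and surrounded with comment markers instead of being deleted.
    # '|' is the delimiter so the '/' in </iframe> needs no escaping.
    # Assumption: the opening and closing tags are on the same line.
    sed -e 's|<iframe[^>]*>.*</iframe>|<!-- & -->|g'
}

# Example run on one line of made-up input.
printf '%s\n' '<p>before</p><iframe src="http://example.com/"></iframe><p>after</p>' | wrap_iframes
```

With that sample line the output should be:
    <p>before</p><!-- <iframe src="http://example.com/"></iframe> --><p>after</p>
Because ".*" is greedy, two iframes on one line end up inside a single
comment together with anything between them, and an iframe split across
several lines is not matched at all.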

    
On Dec 3, 2009 6:46 PM, "Douglas Muth" <doug.muth@gmail.com> wrote:

On Thu, Dec 3, 2009 at 6:30 PM, Michael Lazin <microlaser@gmail.com> wrote:
> Hi, I am interested in...

No idea, but I can tell you how I would do it:

sed -e 's/iframe//g' test.html

Keep in mind that with that specific regexp, you'll be left with
broken HTML code.  I assume that's a proof of concept, though. :-)
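
To make the "broken HTML" point concrete, here is a small sketch that runs
the same substitution on one made-up line of input.  The function name
strip_iframe_word and the sample tag are assumptions for illustration only.

```shell
#!/bin/sh
# strip_iframe_word is a hypothetical name; it applies the same
# substitution as the one-liner above, reading stdin instead of test.html.
strip_iframe_word() {
    # Deletes only the literal text "iframe", leaving the angle brackets,
    # the slash, and any attributes behind.
    sed -e 's/iframe//g'
}

printf '%s\n' '<iframe src="ad.html"></iframe>' | strip_iframe_word
```

That prints:
    < src="ad.html"></>
which is the leftover markup the note above is warning about.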

Hope that helps,

-- Doug
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug
      