JP Vossen on 3 Dec 2009 22:13:56 -0800



Re: [PLUG] sed newbie question


> Date: Thu, 3 Dec 2009 18:30:52 -0500
> From: Michael Lazin <microlaser@gmail.com>
> 
> Hi, I am interested in writing a shell script that will remove malicious
> iframes from peoples websites.  I am a scripting newbie and a sed newbie.  I
> decided to start by playing around with it.
> 
> sed '/iframe/d' test.html

First, what everyone else said about parsing HTML is true: it's 
really painful to do except via a for-real-and-true HTML parser.

Second, there's a great quote "Some people, when confronted with a 
problem, think 'I know, I'll use regular expressions.'  Now they have 
two problems."  I love that quote, but I also love regular expressions 
and use them every day.  But in the case of parsing HTML, it's really true.

Third, if you want to know more about regexps, go out and buy _Mastering 
Regular Expressions_ (AKA MRE) and a big bottle of aspirin.  That is THE 
book on regexps, but you'll need the aspirin.  Not because the book is 
bad, it's not, it's awesome, but because regular expressions are just 
complicated, and the book *is* dense.


So I'll shut up about all of that.  If I were utterly forced to do this, 
I'd use a Perl one-liner (probably wrapped in a shell script).  'sed' is 
smaller, so there is way less overhead to use it, but unless you are 
going to process millions of pages, the hassle isn't worth it IMO.


For more details on the Perl solution, see Walt's Perl one-liner slides, 
they are great: http://www.mawode.com/~waltman/talks/one_liners.plugw.pdf


# This is almost what you want, for certain values of "want."
perl -0 -pe 's!<iframe>.*?</iframe>!!gis;' test.html

-0 sets the input record separator (normally '\n') to the character 
given as an octal or hex number.  We omitted the number, which means the 
null character, and since an HTML file shouldn't contain any NUL bytes 
we effectively "slurp" the file in one big gulp (thanks Walt!).  The 
official slurp mode is -0777.  If the file is bigger than available RAM 
that'll be a problem, but I doubt any reasonable HTML file is, and the 
slurped file lets us deal with multi-line iframe elements.

-p wraps a read-a-record-then-print-it loop around your code, see perl -h.

-e is your expression (code)

s/// is the regular expression substitution operator (the sed part). 
But if we used '/' as a delimiter we'd have to escape some in the 
expression, so we pick-our-own-delimiters and I used '!' (gotta love 
Perl.  Or hate it.  Whichever.).  'g' is for a global change, instead of 
just the first, 'i' is ignore case, and 's' causes '.' to match on \n, 
which allows multi-line hits.

/.*?/ is a non-greedy match, which is what you need and is usually 
faster here than a greedy match (which would be /.*/, read MRE to learn 
why).  A greedy match would run from the first "<iframe>" to the *last* 
"</iframe>" in the file and take everything in between with it.  So 
we're saying: find a (case insensitive) "<iframe>" string, then 
*anything* else but only until you get to a (case insensitive) 
"</iframe>" string, and replace the whole thing with nothing.  That'll 
still possibly leave some extra newlines lying around in the HTML code, 
but that shouldn't bother anything.

That'll all go to STDOUT for testing.  When you are happy, you can 
either do:

perl -0 -pe 's!<iframe>.*?</iframe>!!gis;' test.html > fixed.test.html

and fiddle around with the files as needed in shell code, or:

perl -i -0 -pe 's!<iframe>.*?</iframe>!!gis;' test.html

which will do an in-place edit (more-or-less) with no backup.  Use 
-i.bak (no space before the suffix) or something for a backup but that'll 
clutter up your dirs.  Note you shouldn't bundle -i or -0 with other 
flags: both take optional arguments, so whatever follows -i in the same 
cluster (as in -ipe) is read as the backup suffix.
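
Here is a throwaway sketch of the in-place edit with a backup suffix, 
run against a scratch copy so nothing real gets touched.  The temp 
directory and the sample file contents are assumptions for the example; 
the one-liner itself is the same as above with the suffix attached 
directly to -i.

```shell
# Work in a scratch directory with a made-up multi-line test page.
workdir=$(mktemp -d)
printf '<p>a</p>\n<iframe>\nbad\n</iframe>\n<p>b</p>\n' > "$workdir/test.html"

# -i.bak: the suffix is glued to -i, and -0 and -pe stay separate.
perl -i.bak -0 -pe 's!<iframe>.*?</iframe>!!gis;' "$workdir/test.html"

cat "$workdir/test.html"   # edited file, iframe removed
ls "$workdir"              # shows test.html and test.html.bak
```

The original is left untouched in test.html.bak, so a diff between the 
two shows exactly what was cut.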


As noted this is not a "perfect" solution.  But it's probably good 
enough.  Just test the hell out of it, and do some kind of backup (even 
just a big zip or tarball) first!

Good luck,
JP
----------------------------|:::======|-------------------------------
JP Vossen, CISSP            |:::======|      http://bashcookbook.com/
My Account, My Opinions     |=========|      http://www.jpsdomain.org/
----------------------------|=========|-------------------------------
"Microsoft Tax" = the additional hardware & yearly fees for the add-on
software required to protect Windows from its own poorly designed and
implemented self, while the overhead incidentally flattens Moore's Law.