Re: [PLUG] sed newbie question

JP Vossen on 3 Dec 2009 22:13:56 -0800



From: JP Vossen <jp@jpsdomain.org>

To: plug@lists.phillylinux.org

Subject: Re: [PLUG] sed newbie question

Date: Fri, 04 Dec 2009 01:13:48 -0500

Reply-to: Philadelphia Linux User's Group Discussion List <plug@lists.phillylinux.org>

Sender: plug-bounces@lists.phillylinux.org

User-agent: Thunderbird 2.0.0.23 (Windows/20090812)

> Date: Thu, 3 Dec 2009 18:30:52 -0500
> From: Michael Lazin <microlaser@gmail.com>
>
> Hi, I am interested in writing a shell script that will remove malicious
> iframes from peoples websites. I am a scripting newbie and a sed newbie. I
> decided to start by playing around with it.
>
> sed '/iframe/d' test.html

First, what everyone else said about parsing HTML is true: it's really painful to do except via a for-real-and-true HTML parser.

Second, there's a great quote: "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." I love that quote, but I also love regular expressions and use them every day. But in the case of parsing HTML, it's really true.

Third, if you want to know more about regexps, go out and buy _Mastering Regular Expressions_ (AKA MRE) and a big bottle of aspirin. That is THE book on regexps, but you'll need the aspirin. Not because the book is bad (it's not, it's awesome), but because regular expressions are just complicated, and the book *is* dense.

So I'll shut up about all of that. If I were utterly forced to do this, I'd use a Perl one-liner (probably wrapped in a shell script). 'sed' is smaller, so there is way less overhead to use it, but unless you are going to process millions of pages, the hassle isn't worth it IMO. For more details on the Perl solution, see Walt's Perl one-liner slides, they are great:

    http://www.mawode.com/~waltman/talks/one_liners.plugw.pdf

    # This is almost what you want, for certain values of "want."
    perl -0 -pe 's!<iframe>.*?</iframe>!!gis;' test.html

-0 changes the default '\n' input record separator to the character given as an octal or hex number. We omitted the number, which makes the separator the NUL character; since HTML files normally don't contain any, we effectively "slurp" the file in one big gulp (thanks Walt!). If the file is bigger than available RAM that'll be a problem, but I doubt any reasonable HTML file is, and the slurped file lets us deal with multi-line iframe elements.

-p wraps a loop around your code that reads each record, runs the code on it, and prints the result, see perl -h.
-e is your expression (code).

s/// is the regular expression substitution operator (the sed part). But if we used '/' as a delimiter we'd have to escape the one in "</iframe>", so we pick our own delimiters and I used '!' (gotta love Perl. Or hate it. Whichever.). 'g' is for a global change, instead of just the first, 'i' is ignore case, and 's' causes '.' to match on \n, which allows multi-line hits.

/.*?/ is a non-greedy match, which is what you need here: a greedy match (which would be /.*/) would run from the first "<iframe>" all the way to the last "</iframe>" in the file and delete everything in between (read MRE to learn about why).

So we're saying: find a (case insensitive) "<iframe>" string, then *anything* else but only until you get to a (case insensitive) "</iframe>" string, and replace the whole thing with nothing. Note that an opening tag with attributes, such as <iframe src=...>, is not matched by this pattern as written. That'll still possibly leave some extra newlines lying around in the HTML code, but that shouldn't bother anything.

That'll all go to STDOUT for testing. When you are happy, you can either do:

    perl -0 -pe 's!<iframe>.*?</iframe>!!gis;' test.html > fixed.test.html

and fiddle around with the files as needed in shell code, or:

    perl -i -0 -pe 's!<iframe>.*?</iframe>!!gis;' test.html

which will do an in-place edit (more-or-less) with no backup. Use -i.bak (no space between the flag and the suffix) or something for a backup, but that'll clutter up your dirs. Note that you shouldn't bundle -i or -0 with other flags (as in -ipe), since both take optional arguments and whatever follows may be read as that argument.

As noted this is not a "perfect" solution. But it's probably good enough. Just test the hell out of it, and do some kind of backup (even just a big zip or tarball) first!
Good luck,
JP
----------------------------|:::======|-------------------------------
JP Vossen, CISSP            |:::======|      http://bashcookbook.com/
My Account, My Opinions     |=========|      http://www.jpsdomain.org/
----------------------------|=========|-------------------------------
"Microsoft Tax" = the additional hardware & yearly fees for the add-on
software required to protect Windows from its own poorly designed and
implemented self, while the overhead incidentally flattens Moore's Law.
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug
