Doug Stewart on 28 Oct 2009 15:31:55 -0700


Re: [PLUG] Looking for a poke in the right sed/awk/regex direction


On 10/26/09, JP Vossen <jp@jpsdomain.org> wrote:
> To replace all "word\nword" (newlines) with "word word" (space), try:
> 	perl -0777 -pe 's/(\w+)\n(\w+)/$1 $2/g' bad_file > good_file

JP:
Yours was the closest answer. The secret was in the multi-line /m flag.

I passed the text through this:

perl -0777 -pe 's/\n(\S+)/ $1/gm' bad_file > good_file

...and it resulted in a much, MUCH cleaner file.  Note that I removed
the requirement for a \w word match to begin the expression and subbed
in a \S for the second \w. With the multi-line flag and a match on
non-whitespace instead of word characters (because a line could
conceivably start with a quote or other punctuation), I reached
almost-data-processing-Nirvana.

There's still a little manual clean-up (and I'm going to want to trim
leading whitespace off all lines -- a trivial task now), but by and
large, after much experimentation (and cursing at
http://regexr.com), I've got the file into a workable format.

Thanks much!

-- 
-Doug
http://literalbarrage.org/blog/
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug