Re: [PLUG] Looking for a poke in the right sed/awk/regex direction
> Date: Sun, 25 Oct 2009 10:07:19 -0400
> From: Doug Stewart <zamoose@gmail.com>
>
> Howdy all,
> I've got a flat text file that contains a lot of text that was copied from a
> PDF. Unfortunately, the copy process retained the formatted file's line
> breaks, meaning that the flat text file has many unnecessary line breaks
> that mess with the formatting if you change the dimensions of the view port.
>
> So what I need is a little sed/awk/regex magic that will search the text
> file for all unnecessary line breaks and strip them out. You can identify
> the unnecessary line breaks as follows:
> 1) Proper line breaks are followed by a space on the beginning of the next
> line, e.g.
> " The quick brown fox"
> 2) Improper line breaks have no space at the beginning, e.g.
> "jumps over the lazy dog"
>
> So, I need to
> 1) Detect all occurrences of lines with a leading space that
> 2) Are followed by a line with NO leading space and
> 3) Delete the line break between the two, essentially merging the two lines.

One really brute-force way to handle that is to replace every "\n {word}"
with some temp symbol (}}}@{{{), then remove ALL the remaining newlines
(replacing each with a space so the words don't run together), then
replace the temp symbol with a newline. You can do that in MS Word,
scarily enough, or in any text editor that handles whole-file regexps
(vi and PFE can, emacs probably can, and NPP might, but it's a giant pain).
BUT, read on.

Walt's solution is much better than brute force, but either Walt or I
have read your requirements wrong, because by my interpretation Walt's
solution does the opposite of what you want.

Walt's solution replaces a newline followed by a space with just a space,
which I think will nuke the good newlines but leave the bad ones:

    perl -0777 -pe 's/\n / /g' bad_file > good_file

To replace every "word\nword" (newline) with "word word" (space), try:

    perl -0777 -pe 's/(\w+)\n(\w+)/$1 $2/g' bad_file > good_file

I think... Unless I'm the one who read it backwards. Wouldn't be the
first time I had a dyslexic moment... A better sample of input and
desired output might be helpful. Like this:

# Test data
$ cat bad_file
Line 1
line 2
 line 3
 line 4
line 5
 line 6
 line 7
line 8

# Mine
$ perl -0777 -pe 's/(\w+)\n(\w+)/$1 $2/g' bad_file
Line 1 line 2
 line 3
 line 4 line 5
 line 6
 line 7 line 8

# Walt's
$ perl -0777 -pe 's/\n / /g' bad_file
Line 1
line 2 line 3 line 4
line 5 line 6 line 7
line 8

To get rid of the leading spaces you can use Walt's version, tweaked a bit:

$ perl -0777 -pe 's/(\w+)\n(\w+)/$1 $2/g' bad_file |
    perl -0777 -pe 's/\n /\n/g'
Line 1 line 2
line 3
line 4 line 5
line 6
line 7 line 8
Let us know how it turns out,
JP
----------------------------|:::======|-------------------------------
JP Vossen, CISSP |:::======| http://bashcookbook.com/
My Account, My Opinions |=========| http://www.jpsdomain.org/
----------------------------|=========|-------------------------------
"Microsoft Tax" = the additional hardware & yearly fees for the add-on
software required to protect Windows from its own poorly designed and
implemented self, while the overhead incidentally flattens Moore's Law.
___________________________________________________________________________
Philadelphia Linux Users Group -- http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug