JP Vossen on 26 Oct 2009 12:07:46 -0700



Re: [PLUG] Looking for a poke in the right sed/awk/regex direction


> Date: Sun, 25 Oct 2009 10:07:19 -0400
> From: Doug Stewart <zamoose@gmail.com>
> 
> Howdy all,
> I've got a flat text file that contains a lot of text that was copied from a
> PDF.  Unfortunately, the copy process retained the formatted file's line
> breaks, meaning that the flat text file has many unnecessary line breaks
> that mess with the formatting if you change the dimensions of the view port.
> 
> So what I need is a little sed/awk/regex magic that will search the text
> file for all unnecessary line breaks and strip them out.  You can identify
> the unnecessary line breaks as follows:
> 1) Proper line breaks are followed by a space on the beginning of the next
> line, e.g.
> " The quick brown fox"
> 2) Improper line breaks have no space at the beginning, e.g.
> "jumps over the lazy dog"
> 
> So, I need to
> 1) Detect all occurrences of lines with a leading space that
> 2) Are followed by a line with NO leading space and
> 3) Delete the line break between the two, essentially merging the two lines.

One really brute-force way to handle that is to replace every "\n {word}" 
(newline, space, word) with some temp symbol (}}}@{{{), then remove ALL 
newlines, then replace the temp symbol with a newline.  You can do that 
in MS Word, scarily enough, or in certain text editors that handle 
whole-file regexps (e.g., vi and PFE can, emacs probably can, and NPP 
might, but it's a giant pain).  BUT, read on.

Walt's solution is much better than brute force, but either Walt or I 
have read your requirements wrong, because by my interpretation Walt's 
solution is the opposite of what you want.

Walt's solution will replace newline followed by space with just space, 
which I think will nuke the good newlines but leave the bad ones:
	perl -0777 -pe 's/\n / /g' bad_file > good_file

To replace all "word\nword" (newlines) with "word word" (space), try:
	perl -0777 -pe 's/(\w+)\n(\w+)/$1 $2/g' bad_file > good_file

I think...  Unless I'm the one that read it backwards.  Wouldn't be the 
first time I had a dyslexic moment...  A better sample of input and 
desired output might be helpful.  Like this:

# Test data
$ cat bad_file
Line 1
line 2
 line 3
 line 4
line 5
 line 6
 line 7
line 8


# Mine
$ perl -0777 -pe 's/(\w+)\n(\w+)/$1 $2/g' bad_file
Line 1 line 2
 line 3
 line 4 line 5
 line 6
 line 7 line 8


# Walt's
$ perl -0777 -pe 's/\n / /g' bad_file
Line 1
line 2 line 3 line 4
line 5 line 6 line 7
line 8


To get rid of the leading space you can use Walt's version, tweaked a bit:

$ perl -0777 -pe 's/(\w+)\n(\w+)/$1 $2/g' bad_file | perl -0777 -pe 's/\n /\n/g'
Line 1 line 2
line 3
line 4 line 5
line 6
line 7 line 8


Let us know how it turns out,
JP
----------------------------|:::======|-------------------------------
JP Vossen, CISSP            |:::======|      http://bashcookbook.com/
My Account, My Opinions     |=========|      http://www.jpsdomain.org/
----------------------------|=========|-------------------------------
"Microsoft Tax" = the additional hardware & yearly fees for the add-on
software required to protect Windows from its own poorly designed and
implemented self, while the overhead incidentally flattens Moore's Law.