JP Vossen on 26 Oct 2009 12:07:46 -0700
> Date: Sun, 25 Oct 2009 10:07:19 -0400
> From: Doug Stewart <zamoose@gmail.com>
>
> Howdy all,
> I've got a flat text file that contains a lot of text that was copied from a
> PDF. Unfortunately, the copy process retained the formatted file's line
> breaks, meaning that the flat text file has many unnecessary line breaks
> that mess with the formatting if you change the dimensions of the view port.
>
> So what I need is a little sed/awk/regex magic that will search the text
> file for all unnecessary line breaks and strip them out. You can identify
> the unnecessary line breaks as follows:
> 1) Proper line breaks are followed by a space on the beginning of the next
> line, e.g.
> " The quick brown fox"
> 2) Improper line breaks have no space at the beginning, e.g.
> "jumps over the lazy dog"
>
> So, I need to
> 1) Detect all occurrences of lines with a leading space that
> 2) Are followed by a line with NO leading space and
> 3) Delete the line break between the two, essentially merging the two lines.

One really brute force way to handle that is to replace all "\n {word}"
with some temp symbol (}}}@{{{), then remove ALL newlines, then replace
the temp symbol with newline. You can do that in MS Word, scarily
enough, or certain text editors that handle whole-file regexps (e.g.,
vi, PFE (and probably emacs) can, NPP might but it's a giant pain).

BUT, read on.

Walt's solution is much better than brute force, but either Walt or I
have read your requirements wrong, because by my interpretation Walt's
solution is the opposite of what you want.

Walt's solution will replace newline followed by space with just space,
which I think will nuke the good newlines but leave the bad ones:
	perl -0777 -pe 's/\n / /g' bad_file > good_file

To replace all "word\nword" (newlines) with "word word" (space), try:
	perl -0777 -pe 's/(\w+)\n(\w+)/$1 $2/g' bad_file > good_file

I think... Unless I'm the one that read it backwards. Wouldn't be the
first time I had a dyslexic moment...
A better sample of input and desired output might be helpful. Like this:

# Test data
$ cat bad_file
Line 1 line 2 line 3 line 4 line 5 line 6 line 7 line 8

# Mine
$ perl -0777 -pe 's/(\w+)\n(\w+)/$1 $2/g' bad_file
Line 1 line 2 line 3 line 4 line 5 line 6 line 7 line 8

# Walt's
$ perl -0777 -pe 's/\n / /g' bad_file
Line 1 line 2 line 3 line 4 line 5 line 6 line 7 line 8

To get rid of the leading space you can use Walt's version, tweaked a bit:

$ perl -0777 -pe 's/(\w+)\n(\w+)/$1 $2/g' bad_file | perl -0777 -pe 's/\n /\n/g'
Line 1 line 2 line 3 line 4 line 5 line 6 line 7 line 8

Let us know how it turns out,
JP
----------------------------|:::======|-------------------------------
JP Vossen, CISSP            |:::======|      http://bashcookbook.com/
My Account, My Opinions     |=========|      http://www.jpsdomain.org/
----------------------------|=========|-------------------------------
"Microsoft Tax" = the additional hardware & yearly fees for the add-on
software required to protect Windows from its own poorly designed and
implemented self, while the overhead incidentally flattens Moore's Law.
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug