Walt Mankowski on 25 Oct 2009 08:05:32 -0700
On Sun, Oct 25, 2009 at 10:07:19AM -0400, Doug Stewart wrote:
> Howdy all,
> I've got a flat text file that contains a lot of text that was copied from a
> PDF. Unfortunately, the copy process retained the formatted file's line
> breaks, meaning that the flat text file has many unnecessary line breaks
> that mess with the formatting if you change the dimensions of the view port.
>
> So what I need is a little sed/awk/regex magic that will search the text
> file for all unnecessary line breaks and strip them out. You can identify
> the unnecessary line breaks as follows:
> 1) Proper line breaks are followed by a space on the beginning of the next
> line, e.g.
> " The quick brown fox"
> 2) Improper line breaks have no space at the beginning, e.g.
> "jumps over the lazy dog"
>
> So, I need to
> 1) Detect all occurrences of lines with a leading space that
> 2) Are followed by a line with NO leading space and
> 3) Delete the line break between the two, essentially merging the two lines.
>
> Any ideas?

I don't know sed and awk well enough to know for sure, but my
impression is that they generally work on a single line at a time, so it
might be tricky to get them to do something that involves two
consecutive lines. However, it's really easy to do this in a perl
one-liner:

    perl -0777 -pe 's/\n / /g'

The combination of -0777 and -p makes perl read in the entire file,
storing it in $_. Also, after the substitution command (the part
inside the single quotes), -p prints out what's in $_.

The substitution command, s/\n / /g, globally replaces each occurrence
of newline-space in $_ with a space. So this is really shorthand for
the following:

    my $doc;
    while (my $line = <>) {
        $doc .= $line;
    }
    $doc =~ s/\n / /g;
    print $doc;
___________________________________________________________________________
Philadelphia Linux Users Group -- http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug