Looking for a poke in the right sed/awk/regex direction

On Sun, Oct 25, 2009 at 10:07:19AM -0400, Doug Stewart wrote:
> Howdy all,
> I've got a flat text file that contains a lot of text that was copied from a
> PDF.  Unfortunately, the copy process retained the formatted file's line
> breaks, meaning that the flat text file has many unnecessary line breaks
> that mess with the formatting if you change the dimensions of the view port.
> So what I need is a little sed/awk/regex magic that will search the text
> file for all unnecessary line breaks and strip them out.  You can identify
> the unnecessary line breaks as follows:
> 1) Proper line breaks are followed by a space on the beginning of the next
> line, e.g.
> " The quick brown fox"
> 2) Improper line breaks have no space at the beginning, e.g.
> "jumps over the lazy dog"
> So, I need to
> 1) Detect all occurrences of lines with a leading space that
> 2) Are followed by a line with NO leading space and
> 3) Delete the line break between the two, essentially merging the two lines.
> Any ideas?

I don't know sed and awk well enough to know for sure, but my
impression is that they generally work on single lines at a time, so
it might be tricky to get them to do something that involves two
consecutive lines.  However, it's really easy to do this in a perl
one-liner:

  perl -0777 -pe 's/\n / /g'

The combination of -0777 and -p makes perl read in the entire file,
storing it in $_.  Also, after the substitution command (the part
inside the single quotes), -p prints out what's in $_.  The
substitution command, s/\n / /g, globally replaces each occurrence of
newline-space in $_ by a space.

So this is really shorthand for the following:

  my $doc = '';
  while (my $line = <>) {
    $doc .= $line;      # append each input line to one big string
  }
  $doc =~ s/\n / /g;    # replace every newline-space pair with a space
  print $doc;

