Walt Mankowski on 25 Oct 2009 08:05:32 -0700

On Sun, Oct 25, 2009 at 10:07:19AM -0400, Doug Stewart wrote:
> Howdy all,
> I've got a flat text file that contains a lot of text that was copied from a
> PDF. Unfortunately, the copy process retained the formatted file's line
> breaks, meaning that the flat text file has many unnecessary line breaks
> that mess with the formatting if you change the dimensions of the view port.
>
> So what I need is a little sed/awk/regex magic that will search the text
> file for all unnecessary line breaks and strip them out. You can identify
> the unnecessary line breaks as follows:
> 1) Proper line breaks are followed by a space on the beginning of the next
> line, e.g.
> " The quick brown fox"
> 2) Improper line breaks have no space at the beginning, e.g.
> "jumps over the lazy dog"
>
> So, I need to
> 1) Detect all occurrences of lines with a leading space that
> 2) Are followed by a line with NO leading space and
> 3) Delete the line break between the two, essentially merging the two lines.
>
> Any ideas?
I don't know sed and awk well enough to know for sure, but my
impression is that they generally work on single lines at a time, so
it might be tricky to get them to do something that involves two
consecutive lines. However, it's really easy to do this in a perl
one-liner:

    perl -0777 -pe 's/\n(?=\S)/ /g'

The combination of -0777 and -p makes perl read in the entire file,
storing it in $_. Also, after the substitution command (the part
inside the single quotes), -p prints out what's in $_. The
substitution command, s/\n(?=\S)/ /g, globally replaces each newline
in $_ that is immediately followed by a non-whitespace character with
a space. A newline followed by a space (a proper line break, by your
rule) is left alone, and so is the newline at the end of the file.

So this is really shorthand for the following:

    my $doc = '';
    while (my $line = <>) {
        $doc .= $line;
    }
    $doc =~ s/\n(?=\S)/ /g;
    print $doc;

___________________________________________________________________________
Philadelphia Linux Users Group -- http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug