Frank Szczerba on 22 Oct 2012 14:15:10 -0700



Re: [PLUG] Perl one-liner to remove duplicates without changing file order


I'm way behind on email, so apologies if someone else has already mentioned this, but the -i doesn't work because you're printing your output in an END block. If you give perl -i multiple files to process, it will edit each one in place separately, but the END block doesn't run until after all of the files have been processed.

You can make this work with -i by doing:

$ perl -i -ne '$line{$_} = $.; if (eof) { for (sort{$line{$a}<=>$line{$b}} keys %line) {print} }'

If you really are processing multiple files, you probably also want to clear %line in between files:

$ perl -i -ne '$line{$_} = $.; if (eof) { for (sort{$line{$a}<=>$line{$b}} keys %line) {print} %line = () }'
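
For example, run against several history files at once (the file names here are just placeholders, and -i.bak keeps a backup of each original):

$ perl -i.bak -ne '$line{$_} = $.; if (eof) { for (sort{$line{$a}<=>$line{$b}} keys %line) {print} %line = () }' one.hist two.hist

Note that bare eof (no parentheses) is the right test here: it is true at the end of each input file, while eof() with empty parens is true only at the end of the very last file, so the per-file print-and-reset would never fire for the earlier ones.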

Frank

On Oct 12, 2012, at 6:03 PM, JP Vossen <jp@jpsdomain.org> wrote:

> I found myself having to process a "history" file and remove duplicates (which may or may not be adjacent).  Normally I'd use 'sort -u' or 'uniq' for that, or possibly some kind of hash, but I also needed to:
> 	1) Preserve the order of the lines
> 	2) Keep the last occurrence (not the first) of the dup
> 
> All of the above solutions break both of those constraints, but here is a Perl one-liner that works:
> 	$ perl -ne '$line{$_} = $.; END { for (sort{$line{$a}<=>$line{$b}} keys %line) {print} }' /tmp/sample.hist
> 
> Breakdown (best viewed in monospace font):
> 	perl -ne '       # -n assume "while (<>) { ... }" loop around
> 	                 #   prog, don't automatically print
> 	                 # -e <program>   one line of program
> 	$line{$_} = $.;  # Build %line hash with the current input
> 	                 #   line $_ as key and the current input
> 	                 #   line number $. as value
> 	END {            # END {} block, executed only once,
> 	                 #   *after* the assumed while() loop
> 	for (            # Loop through ...
> 	sort{$line{$a}<=>$line{$b}}
> 	                 # ...while sorting by the *value*
> 	                 #   (input line number), not the key...
> 	keys %line)      # ...the keys of the %line hash
> 	{print} }'       # and print each line (then end the END{} block)
> 	/tmp/sample.hist # from this file
> 
> The magic is in the %line hash.  Since the hash key is the actual line I'm worried about, the program's memory use is roughly the size of the file plus a little overhead, minus any duplicates.  That's because when I hit a duplicate, the only change is that the input line number (the hash value) is updated.  So when I sort the hash *by input line number* I get only one copy of each line, in the place where it occurred last, which I print.
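
For reference, here is the same logic written out as a stand-alone script; a functionally equivalent sketch, not code from the original post:

	#!/usr/bin/perl
	use strict;
	use warnings;

	my %line;
	while (<>) {
	    # Duplicates simply overwrite, so the last occurrence's
	    # input line number ($.) wins.
	    $line{$_} = $.;
	}
	# Emit one copy of each line, ordered by where it last appeared.
	for (sort { $line{$a} <=> $line{$b} } keys %line) {
	    print;
	}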
> 
> One other bit of magic is the sort{} block, where I give sort explicit instructions on *what* to sort and *how*.  By default it would do an alpha (string) sort on the hash keys (sort keys %line), while I require a numeric sort, via the "spaceship" operator <=> (I swear I am not making this up), on the hash *values*.  In Perl that means: sort{$line{$a}<=>$line{$b}}.  Or a reverse sort (read carefully): sort{$line{$b}<=>$line{$a}}.
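
A quick way to see the difference between the default string sort on keys and the numeric sort on values (throwaway example data, not from the post):

	$ perl -e '%h = (apple => 10, pear => 2, fig => 1);
	      print "keys:     ", join(" ", sort keys %h), "\n";
	      print "by value: ", join(" ", sort { $h{$a} <=> $h{$b} } keys %h), "\n";
	      print "reversed: ", join(" ", sort { $h{$b} <=> $h{$a} } keys %h), "\n"'
	keys:     apple fig pear
	by value: fig pear apple
	reversed: apple pear fig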
> 
> Caveat: 'perl -i' does *not* work with this, I'm not sure why.  You need to do it the old-fashioned way:
> 	$ cp -av /tmp/sample.hist /tmp/sample.hist.bak && perl -ne '$line{$_} = $.; END { for (sort{$line{$a}<=>$line{$b}} keys %line) {print} }' /tmp/sample.hist.bak > /tmp/sample.hist
> 
> 
> Just for fun, this one will remove dups (including non-adjacent ones) but keep the *first* occurrence (not the last, which I need).  It is even simpler but more obscure: it bumps a counter in the hash for each input line, and since the post-increment returns the old value, the 'or print' fires only the first time a given line is seen:
> 	perl -ne '$line{$_}++ or print' /tmp/sample.hist
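
As an aside, on systems with GNU tac you can get "keep the last occurrence" out of that same simple idiom by reversing the input, keeping the first occurrence, and reversing back (a sketch, not from the original post):

	$ tac /tmp/sample.hist | perl -ne '$line{$_}++ or print' | tac

The first tac puts each line's last occurrence earliest, the hash then keeps exactly that one, and the second tac restores the original order, matching the output of the sort-based one-liner above.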
> 
> These are awesome examples of the power, subtlety and unreadability of Perl and Perlisms.  :-)  I mean seriously, does that not look like line noise?
> 
> 
> --------------------------------------------------------------------------------
> Annotated sample:
> 
> $ cat -n /tmp/sample.hist
>     1  line1
>     2  line2
>     3  line3
>     4  1dup3x		# bad dup
>     5  2dup2x		# bad dup
>     6  line4
>     7  1dup3x		# bad dup
>     8  line5
>     9  line6
>    10  1dup3x		# good dup
>    11  line7
>    12  line8
>    13  2dup2x		# good dup
>    14  line9
> 
> $ perl -ne '$line{$_} = $.; END { for (sort{$line{$a}<=>$line{$b}} keys %line) {print} }' /tmp/sample.hist | cat -n
>     1  line1
>     2  line2
>     3  line3
>     4  line4
>     5  line5
>     6  line6
>     7  1dup3x		# now unique
>     8  line7
>     9  line8
>    10  2dup2x		# now unique
>    11  line9
> 
> 
> ### Neat, but not what I needed
> $ perl -ne '$line{$_}++ or print' /tmp/sample.hist | cat -n
>     1  line1
>     2  line2
>     3  line3
>     4  1dup3x
>     5  2dup2x
>     6  line4
>     7  line5
>     8  line6
>     9  line7
>    10  line8
>    11  line9
> 
> 
> Enjoy,
> JP
> ----------------------------|:::======|-------------------------------
> JP Vossen, CISSP            |:::======|      http://bashcookbook.com/
> My Account, My Opinions     |=========|      http://www.jpsdomain.org/
> ----------------------------|=========|-------------------------------
> "Microsoft Tax" = the additional hardware & yearly fees for the add-on
> software required to protect Windows from its own poorly designed and
> implemented self, while the overhead incidentally flattens Moore's Law.

___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug