JP Vossen on 12 Oct 2012 15:03:49 -0700


[Date Prev] [Date Next] [Thread Prev] [Thread Next] [Date Index] [Thread Index]

[PLUG] Perl one-liner to remove duplicates without changing file order


I found myself having to process a "history" file and remove duplicates (which may or may not be adjacent). Normally I'd use 'sort -u' or 'uniq' for that, or possibly some kind of hash, but I also needed to:
	1) Preserve the order of the lines
	2) Keep the last occurrence (not the first) of the dup

All of the above solutions break both of those constraints, but here is a Perl one-liner that works: $ perl -ne '$line{$_} = $.; END { for (sort{$line{$a}<=>$line{$b}} keys %line) {print} }' /tmp/sample.hist

Breakdown (best viewed in monospace font):
	perl -ne '       # -n assume "while (<>) { ... }" loop around
                           # prog, don't automatically print
                         # -e <program>        one line of program
	$line{$_} = $.   # Build %line hash with current input
                           # line $_ as key and current input line
                           # number $. as value
	END {            # END {} block, executed only once
                           # *after* the assumed while() loop
	for (            # Loop through ...
	sort{$line{$a}<=>$line{$b}}
                         # ...while sorting by the *value*
                           # (input line number) not the key...
	keys %line)      # ...the %line hash
	{print} }'       # and print the line (then end the END{} block)
	/tmp/sample.hist # from this file

The magic is in the %line hash. Since the hash key is the actual line I'm worried about, the program uses memory only for the size of the file plus a little overhead, but minus any duplicates. That's because when I hit a duplicate, the only change is that the input line number (hash value) is updated. So when I sort the hash *by input line number* I get only 1 copy of each line, in the place where it occurred last, which I print.

One other bit of magic is the sort{} command, where I give sort explicit instructions on *what* to sort *how*. By default, it would do an alpha sort on hash key (sort %line), while I require a numeric sort (via the "spaceship" operator <=> (I swear I am not making this up)) on the hash value. In Perl that means: sort{$line{$a}<=>$line{$b}}. Or a reverse sort (read carefully): sort{$line{$b}<=>$line{$a}}.

Caveat: 'perl -i' does *not* work with this, I'm not sure why. You need to do it the old fashioned way: $ cp -av /tmp/sample.hist /tmp/sample.hist.bak && perl -ne '$line{$_} = $.; END { for (sort{$line{$a}<=>$line{$b}} keys %line) {print} }' /tmp/sample.hist.bak > /tmp/sample.hist


Just for fun, this one will remove dups (including non-adjacent) but keep the first (not last which I need) occurrence. It is even simpler but more obscure; it adds the input line to the hash but if it's not already there also prints it:
	perl -ne '$line{$_}++ or print' /tmp/sample.hist

These are awesome examples of the power, subtlety and unreadability of Perl and Perlisms. :-) I mean seriously, does that not look like line noise?


--------------------------------------------------------------------------------
Annotated sample:

$ cat -n /tmp/sample.hist
     1  line1
     2  line2
     3  line3
     4  1dup3x		# bad dup
     5  2dup2x		# bad dup
     6  line4
     7  1dup3x		# bad dup
     8  line5
     9  line6
    10  1dup3x		# good dup
    11  line7
    12  line8
    13  2dup2x		# good dup
    14  line9

$ perl -ne '$line{$_} = $.; END { for (sort{$line{$a}<=>$line{$b}} keys %line) {print} }' /tmp/sample.hist | cat -n
     1  line1
     2  line2
     3  line3
     4  line4
     5  line5
     6  line6
     7  1dup3x		# now unique
     8  line7
     9  line8
    10  2dup2x		# now unique
    11  line9


### Neat, but not what I needed
$ perl -ne '$line{$_}++ or print' /tmp/sample.hist | cat -n
     1  line1
     2  line2
     3  line3
     4  1dup3x
     5  2dup2x
     6  line4
     7  line5
     8  line6
     9  line7
    10  line8
    11  line9


Enjoy,
JP
----------------------------|:::======|-------------------------------
JP Vossen, CISSP            |:::======|      http://bashcookbook.com/
My Account, My Opinions     |=========|      http://www.jpsdomain.org/
----------------------------|=========|-------------------------------
"Microsoft Tax" = the additional hardware & yearly fees for the add-on
software required to protect Windows from its own poorly designed and
implemented self, while the overhead incidentally flattens Moore's Law.
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug