JP Vossen on 12 Oct 2012 15:03:49 -0700 |
[PLUG] Perl one-liner to remove duplicates without changing file order |
1) Preserve the order of the lines
2) Keep the last occurrence (not the first) of the dup

All of the above solutions break both of those constraints, but here is a Perl one-liner that works:

$ perl -ne '$line{$_} = $.; END { for (sort{$line{$a}<=>$line{$b}} keys %line) {print} }' /tmp/sample.hist
Breakdown (best viewed in monospace font):

perl -ne '                          # -n assume "while (<>) { ... }" loop around
                                    # prog, don't automatically print
                                    # -e <program> one line of program
  $line{$_} = $.;                   # Build %line hash with current input
                                    # line $_ as key and current input line
                                    # number $. as value
  END {                             # END {} block, executed only once
                                    # *after* the assumed while() loop
    for (                           # Loop through ...
      sort{$line{$a}<=>$line{$b}}   # ...while sorting by the *value*
                                    # (input line number) not the key...
      keys %line)                   # ...the %line hash
    {print} }'                      # and print the line (then end the END{} block)
  /tmp/sample.hist                  # from this file

The magic is in the %line hash. Since the hash key is the actual line I'm worried about, the program uses memory only for the size of the file plus a little overhead, but minus any duplicates. That's because when I hit a duplicate, the only change is that the input line number (hash value) is updated. So when I sort the hash *by input line number* I get only 1 copy of each line, in the place where it occurred last, which I print.
One other bit of magic is the sort{} block, where I give sort explicit instructions on *what* to sort and *how*. By default, it would do an alpha sort on the hash keys (sort keys %line), while I require a numeric sort (via the "spaceship" operator <=> (I swear I am not making this up)) on the hash value. In Perl that means: sort{$line{$a}<=>$line{$b}}. Or a reverse sort (read carefully, $a and $b swap places): sort{$line{$b}<=>$line{$a}}.
Caveat: 'perl -i' does *not* work with this, I'm not sure why. You need to do it the old-fashioned way:

$ cp -av /tmp/sample.hist /tmp/sample.hist.bak && perl -ne '$line{$_} = $.; END { for (sort{$line{$a}<=>$line{$b}} keys %line) {print} }' /tmp/sample.hist.bak > /tmp/sample.hist
Just for fun, this one will remove dups (including non-adjacent) but keep the first (not last, which I need) occurrence. It is even simpler but more obscure; it counts each input line in the hash, and prints the line only if it was not already there:

$ perl -ne '$line{$_}++ or print' /tmp/sample.hist

These are awesome examples of the power, subtlety and unreadability of Perl and Perlisms. :-) I mean seriously, does that not look like line noise?
--------------------------------------------------------------------------------
Annotated sample:

$ cat -n /tmp/sample.hist
     1  line1
     2  line2
     3  line3
     4  1dup3x    # bad dup
     5  2dup2x    # bad dup
     6  line4
     7  1dup3x    # bad dup
     8  line5
     9  line6
    10  1dup3x    # good dup
    11  line7
    12  line8
    13  2dup2x    # good dup
    14  line9

$ perl -ne '$line{$_} = $.; END { for (sort{$line{$a}<=>$line{$b}} keys %line) {print} }' /tmp/sample.hist | cat -n
     1  line1
     2  line2
     3  line3
     4  line4
     5  line5
     6  line6
     7  1dup3x    # now unique
     8  line7
     9  line8
    10  2dup2x    # now unique
    11  line9

### Neat, but not what I needed
$ perl -ne '$line{$_}++ or print' /tmp/sample.hist | cat -n
     1  line1
     2  line2
     3  line3
     4  1dup3x
     5  2dup2x
     6  line4
     7  line5
     8  line6
     9  line7
    10  line8
    11  line9

Enjoy,
JP
----------------------------|:::======|-------------------------------
JP Vossen, CISSP            |:::======|      http://bashcookbook.com/
My Account, My Opinions     |=========|      http://www.jpsdomain.org/
----------------------------|=========|-------------------------------
"Microsoft Tax" = the additional hardware & yearly fees for the add-on
software required to protect Windows from its own poorly designed and
implemented self, while the overhead incidentally flattens Moore's Law.
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug