| JP Vossen on 12 Oct 2012 15:03:49 -0700 |
| [PLUG] Perl one-liner to remove duplicates without changing file order |
1) Preserve the order of the lines
2) Keep the last occurrence (not the first) of the dup

All of the above solutions break both of those constraints, but here is a Perl one-liner that works:

$ perl -ne '$line{$_} = $.; END { for (sort{$line{$a}<=>$line{$b}} keys %line) {print} }' /tmp/sample.hist
Breakdown (best viewed in monospace font):
perl -ne '              # -n assume "while (<>) { ... }" loop around
                        #    prog, don't automatically print
                        # -e <program> one line of program
  $line{$_} = $.;       # Build %line hash with current input
                        # line $_ as key and current input line
                        # number $. as value
  END {                 # END {} block, executed only once
                        # *after* the assumed while() loop
    for (               # Loop through ...
      sort{$line{$a}<=>$line{$b}}
                        # ...while sorting by the *value*
                        # (input line number) not the key...
      keys %line)       # ...the %line hash
    {print} }'          # and print the line (then end the END{} block)
  /tmp/sample.hist      # from this file
The magic is in the %line hash. Since the hash key is the actual line
I'm worried about, the program uses memory only for the size of the file
plus a little overhead, but minus any duplicates. That's because when I
hit a duplicate, the only change is that the input line number (hash
value) is updated. So when I sort the hash *by input line number* I get
only 1 copy of each line, in the place where it occurred last, which I
print.
One other bit of magic is the sort{} block, where I give sort explicit instructions on *what* to sort and *how*. By default, sort does an alpha (string) sort on whatever list it is given, here the hash keys (sort keys %line), while I require a numeric sort (via the "spaceship" operator <=> (I swear I am not making this up)) on the hash value. In Perl that means: sort{$line{$a}<=>$line{$b}}. Or a reverse sort (read carefully, $a and $b are swapped): sort{$line{$b}<=>$line{$a}}.
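Caveat: 'perl -i' does *not* work with this. The likely reason is that -i only redirects print to the file while the assumed while() loop is reading it; by the time the END {} block runs, print is back on STDOUT, so the output lands on the terminal and the file is left empty. You need to do it the old fashioned way:

$ cp -av /tmp/sample.hist /tmp/sample.hist.bak && perl -ne '$line{$_} = $.; END { for (sort{$line{$a}<=>$line{$b}} keys %line) {print} }' /tmp/sample.hist.bak > /tmp/sample.hist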
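
A minimal sketch of wrapping that copy-then-redirect approach in a shell function, assuming mktemp is available. The function name dedup_keep_last_in_place is hypothetical and not from the original post. It writes the result to a temporary file first, then copies it back over the original:

```shell
# Hypothetical helper: deduplicate a file "in place", keeping the last
# occurrence of each line, without relying on perl -i.
dedup_keep_last_in_place() {
    file=$1
    # Refuse to run on a missing or unreadable file.
    [ -f "$file" ] && [ -r "$file" ] || return 1
    tmp=$(mktemp) || return 1
    # Same one-liner as in the post; output goes to the temp file first.
    if perl -ne '$line{$_} = $.; END { for (sort { $line{$a} <=> $line{$b} } keys %line) { print } }' "$file" > "$tmp"
    then
        # cat (rather than mv) keeps the original file's permissions and inode.
        cat "$tmp" > "$file"
    fi
    rm -f "$tmp"
}
```

It would be called as: dedup_keep_last_in_place /tmp/sample.hist. Unlike the cp -av version, this sketch keeps no backup copy.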
Just for fun, this one will remove dups (including non-adjacent) but keep the first (not last which I need) occurrence. It is even simpler but more obscure: it counts each input line in the hash, and prints the line only if it was not already there:
perl -ne '$line{$_}++ or print' /tmp/sample.hist
These are awesome examples of the power, subtlety and unreadability of
Perl and Perlisms. :-) I mean seriously, does that not look like line
noise?
--------------------------------------------------------------------------------
Annotated sample:
$ cat -n /tmp/sample.hist
     1  line1
     2  line2
     3  line3
     4  1dup3x   # bad dup
     5  2dup2x   # bad dup
     6  line4
     7  1dup3x   # bad dup
     8  line5
     9  line6
    10  1dup3x   # good dup
    11  line7
    12  line8
    13  2dup2x   # good dup
    14  line9
$ perl -ne '$line{$_} = $.; END { for (sort{$line{$a}<=>$line{$b}} keys %line) {print} }' /tmp/sample.hist | cat -n
     1  line1
     2  line2
     3  line3
     4  line4
     5  line5
     6  line6
     7  1dup3x   # now unique
     8  line7
     9  line8
    10  2dup2x   # now unique
    11  line9
### Neat, but not what I needed
$ perl -ne '$line{$_}++ or print' /tmp/sample.hist | cat -n
     1  line1
     2  line2
     3  line3
     4  1dup3x
     5  2dup2x
     6  line4
     7  line5
     8  line6
     9  line7
    10  line8
    11  line9
Enjoy,
JP
----------------------------|:::======|-------------------------------
JP Vossen, CISSP |:::======| http://bashcookbook.com/
My Account, My Opinions |=========| http://www.jpsdomain.org/
----------------------------|=========|-------------------------------
"Microsoft Tax" = the additional hardware & yearly fees for the add-on
software required to protect Windows from its own poorly designed and
implemented self, while the overhead incidentally flattens Moore's Law.