[PLUG] Perl one-liner to remove duplicates without changing file order

JP Vossen on 12 Oct 2012 15:03:49 -0700

From: JP Vossen <jp@jpsdomain.org>
To: Philadelphia Linux User's Group Discussion List <plug@lists.phillylinux.org>
Subject: [PLUG] Perl one-liner to remove duplicates without changing file order
Date: Fri, 12 Oct 2012 18:03:44 -0400

I found myself having to process a "history" file and remove duplicates (which may or may not be adjacent). Normally I'd use 'sort -u' or 'uniq' for that, or possibly some kind of hash, but I also needed to:

	1) Preserve the order of the lines
	2) Keep the last occurrence (not the first) of the dup

Each of the above solutions, as normally used, breaks at least one of those constraints, but here is a Perl one-liner that works:

	$ perl -ne '$line{$_} = $.; END { for (sort{$line{$a}<=>$line{$b}} keys %line) {print} }' /tmp/sample.hist
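For comparison, here is a small sketch of how 'sort -u' and 'uniq' behave on a file with a non-adjacent duplicate. The four-line sample and the variable names are made up for illustration; they are not from the history file discussed here.

```shell
# Tiny made-up sample with a non-adjacent duplicate ("b" on lines 1 and 3).
sample=$(printf 'b\na\nb\nc\n')

# sort -u removes the duplicate but reorders the lines.
sorted=$(printf '%s\n' "$sample" | LC_ALL=C sort -u)

# uniq keeps the order but only collapses *adjacent* duplicates,
# so both copies of "b" survive.
uniqed=$(printf '%s\n' "$sample" | uniq)

printf 'sort -u:\n%s\n\nuniq:\n%s\n' "$sorted" "$uniqed"
```

Neither result is "original order, last occurrence kept", which is what the one-liner produces.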


Breakdown (best viewed in monospace font):

	perl -ne '                    # -n: assume a "while (<>) { ... }" loop
	                              #     around the program, don't
	                              #     automatically print
	                              # -e <program>: one line of program
	$line{$_} = $.;               # Build %line hash with current input
	                              # line $_ as key and current input line
	                              # number $. as value
	END {                         # END {} block, executed only once
	                              # *after* the assumed while() loop
	for (                         # Loop through ...
	sort{$line{$a}<=>$line{$b}}   # ...while sorting by the *value*
	                              # (input line number) not the key...
	keys %line)                   # ...the keys of the %line hash
	{print} }'                    # and print the line (then end the
	                              # END{} block)
	/tmp/sample.hist              # from this file

The magic is in the %line hash. Since the hash key is the actual line I'm worried about, the program uses memory only for the size of the file plus a little overhead, but minus any duplicates. That's because when I hit a duplicate, the only change is that the input line number (hash value) is updated. So when I sort the hash *by input line number* I get only 1 copy of each line, in the place where it occurred last, which I print.

One other bit of magic is the sort{} command, where I give sort explicit instructions on *what* to sort *how*. By default, it would do an alpha sort on the hash key (sort keys %line), while I require a numeric sort (via the "spaceship" operator <=> (I swear I am not making this up)) on the hash value. In Perl that means: sort{$line{$a}<=>$line{$b}}. Or a reverse sort (read carefully): sort{$line{$b}<=>$line{$a}}.

Caveat: 'perl -i' does *not* work with this; I'm not sure why. You need to do it the old-fashioned way:

	$ cp -av /tmp/sample.hist /tmp/sample.hist.bak && perl -ne '$line{$_} = $.; END { for (sort{$line{$a}<=>$line{$b}} keys %line) {print} }' /tmp/sample.hist.bak > /tmp/sample.hist

Just for fun, this one will remove dups (including non-adjacent) but keep the first (not the last, which is what I need) occurrence. It is even simpler but more obscure; it adds the input line to the hash and, if the line was not already there, also prints it:

	perl -ne '$line{$_}++ or print' /tmp/sample.hist

These are awesome examples of the power, subtlety and unreadability of Perl and Perlisms. :-) I mean seriously, does that not look like line noise?



--------------------------------------------------------------------------------
Annotated sample:

$ cat -n /tmp/sample.hist
     1  line1
     2  line2
     3  line3
     4  1dup3x		# bad dup
     5  2dup2x		# bad dup
     6  line4
     7  1dup3x		# bad dup
     8  line5
     9  line6
    10  1dup3x		# good dup
    11  line7
    12  line8
    13  2dup2x		# good dup
    14  line9

$ perl -ne '$line{$_} = $.; END { for (sort{$line{$a}<=>$line{$b}} keys %line) {print} }' /tmp/sample.hist | cat -n
     1  line1
     2  line2
     3  line3
     4  line4
     5  line5
     6  line6
     7  1dup3x		# now unique
     8  line7
     9  line8
    10  2dup2x		# now unique
    11  line9


### Neat, but not what I needed
$ perl -ne '$line{$_}++ or print' /tmp/sample.hist | cat -n
     1  line1
     2  line2
     3  line3
     4  1dup3x
     5  2dup2x
     6  line4
     7  line5
     8  line6
     9  line7
    10  line8
    11  line9


Enjoy,
JP
----------------------------|:::======|-------------------------------
JP Vossen, CISSP            |:::======|      http://bashcookbook.com/
My Account, My Opinions     |=========|      http://www.jpsdomain.org/
----------------------------|=========|-------------------------------
"Microsoft Tax" = the additional hardware & yearly fees for the add-on
software required to protect Windows from its own poorly designed and
implemented self, while the overhead incidentally flattens Moore's Law.

Follow-Ups:
- Re: [PLUG] Perl one-liner to remove duplicates without changing file order
  - From: Frank Szczerba <frank@szczerba.net>
