Re: [PLUG] finding duplicate files

 Date: Sat, 22 Nov 2008 13:32:17 -0500
 From: Art Alexion
 > I have a directory with a lot of files, a number of which are
 > identical, except for filename.  What is the most efficient way to
 > find (and ultimately delete) the duplicates?

How about this?  TEST, TEST, TEST first!

# Assumes a recent version of bash [for nested $()]
# BACKUP, then capture md5 [1] hashes (don't put the output file in your
# CWD or you may recurse!)
$ cd /path/to/dir
$ cd ..
$ cp -a dir dir.BACKUP
$ cd dir
$ md5sum * > /tmp/md5s

# Display a list of the duplicate hashes
$ cut -d' ' -f1 /tmp/md5s | sort | uniq -d

# Display a list of ALL the duplicate files (don't delete these!)
$ for hash in $(cut -d' ' -f1 /tmp/md5s | sort | uniq -d); do echo; \
     grep "$hash" /tmp/md5s | cut -d' ' -f3-; done
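
The loop above re-reads /tmp/md5s once for every duplicated hash.  For a
big directory, a single sorted pass may be quicker.  What follows is only a
minimal sketch of that idea: group_duplicates is a made-up name, not part of
any tool, and it assumes the usual md5sum line layout (hash, two separator
characters, then the filename).

```shell
# group_duplicates (hypothetical helper): print blank-line-separated groups
# of filenames that share a hash.  $1 is a file of md5sum output.
group_duplicates() {
    # Sorting puts identical hashes on adjacent lines.
    LC_ALL=C sort "$1" | awk '{
        hash = $1 ""                          # "" forces a string comparison
        name = substr($0, length(hash) + 3)   # skip hash + 2 separator chars
        if (hash == prev_hash) {
            # First repeat of this hash: open a group with the earlier name.
            if (!in_group) { print ""; print prev_name; in_group = 1 }
            print name
        } else {
            in_group = 0
        }
        prev_hash = hash
        prev_name = name
    }'
}
```

Usage would be: group_duplicates /tmp/md5s.  Within a group the names come
out in sort order, which may differ from the order md5sum listed them in.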

# Now you have a blank-line-separated list of dups, and you can choose
# which to keep and which to nuke.  One way is to keep the first and nuke
# the others; this lists only the ones that would be nuked:
$ for hash in $(cut -d' ' -f1 /tmp/md5s | sort | uniq -d); do echo; \
     grep "$hash" /tmp/md5s | cut -d' ' -f3- | tail -n+2; done

# TEST THIS using echo!  ARE YOU SURE?  Verify that the names of files
# you want to keep are not listed!
$ echo $(for hash in $(cut -d' ' -f1 /tmp/md5s | sort | uniq -d); \
     do echo; grep "$hash" /tmp/md5s | cut -d' ' -f3- | tail -n+2; done)

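# When you are sure, replace echo with rm (note: this breaks on filenames
# containing spaces, because the shell splits the list on whitespace):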
$ rm $(for hash in $(cut -d' ' -f1 /tmp/md5s | sort | uniq -d); \
     do echo; grep "$hash" /tmp/md5s | cut -d' ' -f3- | tail -n+2; done)
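
Because rm $(...) hands the whole list to the shell for word splitting, a
name like "my copy.txt" would reach rm as two arguments.  Here is a minimal
sketch of one way around that.  It keeps the same pipeline but reads the
names one line at a time.  for_each_extra_copy is a hypothetical name, the
grep is anchored to the start of the line as an extra precaution, and names
containing newlines are still not handled.

```shell
# for_each_extra_copy (hypothetical helper): run a command once per extra
# copy, passing the filename as a single argument.
#   $1      file of md5sum output
#   $2...   command to run, e.g. "echo" for a dry run or "rm --" for real
for_each_extra_copy() {
    md5s=$1
    shift
    for hash in $(cut -d' ' -f1 "$md5s" | sort | uniq -d); do
        # Same pipeline as above: all names for this hash, minus the first.
        grep "^$hash " "$md5s" | cut -d' ' -f3- | tail -n +2 |
        while IFS= read -r name; do
            "$@" "$name"    # quoted, so embedded spaces survive
        done
    done
}
```

Dry run first with: for_each_extra_copy /tmp/md5s echo
Then, when you are sure: for_each_extra_copy /tmp/md5s rm --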

Interesting commands:

* cut:	-d' ' uses space as the delimiter, -f3- for fields 3 to the end
* uniq:	-d shows only duplicated lines (hashes)
* tail:	-n+2 starts at line 2 and goes to the end (i.e., skips line 1)
* $():	command substitution (runs in a sub-shell); same as the legacy
backticks ``, but those are harder to read and awkward to nest.  I've
nested here.
* for...done:	Takes each hash and greps for it, then gives you just the
file part
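
To see what each of those commands contributes, here is a small sketch that
runs them on three made-up lines instead of real md5sum output.  The short
"hashes" (aaa, bbb) and the filenames are placeholders for illustration only.

```shell
# Three fake "hash  filename" lines in md5sum's layout (two spaces between).
sample=$(printf '%s\n' 'aaa  one.txt' 'bbb  two.txt' 'aaa  my copy.txt')

# cut -f1 keeps the hash column; uniq -d keeps only hashes seen twice or more.
dup_hashes=$(printf '%s\n' "$sample" | cut -d' ' -f1 | sort | uniq -d)
echo "$dup_hashes"

# Field 2 is the empty string between the two spaces, so the name starts at
# field 3; "-f3-" keeps the rest of the line, including later spaces.
names=$(printf '%s\n' "$sample" | grep "^$dup_hashes " | cut -d' ' -f3-)
echo "$names"

# tail -n +2 drops the first name, leaving the copies that could be removed.
extras=$(printf '%s\n' "$names" | tail -n +2)
echo "$extras"
```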

This is a good one, I'll add it to the second edition of the _bash 
Cookbook_, if/when.  Let me know how you make out.

[1] Yes, there are better/more secure hashing tools than md5, but md5 is 
almost always already there, esp. on Linux, and for this it doesn't matter.

