kamiza103 on 22 Nov 2008 12:50:54 -0800



Re: [PLUG] finding duplicate files


Did you just write that while checking your mail?!


JP Vossen <jp@jpsdomain.org> wrote:
> Date: Sat, 22 Nov 2008 13:32:17 -0500
> From: Art Alexion
>
> I have a directory with a lot of files, a number of which are
> identical, except for filename. What is the most efficient way to
> find (and ultimately delete) the duplicates?

How about this? TEST, TEST, TEST first!

# Assumes a recent version of bash [for nested $()]
# BACKUP, then capture md5 [1] hashes (don't put the output file in your
# CWD or you may recurse!)
$ cd /path/to/dir
$ cd ..
$ cp -a dir dir.BACKUP
$ cd dir
$ md5sum * > /tmp/md5s

# Display a list of the duplicate hashes
$ cut -d' ' -f1 /tmp/md5s | sort | uniq -d

# Display a list of ALL the duplicate files (don't delete these!)
$ for hash in $(cut -d' ' -f1 /tmp/md5s | sort | uniq -d); do echo; \
grep "$hash" /tmp/md5s | cut -d' ' -f3-; done

# Now you have a blank-line-separated list of dups, and you can choose
# which to keep and which to nuke. One way is to keep the first and nuke
# the others:
$ for hash in $(cut -d' ' -f1 /tmp/md5s | sort | uniq -d); do echo; \
grep "$hash" /tmp/md5s | cut -d' ' -f3- | tail -n+2; done

# TEST THIS using echo! ARE YOU SURE? Verify that the names of files
# you want to keep are not listed!
$ echo $(for hash in $(cut -d' ' -f1 /tmp/md5s | sort | uniq -d); \
do echo; grep "$hash" /tmp/md5s | cut -d' ' -f3- | tail -n+2; done)

# When you are sure, replace echo with rm (this relies on word
# splitting, so it breaks on file names containing whitespace):
$ rm $(for hash in $(cut -d' ' -f1 /tmp/md5s | sort | uniq -d); \
do echo; grep "$hash" /tmp/md5s | cut -d' ' -f3- | tail -n+2; done)
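
The rm above splits its argument list on whitespace, so a name such as
"my file.txt" would be passed to rm as two arguments. A
whitespace-tolerant variant might look like the sketch below. It is not
from the original post: the list_extra_dups helper, the scratch
directory, and the sample file names are made up for illustration, and
it assumes GNU md5sum output (32 hex digits, two separator characters,
then the name) and file names without newlines or backslashes.

```shell
#!/usr/bin/env bash
# Hypothetical demo data in a scratch directory, so nothing real is touched.
demo_dir=$(mktemp -d)
printf 'same\n'  > "$demo_dir/a one.txt"
printf 'same\n'  > "$demo_dir/b two.txt"
printf 'other\n' > "$demo_dir/c.txt"

# list_extra_dups DIR (hypothetical helper): print every file in DIR whose
# hash already appeared on an earlier line, i.e. all but the first file of
# each duplicate group. The name starts at column 35 of md5sum's output.
list_extra_dups() {
    ( cd "$1" && md5sum -- * ) | awk 'seen[$1]++ { print substr($0, 35) }'
}

# Dry run first: only show what would be removed.
extras=$(list_extra_dups "$demo_dir")
printf 'would remove: %s\n' "$extras"

# Real run: read one name per line so embedded spaces survive intact.
list_extra_dups "$demo_dir" | while IFS= read -r name; do
    rm -- "$demo_dir/$name"
done
remaining=$(ls "$demo_dir")
```

With the sample files, only "b two.txt" is listed and removed; "a one.txt"
is kept because it is the first file seen with that hash.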


~~~~~~~~~~~~~~~~~~~~~~
Interesting commands:

* cut: -d' ' uses space as the delimiter, -f3- for fields 3 to the end
(md5sum puts two spaces after the hash, so field 2 is empty)
* uniq: -d shows only duplicated lines (hashes)
* tail: -n+2 starts at line 2 and goes to the end (i.e., skips line 1)
* $(): command substitution (runs in a sub-shell), same as legacy
backticks ``, but those are harder to read and need escaping to nest.
I've nested here.
* for...done: Takes each hash and greps for it, then gives you just the
file part
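
To see what each of those pieces does in isolation, here is a small
sketch run against a made-up hash list instead of real files. The hash
values, the file names, and the variable names are all hypothetical; only
the "hash, two spaces, name" layout matches what md5sum writes.

```shell
#!/usr/bin/env bash
# Fake hashes (hypothetical values, 32 characters like real md5 output).
h1=aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
h2=bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb

# Build a stand-in for /tmp/md5s; printf reuses the format for each pair.
md5s=$(mktemp)
printf '%s  %s\n' "$h1" one.txt "$h2" two.txt "$h1" three.txt > "$md5s"

# cut + sort + uniq -d: keep only hashes that occur more than once.
dup_hashes=$(cut -d' ' -f1 "$md5s" | sort | uniq -d)

# grep + cut -f3-: every file name carrying the duplicated hash.
all_names=$(grep "$dup_hashes" "$md5s" | cut -d' ' -f3-)

# tail -n+2: drop the first name, leaving only the extras.
extras=$(grep "$dup_hashes" "$md5s" | cut -d' ' -f3- | tail -n+2)

printf 'dup hash: %s\nall: %s\nextras: %s\n' "$dup_hashes" "$all_names" "$extras"
rm -f "$md5s"
```

Here one.txt and three.txt share a hash, so both show up in the "all"
list, and only three.txt survives the tail -n+2 step.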


This is a good one, I'll add it to the second edition of the _bash
Cookbook_, if/when. Let me know how you make out.

Later,
JP
_____________
[1] Yes, there are better/more secure hashing tools than md5, but md5 is
almost always already there, esp. on Linux, and for this it doesn't matter.
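
If a different hash is preferred anyway, the same cut-based pipeline
should carry over unchanged, since the GNU *sum tools share one output
layout. A minimal sketch, assuming GNU coreutils sha256sum is installed;
the hash_cmd variable, the scratch directory, and the file names are
made up for illustration:

```shell
#!/usr/bin/env bash
hash_cmd=sha256sum    # assumption: GNU coreutils; md5sum works here too

# Hypothetical scratch data: p.txt and q.txt are identical, r.txt differs.
work=$(mktemp -d)
printf 'x\n' > "$work/p.txt"
printf 'x\n' > "$work/q.txt"
printf 'y\n' > "$work/r.txt"

# Hash everything, then reuse the field logic from the md5 version.
sums=$(cd "$work" && "$hash_cmd" -- *)
dup_count=$(printf '%s\n' "$sums" | cut -d' ' -f1 | sort | uniq -d | wc -l)
names=$(printf '%s\n' "$sums" | cut -d' ' -f3-)

printf 'duplicate hashes: %s\n' "$dup_count"
rm -rf "$work"
```

Only the digest gets longer; field 1 is still the hash and fields 3
onward are still the file name.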

----------------------------|:::======|-------------------------------
JP Vossen, CISSP |:::======| jp{at}jpsdomain{dot}org
My Account, My Opinions |=========| http://www.jpsdomain.org/
----------------------------|=========|-------------------------------
"Microsoft Tax" = the additional hardware & yearly fees for the add-on
software required to protect Windows from its own poorly designed and
implemented self, while the overhead incidentally flattens Moore's Law.
___________________________________________________________________________
Philadelphia Linux Users Group -- http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug

 

