JP Vossen on 22 Nov 2008 12:06:51 -0800
> Date: Sat, 22 Nov 2008 13:32:17 -0500
> From: Art Alexion <art.alexion@gmail.com>
>
> I have a directory with a lot of files, a number of which are
> identical, except for filename. What is the most efficient way to
> find (and ultimately delete) the duplicates?

How about this? TEST, TEST, TEST first!

# Assumes a recent version of bash [for nested $()]
# BACKUP, then capture md5 [1] hashes (don't put the output file in
# your CWD or you may recurse!)
$ cd /path/to/dir
$ cd ..
$ cp -a dir dir.BACKUP
$ cd dir
$ md5sum * > /tmp/md5s

# Display a list of the duplicate hashes
$ cut -d' ' -f1 /tmp/md5s | sort | uniq -d

# Display a list of ALL the duplicate files (don't delete these!)
$ for hash in $(cut -d' ' -f1 /tmp/md5s | sort | uniq -d); do echo; \
    grep "$hash" /tmp/md5s | cut -d' ' -f3-; done

# Now you have a blank-line-separated list of dups, and you can choose
# which to keep and which to nuke. One way is to keep the first and
# nuke the others:
$ for hash in $(cut -d' ' -f1 /tmp/md5s | sort | uniq -d); do echo; \
    grep "$hash" /tmp/md5s | cut -d' ' -f3- | tail -n+2; done

# TEST THIS using echo! ARE YOU SURE? Verify that the names of files
# you want to keep are not listed!
$ echo $(for hash in $(cut -d' ' -f1 /tmp/md5s | sort | uniq -d); \
    do echo; grep "$hash" /tmp/md5s | cut -d' ' -f3- | tail -n+2; done)

# When you are sure, replace echo with rm:
$ rm $(for hash in $(cut -d' ' -f1 /tmp/md5s | sort | uniq -d); \
    do echo; grep "$hash" /tmp/md5s | cut -d' ' -f3- | tail -n+2; done)

~~~~~~~~~~~~~~~~~~~~~~

Interesting commands:
* cut: -d' ' uses space as the delimiter, -f3- for fields 3 to the end
* uniq: -d shows only duplicated lines (hashes)
* tail: -n+2 starts at line 2 and goes to the end (i.e., skips line 1)
* $(): sub-shell, same as the legacy backticks ``, but those are harder
  to read and not nestable. I've nested here.
* for...done: takes each hash and greps for it, then gives you just the
  file part

This is a good one, I'll add it to the second edition of the _bash
Cookbook_, if/when. Let me know how you make out.

Later,
JP

_____________
[1] Yes, there are better/more secure hashing tools than md5, but md5
is almost always already there, esp. on Linux, and for this it doesn't
matter.

----------------------------|:::======|-------------------------------
JP Vossen, CISSP            |:::======|     jp{at}jpsdomain{dot}org
My Account, My Opinions     |=========|     http://www.jpsdomain.org/
----------------------------|=========|-------------------------------
"Microsoft Tax" = the additional hardware & yearly fees for the add-on
software required to protect Windows from its own poorly designed and
implemented self, while the overhead incidentally flattens Moore's Law.
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug
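One caveat worth noting about the `rm $(...)` step above: command substitution is word-split by the shell, so a filename like "my file.txt" would reach rm as two separate arguments. A minimal sketch of the same keep-first/nuke-rest loop using `read -r` instead, which survives spaces in names (it assumes /tmp/md5s already exists from the md5sum step above):

```shell
# Same logic as the rm $(...) version, but reading line by line with
# "read -r" so filenames containing spaces are not split apart.
cut -d' ' -f1 /tmp/md5s | sort | uniq -d | while read -r hash; do
    # md5sum lines look like "<hash>  <filename>", so anchor the
    # hash at the start of the line
    grep "^$hash " /tmp/md5s | cut -d' ' -f3- | tail -n +2 |
        while read -r file; do
            echo rm -- "$file"   # TEST with echo first; drop echo when sure
        done
done
```

As above, this prints the doomed filenames first; remove the echo only after verifying the list.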