JP Vossen on 23 Nov 2008 12:54:00 -0800



Re: [PLUG] finding duplicate files


> Date: Sat, 22 Nov 2008 17:43:24 -0500
> From: "K.S. Bhaskar" <bhaskar@bhaskars.com>
> 
> Let's see if we can do it in 1 line...  How about something like:
> 
>   find . -type f -exec md5sum {} \; | sort | uniq -d -w 32

I'd skip the find [1], since this does the same unless you need to 
recurse into directories, which Art didn't say:
	md5sum * | sort | uniq -d -w 32


But this method does not work for me.  At least, I don't think it does. 
My solution gives you a list of files to delete, skipping the first of 
the duplicates.  This solution (either as written or my shorter one) 
prints only the first line of each set of duplicates.  So you either miss 
some if a file has more than one dup, or you have to keep running it 
until it doesn't find anything else to delete.
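
For illustration, here is a minimal sketch of the "list everything except the first of each set" idea.  It is not necessarily the solution from my earlier message, and the function name list_extra_dups is made up for this example.  It assumes GNU md5sum's output format (32 hex characters, two spaces, then the name) and file names without newlines or backslashes:

```shell
# Print the names of all files whose content duplicates an earlier file,
# i.e. every member of a duplicate set except the first one in sort order.
# Assumption: GNU md5sum output is "<32-char hash>  <name>", so the name
# starts at column 35.
list_extra_dups() {
    md5sum -- "$@" | LC_ALL=C sort | awk 'seen[$1]++ { print substr($0, 35) }'
}
```

Run it as "list_extra_dups *" and read the list before feeding it to rm.  Which copy counts as "first" is just whichever name sorts first within each hash, so it may not be the one you want to keep.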

'uniq -D' helps that, but then I don't see how to break out different 
sets of dups so you can clean up the ones you don't want:
	$ md5sum * | sort | uniq -D -w 32
	484bade6c8b3c8147cc03728af90b096  dup1
	484bade6c8b3c8147cc03728af90b096  orig1
	c04e01c8718c20c983f6cbf6f07911f8  dup2.a
	c04e01c8718c20c983f6cbf6f07911f8  dup2.b
	c04e01c8718c20c983f6cbf6f07911f8  dup2.c
	c04e01c8718c20c983f6cbf6f07911f8  dup2.d
	c04e01c8718c20c983f6cbf6f07911f8  orig2
	
	$ md5sum *
	484bade6c8b3c8147cc03728af90b096  dup1
	c04e01c8718c20c983f6cbf6f07911f8  dup2.a
	c04e01c8718c20c983f6cbf6f07911f8  dup2.b
	c04e01c8718c20c983f6cbf6f07911f8  dup2.c
	c04e01c8718c20c983f6cbf6f07911f8  dup2.d
	484bade6c8b3c8147cc03728af90b096  orig1
	c04e01c8718c20c983f6cbf6f07911f8  orig2
	c7da4fb9f3d537c45b12d3431ed21864  single1
	c101e03d872787713f0d6ae169f616cb  single2
	
	$ md5sum * | sort | uniq -d -w 32
	484bade6c8b3c8147cc03728af90b096  dup1
	c04e01c8718c20c983f6cbf6f07911f8  dup2.a
	
	$ md5sum * | sort | uniq -D -w 32
	484bade6c8b3c8147cc03728af90b096  dup1
	484bade6c8b3c8147cc03728af90b096  orig1
	c04e01c8718c20c983f6cbf6f07911f8  dup2.a
	c04e01c8718c20c983f6cbf6f07911f8  dup2.b
	c04e01c8718c20c983f6cbf6f07911f8  dup2.c
	c04e01c8718c20c983f6cbf6f07911f8  dup2.d
	c04e01c8718c20c983f6cbf6f07911f8  orig2
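
One possible way to break the -D output into separate sets is GNU uniq's long form of that option, --all-repeated=separate, which puts an empty line between groups.  This is a sketch assuming GNU coreutils; the function name group_dups is only for illustration:

```shell
# Print every line whose hash repeats, with an empty line between sets.
# --all-repeated=separate is a GNU uniq extension; it acts like -D but
# delimits each group of identical hashes.
group_dups() {
    md5sum -- "$@" | LC_ALL=C sort | uniq -w 32 --all-repeated=separate
}
```

Run as "group_dups *" on the test files above, I would expect the same seven lines as the -D output, with one empty line between the dup1/orig1 set and the dup2.*/orig2 set.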


I must admit I either had forgotten about or wasn't aware of the uniq -w 
argument.  That's very handy.  And I noticed -D when re-reading the man 
page.  And I certainly tend to come up with complicated solutions. 
Though in this case I started simple and kept adding testable layers 
until I got a solution. :-)


Later,
JP

__________________________
[1] Note that 'find ... -exec ... {} \;' will spawn a new process for 
each hit, which is very, very slow.  To pick a clearer and more common 
example, never do this:
	find ... -exec chmod 0775 {} \;
do this:
	find ... -print0 | xargs -0 chmod 0775

The -print0 and -0 options use the NUL character as the separator between 
file names, which works around things like spaces in file/dir names.
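
Applied to the duplicate-finding case, the same pattern might look like the sketch below.  It assumes GNU findutils and coreutils (-print0, -0 and -r are not in POSIX), and the function name find_dups_recursive is made up for this example:

```shell
# Hash every regular file under a directory (default: the current one),
# letting xargs batch many names into each md5sum call, then print all
# lines whose hash repeats.
# -r (GNU xargs) skips running md5sum at all when find matches nothing.
find_dups_recursive() {
    find "${1:-.}" -type f -print0 | xargs -0 -r md5sum | LC_ALL=C sort | uniq -D -w 32
}
```

If you'd rather avoid xargs, 'find ... -exec md5sum {} +' (note the + instead of \;) also passes many file names to each md5sum invocation.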

----------------------------|:::======|-------------------------------
JP Vossen, CISSP            |:::======|        jp{at}jpsdomain{dot}org
My Account, My Opinions     |=========|      http://www.jpsdomain.org/
----------------------------|=========|-------------------------------
"Microsoft Tax" = the additional hardware & yearly fees for the add-on
software required to protect Windows from its own poorly designed and
implemented self, while the overhead incidentally flattens Moore's Law.
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug