David A. Harding on 23 Nov 2008 16:02:44 -0800


Re: [PLUG] finding duplicate files


On Sun, Nov 23, 2008 at 06:00:00PM -0500, Matthew Rosewarne wrote:
> Instead of hacking together some script, just use finddup from the
> "perforate" package. 

I agree with Mr. Rosewarne: using an existing command is probably the
best solution.  finddup is written in Perl, which saves it from most of
bash's filename quirks, and it uses the same basic method J.P. and I
used (sketched in shell below):

    1. Get a list of files
    2. Look at the file size (J.P. and I didn't do this)
    3. Compute MD5 checksums for files with the same file size
    4. Remove files with the same MD5 checksum
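
If you'd rather hack it together anyway, here's that sketch.  It
assumes GNU find, awk, coreutils, and xargs; it prints groups of
probable duplicates rather than deleting them; and filenames containing
newlines will confuse it, which is exactly the sort of quirk finddup's
Perl avoids:

    #!/bin/sh
    find . -type f -printf '%s %p\n' |  # 1. list files, prefixed by size
      sort -n |                         #    put equal sizes side by side
      awk '$1 == prev { if (saved != "") { print saved; saved = "" }; print }
           $1 != prev { prev = $1; saved = $0 }' |  # 2. keep repeated sizes
      cut -d' ' -f2- |                  #    drop the size column
      xargs -d '\n' md5sum |            # 3. checksum the candidates
      sort | uniq -w32 --all-repeated=separate  # 4. group equal checksums

The last stage only prints the duplicate groups, separated by blank
lines; deciding which copy to rm is left to the reader, which is one
more argument for just installing perforate.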

Step two makes finddup run a lot faster on large files than J.P.'s or
my code will, and it also adds a statistically insignificant amount of
extra protection against accidental deletions: two files can have
different contents but share an MD5 checksum; if that happens, they
probably won't share the same file size, so finddup won't delete them.

But step two also means finddup won't find a duplicate file if the
original file is sparse and the duplicate is filled, or vice versa. I
find that deliciously ironic for a program in the perforate package. :)
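
You can see the sparse-file case with a quick experiment (GNU dd, cp,
du, and md5sum; the names "sparse" and "filled" are made up, and I'm
assuming finddup's size test looks at allocated blocks, as du reports,
rather than byte length):

    dd if=/dev/null of=sparse bs=1M seek=10  # 10 MiB hole, no data blocks
    cp --sparse=never sparse filled          # same bytes, fully allocated
    ls -l sparse filled                      # identical byte lengths
    du sparse filled                         # very different disk usage
    md5sum sparse filled                     # identical checksums

Both files hold the same ten mebibytes of zeros, so md5sum agrees they
are duplicates, but a size test based on disk usage sees two very
different files and never compares them.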

A possible disadvantage of finddup is that its error messages are
written in German.

-Dave
-- 
David A. Harding	    Website:  http://dtrt.org/
1 (609) 997-0765	      Email:  dave@dtrt.org
			Jabber/XMPP:  dharding@jabber.org
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug