David A. Harding on 23 Nov 2008 05:48:41 -0800



Re: [PLUG] finding duplicate files


I know you encouraged a lot of testing to get here, but I think this is
bad:

On Sat, Nov 22, 2008 at 03:06:42PM -0500, JP Vossen wrote:
> # When you are sure, replace echo with rm:
> $ rm $(for hash in $(cut -d' ' -f1 /tmp/md5s | sort | uniq -d); \
>      do echo; grep "$hash" /tmp/md5s | cut -d' ' -f3- | tail -n+2; done)

A common example of an incorrectly removed file:

    $ echo "Hello, world" > foo	    # Make file foo
    $ touch bar 		    # File bar not same as file foo
    $ cp foo 'bar mitzvah'	    # File bar mitzvah same as file foo
    $ ls -Q
    "bar"  "bar mitzvah"  "foo"
    $ md5sum * > /tmp/md5s
    $ rm $(for hash in $(cut -d' ' -f1 /tmp/md5s | sort | uniq -d); do echo; grep "$hash" /tmp/md5s | cut -d' ' -f3- | tail -n+2; done) 
    rm: cannot remove `mitzvah': No such file or directory
    $ ls -Q
    "bar mitzvah"  "foo"	    # Whoops, we removed file bar instead 
				    #   of file bar mitzvah

An unlikely but disastrous possibility:

    $ echo "Hello, world" > foo	    # Make file foo
    $ cp foo '*'		    # File * same as file foo
    $ ls -Q
    "*"  "foo"
    $ md5sum * > /tmp/md5s                                            
    $ rm $(for hash in $(cut -d' ' -f1 /tmp/md5s | sort | uniq -d); do echo; grep "$hash" /tmp/md5s | cut -d' ' -f3- | tail -n+2; done)
    $ ls -Q
				    # Everything has been deleted

I suggest you use rm's -- option (standard across POSIX utilities, not
just GNU) when removing arbitrary filenames. This option prevents rm
from interpreting filenames as command line options. For example,
imagine removing a file named "-rf" [*].
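Outside the md5sum pipeline, the -- behavior is easy to demonstrate on
its own (a small sketch in a scratch directory, with made-up filenames):

```shell
# Sketch: a file literally named "-rf" looks like options to rm.
cd "$(mktemp -d)"            # scratch directory so nothing real is at risk
touch ./-rf victim           # "./-rf" creates the file without confusing touch

rm -rf 2>/dev/null           # without --, rm parses "-rf" as options; file survives
rm -- -rf                    # with --, "-rf" is treated as an operand and removed
```

Prefixing the name with ./ (rm ./-rf) gives the same protection and also
works with commands that do not support --.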

I also suggest you use while-read loops for file names. Using the read
builtin (with IFS= so leading whitespace survives, and -r so backslashes
do) lets us work with whole lines. For example, a rewrite of J.P.'s
code using a while-read loop follows:

    $ cut -d' ' -f1 /tmp/md5s | sort | uniq -d | while read hash ; do \
    grep "$hash" /tmp/md5s | cut -d' ' -f3- | tail -n+2 | while IFS= read -r duplicate_file ; \
    do rm -- "$duplicate_file" ; done ; done

Or in a somewhat more readable format:

cut -d' ' -f1 /tmp/md5s | sort | uniq -d | while read hash
do
    grep "$hash" /tmp/md5s | cut -d' ' -f3- | tail -n+2 \
    | while IFS= read -r duplicate_file
    do
	rm -- "$duplicate_file"
    done
done
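To check my own work, here is the loop run end to end in a scratch
directory (filenames here are invented). Note which copy is kept:
tail -n+2 preserves whichever duplicate sorts first in the glob.

```shell
cd "$(mktemp -d)"
echo "Hello, world" > "keep me"    # original; name contains a space
cp "keep me" "z duplicate"         # identical content, sorts after the original
echo "something else" > unique     # unique content, must not be touched

md5sum * > ./md5s                  # hash files in glob (sorted) order

cut -d' ' -f1 ./md5s | sort | uniq -d | while read hash
do
    grep "$hash" ./md5s | cut -d' ' -f3- | tail -n+2 \
    | while IFS= read -r duplicate_file
    do
        rm -- "$duplicate_file"
    done
done
```

On this run the only duplicated hash belongs to "keep me" and
"z duplicate"; grep lists them in file order, tail -n+2 drops the first,
so only "z duplicate" is removed and "keep me" and unique survive.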

Unless you plan on removing file names starting with a dash, I suggest
you change the rm line to the following line:

	test -f "$duplicate_file" && rm -- "$duplicate_file"

The test may catch something I didn't anticipate.
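As a sketch of what the guard buys (contrived names, and the same &&
line spelled as if/then so a failed test doesn't read as an error):

```shell
cd "$(mktemp -d)"
mkdir subdir                     # a directory that could share a name in a stale list
: > real-file                    # an ordinary empty file

for duplicate_file in subdir already-gone real-file
do
    if test -f "$duplicate_file"
    then
        rm -- "$duplicate_file"  # runs only for real-file
    fi
done
```

The directory and the nonexistent name are skipped silently; only the
regular file is removed.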

> * $()	:sub-shell, legacy as backticks ``, but those are harder to read 
> and not nestable.  I've nested here.

Technically, backticks are nestable (even in POSIX shell), but I'm pretty
sure they can only be parsed by a computer.  For example:

    $ echo `echo \`echo \\\`echo foo\\\` bar\` baz` quux
    foo bar baz quux

-Dave

[*] Argumentum ex concessis: I couldn't show a -rf example because
    J.P.'s md5sum glob tried using the -rf file as its command line
    options.
-- 
David A. Harding	    Website:  http://dtrt.org/
1 (609) 997-0765	      Email:  dave@dtrt.org
			Jabber/XMPP:  dharding@jabber.org
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug