Re: [PLUG] finding duplicate files
|
I know you encouraged a lot of testing to get here, but I think this is
bad:
On Sat, Nov 22, 2008 at 03:06:42PM -0500, JP Vossen wrote:
> # When you are sure, replace echo with rm:
> $ rm $(for hash in $(cut -d' ' -f1 /tmp/md5s | sort | uniq -d); \
> do echo; grep "$hash" /tmp/md5s | cut -d' ' -f3- | tail -n+2; done)
A common example of an incorrectly removed file:
$ echo "Hello, world" > foo # Make file foo
$ touch bar # File bar not same as file foo
$ cp foo 'bar mitzvah' # File bar mitzvah same as file foo
$ ls -Q
"foo" "bar" "bar mitzvah"
$ md5sum * > /tmp/md5s
$ rm $(for hash in $(cut -d' ' -f1 /tmp/md5s | sort | uniq -d); do echo; grep "$hash" /tmp/md5s | cut -d' ' -f3- | tail -n+2; done)
rm: cannot remove `mitzvah': No such file or directory
$ ls -Q
"bar mitzvah" "foo" # Whoops, we removed file bar instead
# of file bar mitzvah
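The failure comes from word splitting: the shell breaks the unquoted
$(...) output on whitespace before rm ever sees it. A minimal sketch
(the file name is just an illustration):

```shell
# An unquoted command substitution is split on $IFS, so one
# file name containing a space becomes two rm arguments.
set -- $(printf '%s\n' 'bar mitzvah')
echo "$#"   # 2 -- the single name became "bar" and "mitzvah"
```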
An unlikely but disastrous possibility:
$ echo "Hello, world" > foo # Make file foo
$ cp foo '*' # File * same as file foo
$ ls -Q
"*" "foo"
$ md5sum * > /tmp/md5s
$ rm $(for hash in $(cut -d' ' -f1 /tmp/md5s | sort | uniq -d); do echo; grep "$hash" /tmp/md5s | cut -d' ' -f3- | tail -n+2; done)
$ ls -Q
# Everything has been deleted
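After word splitting, each resulting word also undergoes pathname
expansion, which is why a file literally named * takes the whole
directory with it. A sketch in a throwaway directory:

```shell
# A word produced by an unquoted $(...) is also glob-expanded,
# so a file named '*' matches every name in the directory.
dir=$(mktemp -d)
cd "$dir"
touch '*' foo bar
set -- $(printf '%s\n' '*')   # unquoted: the '*' word globs
echo "$#"                     # 3 -- rm would have seen every file
cd / && rm -rf "$dir"
```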
I suggest you use rm's -- option (the standard POSIX end-of-options
marker, not just a GNU extension) when removing arbitrary filenames.
It stops rm from interpreting filenames that begin with a dash as
command line options. For example, imagine removing a file named
"-rf" [*].
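To see it in action, in a scratch directory (a sketch, assuming GNU
coreutils rm):

```shell
# Without --, rm parses a file named "-rf" as the options -r -f.
dir=$(mktemp -d)
cd "$dir"
touch -- '-rf'
rm '-rf' 2>/dev/null || true   # read as options; the file survives
test -e './-rf' && echo "still there"
rm -- '-rf'                    # -- ends option parsing
test -e './-rf' || echo "gone"
cd / && rmdir "$dir"
```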
I also suggest you use while-read loops for file names. The read
builtin lets us work with whole lines; adding IFS= and the -r flag
keeps read from trimming leading or trailing whitespace and from
mangling backslashes in names. For example, a rewrite of J.P.'s
code using a while-read loop follows:
$ cut -d' ' -f1 /tmp/md5s | sort | uniq -d | while IFS= read -r hash ; do \
grep "$hash" /tmp/md5s | cut -d' ' -f3- | tail -n+2 | while IFS= read -r duplicate_file ; \
do rm -- "$duplicate_file" ; done ; done
Or in a somewhat more readable format:
cut -d' ' -f1 /tmp/md5s | sort | uniq -d | while IFS= read -r hash
do
    grep "$hash" /tmp/md5s | cut -d' ' -f3- | tail -n+2 \
        | while IFS= read -r duplicate_file
    do
        rm -- "$duplicate_file"
    done
done
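For reference, here is a complete run of the loop above in a scratch
directory, with read hardened to IFS= read -r (a sketch, assuming GNU
md5sum and its two-space output format; I use a temporary list file
instead of /tmp/md5s):

```shell
# End-to-end: build duplicates, hash them, remove the extras.
dir=$(mktemp -d)
list=$(mktemp)
cd "$dir"
echo "Hello, world" > foo
touch bar
cp foo 'bar mitzvah'
md5sum -- * > "$list"
cut -d' ' -f1 "$list" | sort | uniq -d | while IFS= read -r hash
do
    grep "$hash" "$list" | cut -d' ' -f3- | tail -n+2 \
        | while IFS= read -r duplicate_file
    do
        rm -- "$duplicate_file"
    done
done
ls -Q   # "bar" "bar mitzvah" -- only one copy of the content left
cd / && rm -rf "$dir" "$list"
```

Here the glob orders the list bar, "bar mitzvah", foo, so tail -n+2
keeps foo as the duplicate to remove, and the quoted "$duplicate_file"
leaves "bar mitzvah" intact.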
Unless you plan on removing file names starting with a dash, I suggest
you change the rm line to the following line:
test -f "$duplicate_file" && rm -- "$duplicate_file"
The test may catch something I didn't anticipate.
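As a sketch of what the guard buys you (the stray name below is
hypothetical, standing in for a fragment left over from word
splitting):

```shell
# If something has already mangled a name, test -f fails and rm
# never runs, instead of rm erroring out (or worse, hitting an
# unrelated file that happens to share the fragment's name).
duplicate_file='mitzvah'   # fragment of 'bar mitzvah'; no such file
if test -f "$duplicate_file"
then rm -- "$duplicate_file"
else echo "skipped: not a regular file"
fi
```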
> * $() :sub-shell, legacy as backticks ``, but those are harder to read
> and not nestable. I've nested here.
Technically, backticks are nestable (even in POSIX shell), but I'm
pretty sure they can only be parsed by a computer. For example:
$ echo `echo \`echo \\\`echo foo\\\` bar\` baz` quux
foo bar baz quux
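For comparison, the same nesting with $() needs no escaping at all:

```shell
echo $(echo $(echo $(echo foo) bar) baz) quux
# foo bar baz quux
```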
-Dave
[*] Argumentum ex concessis, I couldn't show a -rf example because
J.P.'s md5sum glob tried using the -rf file as its command line
arguments.
--
David A. Harding Website: http://dtrt.org/
1 (609) 997-0765 Email: dave@dtrt.org
Jabber/XMPP: dharding@jabber.org
___________________________________________________________________________
Philadelphia Linux Users Group -- http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug