David A. Harding on 23 Nov 2008 05:48:41 -0800



Re: [PLUG] finding duplicate files


I know you encouraged a lot of testing to get here, but I think this is
bad:

On Sat, Nov 22, 2008 at 03:06:42PM -0500, JP Vossen wrote:
> # When you are sure, replace echo with rm:
> $ rm $(for hash in $(cut -d' ' -f1 /tmp/md5s | sort | uniq -d); \
>      do echo; grep "$hash" /tmp/md5s | cut -d' ' -f3- | tail -n+2; done)

A common example of an incorrectly removed file:

    $ echo "Hello, world" > foo	    # Make file foo
    $ touch bar 		    # File bar not same as file foo
    $ cp foo 'bar mitzvah'	    # File bar mitzvah same as file foo
    $ ls -Q
    "bar"  "bar mitzvah"  "foo"
    $ md5sum * > /tmp/md5s
    $ rm $(for hash in $(cut -d' ' -f1 /tmp/md5s | sort | uniq -d); do echo; grep "$hash" /tmp/md5s | cut -d' ' -f3- | tail -n+2; done) 
    rm: cannot remove `mitzvah': No such file or directory
    $ ls -Q
    "bar mitzvah"  "foo"	    # Whoops, we removed file bar instead 
				    #   of file bar mitzvah

An unlikely but disastrous possibility:

    $ echo "Hello, world" > foo	    # Make file foo
    $ cp foo '*'		    # File * same as file foo
    $ ls -Q
    "*"  "foo"
    $ md5sum * > /tmp/md5s                                            
    $ rm $(for hash in $(cut -d' ' -f1 /tmp/md5s | sort | uniq -d); do echo; grep "$hash" /tmp/md5s | cut -d' ' -f3- | tail -n+2; done)
    $ ls -Q
				    # Everything has been deleted

I suggest you use rm's -- option (standard across POSIX utilities, not
just GNU) when removing arbitrary filenames. This option prevents rm
from interpreting filenames as command line options. For example,
imagine removing a file named "-rf" [*].
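Outside the md5sum pipeline, the -- behavior is easy to demonstrate on
its own (a small sketch in a scratch directory, with made-up filenames):

```shell
# Sketch: a file literally named "-rf" looks like options to rm.
cd "$(mktemp -d)"            # scratch directory so nothing real is at risk
touch ./-rf victim           # "./-rf" creates the file without confusing touch

rm -rf 2>/dev/null           # without --, rm parses "-rf" as options; file survives
rm -- -rf                    # with --, "-rf" is treated as an operand and removed
```

Prefixing the name with ./ (rm ./-rf) gives the same protection and also
works with commands that do not support --.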

I also suggest you use while-read loops for file names. Using the read
builtin (with IFS= so leading whitespace survives, and -r so backslashes
do) lets us work with whole lines. For example, a rewrite of J.P.'s
code using a while-read loop follows:

    $ cut -d' ' -f1 /tmp/md5s | sort | uniq -d | while read hash ; do \
    grep "$hash" /tmp/md5s | cut -d' ' -f3- | tail -n+2 | while IFS= read -r duplicate_file ; \
    do rm -- "$duplicate_file" ; done ; done

Or in a somewhat more readable format:

cut -d' ' -f1 /tmp/md5s | sort | uniq -d | while read hash
do
    grep "$hash" /tmp/md5s | cut -d' ' -f3- | tail -n+2 \
    | while IFS= read -r duplicate_file
    do
	rm -- "$duplicate_file"
    done
done
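To check my own work, here is the loop run end to end in a scratch
directory (filenames here are invented). Note which copy is kept:
tail -n+2 preserves whichever duplicate sorts first in the glob.

```shell
cd "$(mktemp -d)"
echo "Hello, world" > "keep me"    # original; name contains a space
cp "keep me" "z duplicate"         # identical content, sorts after the original
echo "something else" > unique     # unique content, must not be touched

md5sum * > ./md5s                  # hash files in glob (sorted) order

cut -d' ' -f1 ./md5s | sort | uniq -d | while read hash
do
    grep "$hash" ./md5s | cut -d' ' -f3- | tail -n+2 \
    | while IFS= read -r duplicate_file
    do
        rm -- "$duplicate_file"
    done
done
```

On this run the only duplicated hash belongs to "keep me" and
"z duplicate"; grep lists them in file order, tail -n+2 drops the first,
so only "z duplicate" is removed and "keep me" and unique survive.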

Unless you plan on removing file names starting with a dash, I suggest
you change the rm line to the following line:

	test -f "$duplicate_file" && rm -- "$duplicate_file"

The test may catch something I didn't anticipate.
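As a sketch of what the guard buys (contrived names, and the same &&
line spelled as if/then so a failed test doesn't read as an error):

```shell
cd "$(mktemp -d)"
mkdir subdir                     # a directory that could share a name in a stale list
: > real-file                    # an ordinary empty file

for duplicate_file in subdir already-gone real-file
do
    if test -f "$duplicate_file"
    then
        rm -- "$duplicate_file"  # runs only for real-file
    fi
done
```

The directory and the nonexistent name are skipped silently; only the
regular file is removed.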

> * $()	:sub-shell, legacy as backticks ``, but those are harder to read 
> and not nestable.  I've nested here.

Technically, backticks are nestable (even in POSIX shell), but I'm pretty
sure they can only be parsed by a computer.  For example:

    $ echo `echo \`echo \\\`echo foo\\\` bar\` baz` quux
    foo bar baz quux

-Dave

[*] Argumentum ex concessis: I couldn't show a -rf example because
    J.P.'s md5sum glob tried using the -rf file as its command line
    options.
-- 
David A. Harding	    Website:  http://dtrt.org/
1 (609) 997-0765	      Email:  dave@dtrt.org
			Jabber/XMPP:  dharding@jabber.org
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug