JP Vossen on 23 Nov 2008 13:46:48 -0800



Re: [PLUG] finding duplicate files


Date: Sun, 23 Nov 2008 08:48:23 -0500
 > From: "David A. Harding" <dave@dtrt.org>
 >
 > A common example of an incorrectly removed file:

  	<snip stuff with spaces>

 > An unlikely but disastrous possibility:

  	<snip>

Both good points, which is why I stressed testing.


 > I suggest you use GNU rm's -- option when removing arbitrary filenames.
 > This option prevents rm from interpreting filenames as command line
 > options. For example, imagine removing a file named "-rf" [*].

Good point.  See: 
http://www.gnu.org/software/coreutils/faq/coreutils-faq.html#How-do-I-remove-files-that-start-with-a-dash_003f
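
A minimal sketch of the "--" point, run in a scratch directory from
mktemp so nothing real is touched. The file names are made up for
illustration. As far as I know "--" is not GNU-only: the POSIX utility
syntax guidelines describe it as the end-of-options marker. A leading
"./" is the other usual trick.

```shell
# Work in a throwaway directory (hypothetical demo setup).
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Create a file whose name looks like an option, plus an ordinary file.
touch -- '-rf' 'keep me.txt'

# With "--", rm treats -rf as a file name rather than as options.
rm -- '-rf'

# A leading ./ also works, because the argument no longer starts with a dash.
touch ./-i
rm ./-i

# Only the ordinary file should be left.
remaining=$(ls -A)
echo "$remaining"    # prints: keep me.txt

# Clean up the scratch directory.
cd / && rm -r -- "$demo_dir"
```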


 > I also suggest you use while-read loops for file names. Using the read
 > builtin lets us work with whole lines. For example, a rewrite of J.P.'s
 > code using a while-read loop follows:
 >
 >     $ cut -d' ' -f1 /tmp/md5s | sort | uniq -d | while read hash ; do \
 >     grep "$hash" /tmp/md5s | cut -d' ' -f3- | tail -n+2 | while read duplicate_file ; \
 >     do rm -- "$duplicate_file" ; done ; done


Yeah, you got me there!  And I can't think of a way to handle spaces
using my method, other than this.  Good one.  Though I'd still TEST,
TEST, TEST first, by replacing the 'rm' with 'echo'.


 > Unless you plan on removing file names starting with a dash, I
 > suggest you change the rm line to the following line:
 >
 > 	test -f "$duplicate_file" && rm -- "$duplicate_file"
 >
 > The test may catch something I didn't anticipate.

That's cool.  I'd write it like this ('[' is just another name for
'test', so use whichever you like), with 'echo' standing in for 'rm'
while testing:
	[ -f "$duplicate_file" ] && echo -- "$duplicate_file"

So you get:
$ cut -d' ' -f1 /tmp/md5s | sort | uniq -d | \
while read hash ; do grep "$hash" /tmp/md5s|cut -d' ' -f3-|tail -n+2 | \
while read duplicate_file; \
do [ -f "$duplicate_file" ] && echo -- "$duplicate_file" ; done ; done
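
To make the TEST, TEST, TEST part concrete, here is a rough dry run
against a scratch directory instead of /tmp/md5s. It assumes the hash
list is plain md5sum output ("HASH  NAME"), and the file names are
invented. Three things differ from the pipeline above and are my own
additions, not part of the thread: "IFS= read -r" (so leading blanks
and backslashes in names survive), an anchored grep (so a hash only
matches at the start of a line), and printf instead of "echo --".

```shell
# Scratch area: two identical files (one name has spaces) and one unique file.
work=$(mktemp -d)
list=$(mktemp)
printf 'same\n'  > "$work/a.txt"
printf 'same\n'  > "$work/copy of a.txt"
printf 'other\n' > "$work/b.txt"

# Build the hash list; md5sum writes "HASH  NAME", one file per line.
md5sum "$work"/* > "$list"

# Dry run: collect the names that WOULD be removed, without removing anything.
dupes=$(
  cut -d' ' -f1 "$list" | sort | uniq -d |
  while IFS= read -r hash; do
    # Keep the first file of each group, list the rest.
    grep "^$hash " "$list" | cut -d' ' -f3- | tail -n +2 |
    while IFS= read -r duplicate_file; do
      # Swap printf for: rm -- "$duplicate_file"  only after checking the output.
      [ -f "$duplicate_file" ] && printf '%s\n' "$duplicate_file"
    done
  done
)

# Prints the full path of "copy of a.txt" and nothing else.
echo "$dupes"
```

The name with spaces comes through as one line because read takes the
whole line rather than letting the shell split it into words.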


 >>> * $()	:sub-shell, legacy as backticks ``, but those are harder
 >>> to read and not nestable.  I've nested here.
 >
 > Technically, backticks are nestable (even in POSIX shell), but I'm
 > pretty sure they can only be parsed by a computer.  For example:
 >
 >     $ echo `echo \`echo \\\`echo foo\\\` bar\` baz` quux
 >     foo bar baz quux

Damn, got me again!  I'd have sworn they were not nestable (and have 
said that in various presos), but you are correct.  I can trace my 
knowledge of that issue to the footnote on page 100 of _Learning the 
bash Shell 3rd_, but I'd read or remembered it wrong.  It says "less 
conducive to nesting" but I'd remembered that as not possible.
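
For comparison, a small sketch of the same three-level nesting written
both ways. The backtick line is the one quoted above, and the variable
names are just for illustration.

```shell
# Backticks: every extra level of nesting needs more backslashes.
old=`echo \`echo \\\`echo foo\\\` bar\` baz`

# $(): the same nesting with no escaping at all.
new=$(echo $(echo $(echo foo) bar) baz)

echo "$old quux"    # prints: foo bar baz quux
echo "$new quux"    # prints: foo bar baz quux
```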


Thanks for the great catches, hope I didn't goof Art up,
JP
----------------------------|:::======|-------------------------------
JP Vossen, CISSP            |:::======|        jp{at}jpsdomain{dot}org
My Account, My Opinions     |=========|      http://www.jpsdomain.org/
----------------------------|=========|-------------------------------
"Microsoft Tax" = the additional hardware & yearly fees for the add-on
software required to protect Windows from its own poorly designed and
implemented self, while the overhead incidentally flattens Moore's Law.