Re: [PLUG] finding duplicate files
> Date: Sat, 22 Nov 2008 17:43:24 -0500
> From: "K.S. Bhaskar" <bhaskar@bhaskars.com>
>
> Let's see if we can do it in 1 line... How about something like:
>
> find . -type f -exec md5sum {} \; | sort | uniq -d -w 32
I'd skip the find [1], since this does the same unless you need to
recurse into directories, which Art didn't say he needed:
md5sum * | sort | uniq -d -w 32
But this method does not work for me. At least, I don't think it does.
My solution gives you a list of files to delete, skipping the first of
the duplicates. This solution (either as written or my shorter one)
only gives the first line of each set of duplicates. So you either miss
some if a file has more than 1 dup, or you have to keep running it until
it doesn't find anything else to delete.
'uniq -D' helps that, but then I don't see how to break out different
sets of dups so you can clean up the ones you don't want:
$ md5sum * | sort | uniq -D -w 32
484bade6c8b3c8147cc03728af90b096 dup1
484bade6c8b3c8147cc03728af90b096 orig1
c04e01c8718c20c983f6cbf6f07911f8 dup2.a
c04e01c8718c20c983f6cbf6f07911f8 dup2.b
c04e01c8718c20c983f6cbf6f07911f8 dup2.c
c04e01c8718c20c983f6cbf6f07911f8 dup2.d
c04e01c8718c20c983f6cbf6f07911f8 orig2
$ md5sum *
484bade6c8b3c8147cc03728af90b096 dup1
c04e01c8718c20c983f6cbf6f07911f8 dup2.a
c04e01c8718c20c983f6cbf6f07911f8 dup2.b
c04e01c8718c20c983f6cbf6f07911f8 dup2.c
c04e01c8718c20c983f6cbf6f07911f8 dup2.d
484bade6c8b3c8147cc03728af90b096 orig1
c04e01c8718c20c983f6cbf6f07911f8 orig2
c7da4fb9f3d537c45b12d3431ed21864 single1
c101e03d872787713f0d6ae169f616cb single2
$ md5sum * | sort | uniq -d -w 32
484bade6c8b3c8147cc03728af90b096 dup1
c04e01c8718c20c983f6cbf6f07911f8 dup2.a
$ md5sum * | sort | uniq -D -w 32
484bade6c8b3c8147cc03728af90b096 dup1
484bade6c8b3c8147cc03728af90b096 orig1
c04e01c8718c20c983f6cbf6f07911f8 dup2.a
c04e01c8718c20c983f6cbf6f07911f8 dup2.b
c04e01c8718c20c983f6cbf6f07911f8 dup2.c
c04e01c8718c20c983f6cbf6f07911f8 dup2.d
c04e01c8718c20c983f6cbf6f07911f8 orig2
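One possible way to break the sets apart, as a rough sketch rather than a
tested replacement for my longer solution: GNU uniq has a long option,
--all-repeated=separate, that behaves like -D but puts a blank line
between sets. And a small awk filter can print every name except the
first one in each set, which is the "list of files to delete" idea. The
function names below are made up for illustration, and both assume GNU
md5sum output (32 hex digits, two separator characters, then the name)
and file names without newlines or backslashes.

```shell
# list_dup_sets FILE...
# Print all duplicated files, one blank line between sets.
# Assumes GNU uniq (for --all-repeated and -w).
list_dup_sets() {
    md5sum -- "$@" | sort | uniq --all-repeated=separate -w 32
}

# list_extra_copies FILE...
# Print only the names of the 2nd, 3rd, ... file in each set, i.e. the
# candidates for deletion. The first name (in sort order) is kept.
# The name starts at column 35: 32 hash characters plus 2 separators.
list_extra_copies() {
    md5sum -- "$@" | sort | awk 'seen[$1]++ { print substr($0, 35) }'
}
```

Something like 'list_extra_copies *' only prints names. Look the list
over before feeding it to rm, since which copy counts as "first" is just
whatever sorts first.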
I must admit I either had forgotten about or wasn't aware of the uniq -w
argument. That's very handy. And I noticed -D when re-reading the man
page. And I certainly tend to come up with complicated solutions.
Though in this case I started simple and kept adding testable layers
until I got a solution. :-)
Later,
JP
__________________________
[1] Note that 'find ... -exec {} \;' will spawn a new process for the
exec for each hit, which is very, very slow. To pick a clearer and more
common example, never do this:
find ... -exec chmod 0775 {} \;
do this:
find ... -print0 | xargs -0 chmod 0775
The -print0 and -0 use the NUL character (a zero byte) as the separator
between file names, which works around things like spaces in file/dir
names.
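Applied to the duplicate-finding pipeline from the top of this message,
the same advice might look like the sketch below. The function names are
mine, not from anyone's post. The first form assumes GNU or BSD find and
xargs for -print0 and -0. The second uses the '-exec ... {} +' form,
which also hands many file names to each md5sum run and needs no xargs.
Both still depend on uniq -w, which is a GNU option.

```shell
# dups_recursive_xargs [DIR]
# Hash every regular file under DIR (default: current directory),
# batching names through xargs, then show all duplicated lines.
dups_recursive_xargs() {
    find "${1:-.}" -type f -print0 | xargs -0 md5sum | sort | uniq -D -w 32
}

# dups_recursive_exec [DIR]
# Same result, letting find batch the file names itself with '+'.
dups_recursive_exec() {
    find "${1:-.}" -type f -exec md5sum {} + | sort | uniq -D -w 32
}
```

For example, 'dups_recursive_exec ~/photos' would list every duplicated
file under that (hypothetical) directory, names with spaces included.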
----------------------------|:::======|-------------------------------
JP Vossen, CISSP |:::======| jp{at}jpsdomain{dot}org
My Account, My Opinions |=========| http://www.jpsdomain.org/
----------------------------|=========|-------------------------------
"Microsoft Tax" = the additional hardware & yearly fees for the add-on
software required to protect Windows from its own poorly designed and
implemented self, while the overhead incidentally flattens Moore's Law.
___________________________________________________________________________
Philadelphia Linux Users Group -- http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug