bergman on 7 Aug 2012 09:23:27 -0700



Re: [PLUG] UNIX File Equivalence


In the message dated: Tue, 07 Aug 2012 11:18:03 EDT,
The pithy ruminations from Rich Freeman on 
<[PLUG] UNIX File Equivalence> were:
=> Here is a good general UNIX question that somebody might know the answer to.
=> 
=> On UNIX, or at least on Linux, how can I determine if two files are
=> the same, preferably very quickly (ie via a system/library call and

	man 3 stat

It is available (on many Linux distros) as a command-line tool,
and in all(?!) POSIX environments as a system call.

	[SNIP!]

=> 
=> For example, on my system right now the files /usr/tmp/xyz and
=> /var/tmp/xyz are actually the same file (due to symlinks).  Other

I would argue very strongly that a file and a symlink to that file are
NOT the same object. Most ways of referencing those objects produce
the same content, but they are not the same.

There are many instances (security exploits, badly written code, the
stat() and lstat() calls) that distinguish between a symlink to an object
and the object itself.
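
As a rough illustration of that distinction (my own sketch, not something
from the original question): lstat() reports on the link itself, while
stat() follows it to the target. The helper name symlink_target() is made
up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 and copies the link contents into buf if
 * `path` is itself a symbolic link, 0 if it is some other kind of object,
 * and -1 on error (missing path, unreadable link, zero-sized buffer). */
static int symlink_target(const char *path, char *buf, size_t size)
{
    struct stat sb;

    if (size == 0)
        return -1;
    if (lstat(path, &sb) != 0)      /* lstat() does NOT follow the link */
        return -1;
    if (!S_ISLNK(sb.st_mode))
        return 0;

    ssize_t n = readlink(path, buf, size - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';                  /* readlink() does not NUL-terminate */
    return 1;
}
```

Calling stat() on the same path would instead describe whatever the link
points at, which is why the two objects can look identical to most tools.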

=> mechanisms that could cause files to be the same would be hard links

Yes, multiple hard links to the same object are "the same".

	[SNIP!]

=> 
=> My use case is a C function which takes as input a file and needs to
=> determine if that file is one of the files in some list, regardless of
=> how the path takes it there.  So, solutions that can also come up with
=> a deterministic canonical path for a file (even with bind mounts) or

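The inode number returned by stat() would meet that case, as long as you
compare it together with the device number (st_dev): inode numbers are
only unique within a single filesystem.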
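
A minimal sketch of that comparison, assuming both paths can be stat()ed by
the calling process. The function name same_file() is just a placeholder. It
compares st_dev as well as st_ino, since an inode number on its own is only
meaningful within one filesystem.

```c
#include <stdbool.h>
#include <sys/stat.h>

/* Hypothetical helper: true if both paths resolve (after following any
 * symlinks) to the same filesystem object. Returns false if either path
 * cannot be stat()ed. */
static bool same_file(const char *a, const char *b)
{
    struct stat sa, sb;

    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return false;

    /* Same device AND same inode number means the same object. */
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

Hard links and symlinks to the same object both compare equal here, because
stat() follows symlinks and hard links share one inode.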

=> some kind of unique hash for a file would be even better (without
=> reading the file - again I don't care about content being the same -

I don't understand what you mean by "unique hash for a file" ... "without
reading the file". Do you mean a hash of the path to a file? I'd suspect
that that could get very messy. I.e., these may be equivalent in result,
but not in hash:

	sum ~/.bashrc
	sum ~joe/../sam/../mary/../fred/../mark/.bashrc
	ln -s ~paul /tmp/paul ; sum /tmp/paul/../harry/../mark/.bashrc


=> just the file being the same, and reading content is slow anyway).
=> I'd expect this function to be called EXTREMELY often so it should run

Can you give us any more info on the broader use here? Perhaps there's a
better way of doing this.

If you're thinking of doing some kind of real-time deduplication perhaps
the best way to do this is not through a "search" but by building this
logic into the filesystem--maybe through a user-land filesystem.

=> on the order of milliseconds with a search list containing tens of
=> thousands of path/filenames.  If I had to exhaustively fully test the

Hmmm... "milliseconds" to search a list of that length? First of all, the
"list" must be stored in memory by the search program (indexed by inode, if
you're using that as the definitive test for whether file system objects are
'identical'). The problem is that you're not really going to search the
list--the search occurs within the filesystem:

	look up information about the specified filesystem object
	via stat()
		if the object is a symlink, repeat until all links
		are dereferenced*

	look up the inode number returned by stat() in your table of
	inode numbers...if it exists and the object (filename) is not
	the same as the input filename, then you have a duplicate


I suspect that the performance will depend almost entirely on your I/O
configuration--the overhead in doing stat() on each object, not on the
algorithm within stat() or the comparison to your table of filenames. If
you have massive filesystem caching (metadata, at a minimum, content as
well to meet the goal of doing file hashing) or you're using SSD storage,
you might achieve this performance. Maybe. If the system is otherwise idle
and/or well designed.

If you're waiting for rust-coated platters to slowly spin underneath some
magnets, it's not going to happen.

Mark


	* My stat(3) man page states:

		 If the named file is a symbolic link, the stat() function
		 shall continue pathname resolution using the contents
		 of the symbolic link, and shall return information
		 pertaining to the resulting file if the file exists.

	   However, it's not clear to me whether stat() is recursive
	   or not. If there's a chain of symlinks (directories or file
	   objects), will a stat(3) of one object walk the chain to the
	   end, returning information about the final target, or just
	   the next object in the chain?
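
As far as I know, POSIX pathname resolution keeps following symbolic links
until it reaches something that is not a link (giving up with ELOOP if the
chain is too long), so stat() should describe the final target. A small
experiment along these lines would confirm it on a given system. The
function name and the use of /tmp as scratch space are assumptions made for
this sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Builds link2 -> link1 -> target in a scratch directory and checks
 * whether stat(link2) reports the same object as stat(target).
 * Returns 1 if it does, 0 if it does not, -1 if setup or stat() failed. */
static int stat_follows_chain(void)
{
    char dir[] = "/tmp/statchainXXXXXX";    /* assumed scratch location */
    char target[64], link1[64], link2[64];
    struct stat st_target, st_chain;
    int result = -1;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(target, sizeof target, "%s/target", dir);
    snprintf(link1, sizeof link1, "%s/link1", dir);
    snprintf(link2, sizeof link2, "%s/link2", dir);

    FILE *f = fopen(target, "w");
    if (f != NULL) {
        fclose(f);
        /* Link contents are relative to the directory holding the link. */
        if (symlink("target", link1) == 0 &&
            symlink("link1", link2) == 0 &&
            stat(target, &st_target) == 0 &&
            stat(link2, &st_chain) == 0) {
            result = st_target.st_dev == st_chain.st_dev &&
                     st_target.st_ino == st_chain.st_ino;
        }
    }

    /* Clean up whatever was created; failures here are ignored. */
    unlink(link2);
    unlink(link1);
    unlink(target);
    rmdir(dir);
    return result;
}
```

If that holds, the "repeat until all links are dereferenced" step happens
inside stat() itself and needs no loop in the caller.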


=> 
=> Rich
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug