bergman on 7 Aug 2012 09:23:27 -0700
Re: [PLUG] UNIX File Equivalence
In the message dated: Tue, 07 Aug 2012 11:18:03 EDT,
The pithy ruminations from Rich Freeman on
<[PLUG] UNIX File Equivalence> were:
=> Here is a good general UNIX question that somebody might know the answer to.
=>
=> On UNIX, or at least on Linux, how can I determine if two files are
=> the same, preferably very quickly (ie via a system/library call and

    man 3 stat

It is available (on many Linux distros) as a command-line tool, and in all(?!)
POSIX environments as a system call.

    [SNIP!]

=>
=> For example, on my system right now the files /usr/tmp/xyz and
=> /var/tmp/xyz are actually the same file (due to symlinks). Other

I would argue very strongly that a file and a symlink to that file are NOT the
same object. Most ways of referencing those objects produce the same content,
but they are not the same. There are many instances (security exploits, badly
written code, the lstat() call) that distinguish between a symlink to an
object and the object itself.

=> mechanisms that could cause files to be the same would be hard links

Yes, multiple hard links to the same object are "the same".

    [SNIP!]

=>
=> My use case is a C function which takes as input a file and needs to
=> determine if that file is one of the files in some list, regardless of
=> how the path takes it there. So, solutions that can also come up with
=> a deterministic canonical path for a file (even with bind mounts) or

The inode number returned by stat() would meet that case, provided it is
compared together with the device number (st_dev), since inode numbers are
only unique within a single filesystem.

=> some kind of unique hash for a file would be even better (without
=> reading the file - again I don't care about content being the same -

I don't understand what you mean by "unique hash for a file" ... "without
reading the file". Do you mean a hash of the path to a file? I suspect that
could get very messy.
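For what it's worth, the stat() suggestion might look roughly like this in C.
It is only a minimal sketch: the function name same_file is made up for
illustration, and it assumes that following symlinks is the behaviour you want
(swap in lstat() if a link should count as its own object). It compares st_dev
together with st_ino because an inode number is only unique within one
filesystem.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <sys/stat.h>

/* Hypothetical helper: true when both paths resolve to the same
 * filesystem object. stat() follows symlinks, so a symlink and its
 * target compare equal here. Any stat() failure (for example a missing
 * file) is treated as "not the same". */
bool same_file(const char *path_a, const char *path_b)
{
    struct stat a, b;

    if (stat(path_a, &a) != 0 || stat(path_b, &b) != 0)
        return false;

    /* Inode numbers repeat across filesystems; the device disambiguates. */
    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}
```

No file content is read; the cost is two stat() calls per comparison.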
For example, if you hash the path, these may be equivalent in result, but not
in hash:

    sum ~/.bashrc
    sum ~joe/../sam/../mary/../fred/../mark/.bashrc
    ln -s ~paul /tmp/paul ; sum /tmp/paul/../harry/../mark/.bashrc

=> just the file being the same, and reading content is slow anyway).
=> I'd expect this function to be called EXTREMELY often so it should run

Can you give us any more info on the broader use here? Perhaps there's a better
way of doing this. If you're thinking of doing some kind of real-time
deduplication, perhaps the best way to do this is not through a "search" but by
building this logic into the filesystem--maybe through a user-land filesystem.

=> on the order of milliseconds with a search list containing tens of
=> thousands of path/filenames. If I had to exhaustively fully test the

Hmmm... "milliseconds" to search a list of that length?

First of all, the "list" must be stored in memory by the search program
(indexed by inode, if you're using that as the definitive test for whether
filesystem objects are 'identical').

The problem is that you're not really going to search the list--the search
occurs within the filesystem:

    look up information about the specified filesystem object via stat()

    if the object is a symlink, repeat until all links are dereferenced*

    look up the inode number returned by stat() in your table of inode
    numbers...if it exists and the object (filename) is not the same as
    the input filename, then you have a duplicate

I suspect that the performance will depend almost entirely on your I/O
configuration--the overhead in doing stat() on each object, not on the
algorithm within stat() or the comparison to your table of filenames.

If you have massive filesystem caching (metadata, at a minimum, content as well
to meet the goal of doing file hashing) or you're using SSD storage, you might
achieve this performance. Maybe. If the system is otherwise idle and/or well
designed.
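The steps above can be sketched in C roughly as follows. Treat it as an
illustration under assumptions, not a definitive implementation: the names
file_id, file_table, file_table_build, file_table_contains and file_table_free
are hypothetical, and a sorted array searched with bsearch() stands in for
whatever in-memory table you prefer (a hash table would do as well). Each known
path is stat()ed once when the table is built, so a lookup costs one stat() on
the candidate plus a binary search. stat() resolves a whole chain of symlinks
to the final target by itself (and fails with ELOOP if the chain is too long),
so the sketch has no explicit dereferencing loop. The keys are (st_dev, st_ino)
pairs, since an inode number alone is only unique within one filesystem. The
sketch only reports membership; it does not do the filename comparison from the
last step.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Identity of a filesystem object: inode number plus its device. */
struct file_id {
    dev_t dev;
    ino_t ino;
};

/* Hypothetical container: identities of the known files, kept sorted. */
struct file_table {
    struct file_id *ids;
    size_t count;
};

/* Total order on (dev, ino) so that qsort() and bsearch() agree. */
static int cmp_file_id(const void *pa, const void *pb)
{
    const struct file_id *a = pa, *b = pb;

    if (a->dev != b->dev)
        return a->dev < b->dev ? -1 : 1;
    if (a->ino != b->ino)
        return a->ino < b->ino ? -1 : 1;
    return 0;
}

/* stat() every listed path once, up front. Paths that cannot be
 * stat()ed are skipped. Returns 0 on success, -1 if malloc() fails. */
int file_table_build(struct file_table *t, const char *const *paths, size_t n)
{
    struct stat st;

    t->count = 0;
    t->ids = malloc((n ? n : 1) * sizeof *t->ids);
    if (t->ids == NULL)
        return -1;

    for (size_t i = 0; i < n; i++) {
        if (stat(paths[i], &st) == 0) {
            t->ids[t->count].dev = st.st_dev;
            t->ids[t->count].ino = st.st_ino;
            t->count++;
        }
    }
    qsort(t->ids, t->count, sizeof *t->ids, cmp_file_id);
    return 0;
}

/* One stat() on the candidate, then an in-memory binary search. */
bool file_table_contains(const struct file_table *t, const char *path)
{
    struct stat st;
    struct file_id key;

    if (stat(path, &st) != 0)
        return false;
    key.dev = st.st_dev;
    key.ino = st.st_ino;
    return bsearch(&key, t->ids, t->count, sizeof key, cmp_file_id) != NULL;
}

/* Release the table's memory. */
void file_table_free(struct file_table *t)
{
    free(t->ids);
    t->ids = NULL;
    t->count = 0;
}
```

The table is a snapshot: if a listed file is deleted or replaced after the
build, its inode number may be reused by another file, so the table has to be
rebuilt when the list or the files change.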
If you're waiting for rust-coated platters to slowly spin underneath some
magnets, it's not going to happen.

Mark

* My stat(3) man page states:

    If the named file is a symbolic link, the stat() function shall
    continue pathname resolution using the contents of the symbolic link,
    and shall return information pertaining to the resulting file if the
    file exists.

  However, it's not clear to me whether stat() is recursive or not. If there's
  a chain of symlinks (directories or file objects), will a stat(3) of one
  object walk the chain to the end, returning information about the final
  target, or just the next object in the chain?

=>
=> Rich

___________________________________________________________________________
Philadelphia Linux Users Group -- http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug