Rich Freeman on 7 Aug 2012 10:34:29 -0700 |
[Date Prev] [Date Next] [Thread Prev] [Thread Next] [Date Index] [Thread Index]
Re: [PLUG] UNIX File Equivalence |
On Tue, Aug 7, 2012 at 12:23 PM, <bergman@merctech.com> wrote: > I would argue very strongly that a file and a symlink to that file are > NOT the same objext. Most ways of referencing those objects produce > the same content, but they are not the same. Yup, and that has given me a bit of pause. I might actually treat them as different. I realize there is a separate system call that doesn't dereference links. I was actually more concerned with links in the path than the final file being a link. > I don't understand what you mean by "unique hash for a file" ... "without > reading the file". Do you mean a hash of the path to a file? I'd suspect > that that could get very messy. Ie., these may be equivalent in result, > but not in hash: The inode number works fine. By hash I meant a unique ID for the file - that is some function I could pass a path/filename into and quickly get a value back, such that I always get the same value no matter how I navigate to the file. Hashes are good for quickly searching lists, since a good hash function distributes values randomly which leads to balanced trees and other good things. > => just the file being the same, and reading content is slow anyway). > => I'd expect this function to be called EXTREMELY often so it should run > > Can you give us any more info on the broader use here? Perhaps there's a > better way of doing this. I figured I'd save everybody the details, but I'm sure some would be curious. Anybody following my posts recently on gentoo-dev would probably have put two and two together. I'm looking to implement a QA check in the Gentoo package manager that detects when a package build system accesses a file that is not part of a declared dependency. These kinds of undeclared dependencies can cause problems down the road and modern build systems tend to let them creep in (automagic detection of dependencies and such - something binary distros handle via tools like cowbuilder). Gentoo already has a sandbox mechanism for package builds. This mechanism intercepts system calls to file opens and blocks writes to files outside of a defined area. I'd like to extend this to blocking reads to files a package shouldn't be using. The performance issues are obvious - when building a package there will be thousands of file access attempts, and there are thousands of legitimate files that could be accessed. Any check has to be very fast to avoid becoming a bottleneck. > Hmmm... "milliseconds" to search a list of that length? First of all, the > "list" must be stored in memory by the search program (index by inode, if > you're using that as the definitive test for whether file system objects are > 'identical'). The problem is that you're not really going to search the > list--the search occurs within in the filesystem: Agreed, but there are some mitigating factors here: 1. I can cache the device/inode numbers for everything on the list I'm checking against, so I'm only checking that once. I don't know if device and inode ids are guaranteed to be constant for any particular filesystem but if they are I might even be able to cache them across many runs. My list to search would be stored in memory. Getting it into memory in the first place would require io, but would only happen once, and I suspect it would be pretty fast (certainly fast compared to building a package). 2. The only time I'd be checking a file is if the build system attempted to open it. So, any effort loading inodes into memory for my check is going to save effort doing the same when the open attempt is allowed. The file read is still going to require seeking to the actual extents so that will still take longer. 3. All file accesses are going to be coming from lists elsewhere - none of this requires systematic searching of trees and such so as long as dir_index is implemented (or native to the filesystem as with btrfs) loading the inodes should be pretty fast (I think ext4 now uses btrees and such). I guess the real performance constraint isn't that the check happens quickly so much as it adds no significant overhead to opening the file itself. BTW, if anybody wants to learn how modern filesystems work, I learned quite a bit reading up on the btrfs documentation. It is pretty impressive how much thought goes into cutting down the number of seeks needed to get to a file. One of the design constraints for btrfs was that it be designed to handle VERY large filesystems such that the metadata would not be expected to fit in ram (think petabytes to exabytes of content, with gigabytes to terrabytes just for the metadata). On most desktop setups quite a bit of your filesystem data ends up being cached which should make stat calls much faster. Rich ___________________________________________________________________________ Philadelphia Linux Users Group -- http://www.phillylinux.org Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug