Paul Jungwirth on 7 Aug 2012 11:40:59 -0700
Re: [PLUG] UNIX File Equivalence

It sounds like using the inode number solves the equivalence problem,
but if you want to find a canonical path to a file (e.g. for display),
the readlink(1) command is very helpful.

Paul

On Tue, Aug 7, 2012 at 10:34 AM, Rich Freeman <r-plug@thefreemanclan.net> wrote:
> On Tue, Aug 7, 2012 at 12:23 PM, <bergman@merctech.com> wrote:
>> I would argue very strongly that a file and a symlink to that file are
>> NOT the same object. Most ways of referencing those objects produce
>> the same content, but they are not the same.
>
> Yup, and that has given me a bit of pause. I might actually treat
> them as different. I realize there is a separate system call that
> doesn't dereference links. I was actually more concerned with links
> in the path than the final file being a link.
>
>> I don't understand what you mean by "unique hash for a file" ... "without
>> reading the file". Do you mean a hash of the path to a file? I'd suspect
>> that that could get very messy. I.e., these may be equivalent in result,
>> but not in hash:
>
> The inode number works fine. By hash I meant a unique ID for the file
> - that is, some function I could pass a path/filename into and quickly
> get a value back, such that I always get the same value no matter how
> I navigate to the file. Hashes are good for quickly searching lists,
> since a good hash function distributes values randomly which leads to
> balanced trees and other good things.
>
>> => just the file being the same, and reading content is slow anyway).
>> => I'd expect this function to be called EXTREMELY often so it should run
>>
>> Can you give us any more info on the broader use here? Perhaps there's a
>> better way of doing this.
>
> I figured I'd save everybody the details, but I'm sure some would be
> curious. Anybody following my posts recently on gentoo-dev would
> probably have put two and two together.
> I'm looking to implement a QA
> check in the Gentoo package manager that detects when a package build
> system accesses a file that is not part of a declared dependency.
> These kinds of undeclared dependencies can cause problems down the
> road and modern build systems tend to let them creep in (automagic
> detection of dependencies and such - something binary distros handle
> via tools like cowbuilder).
>
> Gentoo already has a sandbox mechanism for package builds. This
> mechanism intercepts system calls to file opens and blocks writes to
> files outside of a defined area. I'd like to extend this to blocking
> reads to files a package shouldn't be using.
>
> The performance issues are obvious - when building a package there
> will be thousands of file access attempts, and there are thousands of
> legitimate files that could be accessed. Any check has to be very
> fast to avoid becoming a bottleneck.
>
>> Hmmm... "milliseconds" to search a list of that length? First of all, the
>> "list" must be stored in memory by the search program (index by inode, if
>> you're using that as the definitive test for whether file system objects are
>> 'identical'). The problem is that you're not really going to search the
>> list--the search occurs within the filesystem:
>
> Agreed, but there are some mitigating factors here:
>
> 1. I can cache the device/inode numbers for everything on the list
> I'm checking against, so I'm only checking that once. I don't know if
> device and inode ids are guaranteed to be constant for any particular
> filesystem but if they are I might even be able to cache them across
> many runs. My list to search would be stored in memory. Getting it
> into memory in the first place would require io, but would only happen
> once, and I suspect it would be pretty fast (certainly fast compared
> to building a package).
>
> 2. The only time I'd be checking a file is if the build system
> attempted to open it.
> So, any effort loading inodes into memory for
> my check is going to save effort doing the same when the open attempt
> is allowed. The file read is still going to require seeking to the
> actual extents so that will still take longer.
>
> 3. All file accesses are going to be coming from lists elsewhere -
> none of this requires systematic searching of trees and such so as
> long as dir_index is implemented (or native to the filesystem as with
> btrfs) loading the inodes should be pretty fast (I think ext4 now uses
> btrees and such).
>
> I guess the real performance constraint isn't that the check happens
> quickly so much as it adds no significant overhead to opening the file
> itself.
>
> BTW, if anybody wants to learn how modern filesystems work, I learned
> quite a bit reading up on the btrfs documentation. It is pretty
> impressive how much thought goes into cutting down the number of seeks
> needed to get to a file. One of the design constraints for btrfs was
> that it be designed to handle VERY large filesystems such that the
> metadata would not be expected to fit in ram (think petabytes to
> exabytes of content, with gigabytes to terabytes just for the
> metadata). On most desktop setups quite a bit of your filesystem data
> ends up being cached which should make stat calls much faster.
>
> Rich
> ___________________________________________________________________________
> Philadelphia Linux Users Group -- http://www.phillylinux.org
> Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
> General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug

--
_________________________________
Pulchritudo splendor veritatis.
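
For anyone who wants to try the idea discussed in this thread, here is a
rough sketch in Python. It identifies a file by its (device, inode) pair
from stat(2), caches those pairs once for the allowed files, and then
checks each accessed path with one stat call plus a set lookup. The
function names (file_id, build_allowed_ids, is_allowed) are made up for
illustration and are not part of Gentoo's sandbox. For a canonical path
to display, os.path.realpath does roughly what "readlink -f" does.

```python
import os


def file_id(path, follow_symlinks=True):
    """Return (device, inode) for path.

    With follow_symlinks=True this uses stat(2), so every route to the
    same file (symlinks, "..", hard links) yields the same pair.
    With follow_symlinks=False it uses lstat(2) and identifies the
    symlink itself rather than its target.
    """
    st = os.stat(path) if follow_symlinks else os.lstat(path)
    return (st.st_dev, st.st_ino)


def build_allowed_ids(paths):
    """Stat each allowed path once and cache its identity in a set."""
    allowed = set()
    for p in paths:
        try:
            allowed.add(file_id(p))
        except OSError:
            # Assumption for this sketch: missing entries are skipped.
            pass
    return allowed


def is_allowed(path, allowed_ids):
    """Check an accessed path against the cached identities."""
    try:
        return file_id(path) in allowed_ids
    except OSError:
        return False


if __name__ == "__main__":
    import sys

    # Hypothetical usage: first argument is the allowed file,
    # remaining arguments are paths to check against it.
    if len(sys.argv) > 2:
        allowed = build_allowed_ids([sys.argv[1]])
        for candidate in sys.argv[2:]:
            print(candidate, os.path.realpath(candidate),
                  is_allowed(candidate, allowed))
```

One caveat on caching across runs: inode numbers can be reused once a
file is deleted, so a cache kept between builds would need to be
refreshed whenever the allowed files change.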