Paul Jungwirth on 7 Aug 2012 11:40:59 -0700



Re: [PLUG] UNIX File Equivalence


It sounds like using the inode number solves the equivalence problem,
but if you want to find a canonical path to a file (e.g. for display),
the readlink(1) command is very helpful (readlink -f resolves every
symlink along the path).
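
For example, in Python (just an illustration; os.path.realpath resolves
symlinks much the same way readlink -f does):

    import os

    # Collapses ".." components and every symlink in the path into one
    # canonical absolute name for the file.
    print(os.path.realpath("/usr/bin/vi"))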

Paul


On Tue, Aug 7, 2012 at 10:34 AM, Rich Freeman <r-plug@thefreemanclan.net> wrote:
> On Tue, Aug 7, 2012 at 12:23 PM,  <bergman@merctech.com> wrote:
>> I would argue very strongly that a file and a symlink to that file are
>> NOT the same object. Most ways of referencing those objects produce
>> the same content, but they are not the same.
>
> Yup, and that has given me a bit of pause.  I might actually treat
> them as different.  I realize there is a separate system call
> (lstat) that doesn't dereference links.  I was actually more concerned
> with links in the path than the final file being a link.
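>
> (A rough, untested Python sketch of that distinction, just to be clear
> which call I mean:)
>
>     import os
>
>     st = os.stat("/some/path")    # follows symlinks to the real file
>     lst = os.lstat("/some/path")  # describes the link itself, if it is one
>     same = (st.st_dev, st.st_ino) == (lst.st_dev, lst.st_ino)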
>
>> I don't understand what you mean by "unique hash for a file" ... "without
>> reading the file". Do you mean a hash of the path to a file? I'd suspect
>> that could get very messy. I.e., these may be equivalent in result,
>> but not in hash:
>
> The inode number works fine.  By hash I meant a unique ID for the file
> - that is some function I could pass a path/filename into and quickly
> get a value back, such that I always get the same value no matter how
> I navigate to the file.  Hashes are good for quickly searching lists,
> since a good hash function distributes values randomly which leads to
> balanced trees and other good things.
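>
> Something like this is all I mean (untested Python sketch):
>
>     import os
>
>     def file_key(path):
>         """Cheap unique ID for a file: (device, inode); no content read."""
>         st = os.stat(path)
>         return (st.st_dev, st.st_ino)
>
> The tuple is hashable, so it drops straight into a set or dict for fast
> lookups, and any path that reaches the same file produces the same key.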
>
>> => just the file being the same, and reading content is slow anyway).
>> => I'd expect this function to be called EXTREMELY often so it should run
>>
>> Can you give us any more info on the broader use here? Perhaps there's a
>> better way of doing this.
>
> I figured I'd save everybody the details, but I'm sure some would be
> curious.  Anybody following my posts recently on gentoo-dev would
> probably have put two and two together.  I'm looking to implement a QA
> check in the Gentoo package manager that detects when a package build
> system accesses a file that is not part of a declared dependency.
> These kinds of undeclared dependencies can cause problems down the
> road and modern build systems tend to let them creep in (automagic
> detection of dependencies and such - something binary distros handle
> via tools like cowbuilder).
>
> Gentoo already has a sandbox mechanism for package builds.  This
> mechanism intercepts the system calls used to open files and blocks
> writes to files outside of a defined area.  I'd like to extend this to
> block reads of files a package shouldn't be using.
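>
> The read-side check I have in mind boils down to something like this
> (untested sketch; the names are made up):
>
>     import os
>
>     def read_allowed(path, allowed_keys):
>         """allowed_keys: set of (st_dev, st_ino) pairs for declared deps."""
>         try:
>             st = os.stat(path)
>         except OSError:
>             return True  # nothing there to protect; let open() fail itself
>         return (st.st_dev, st.st_ino) in allowed_keys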
>
> The performance issues are obvious - when building a package there
> will be thousands of file access attempts, and there are thousands of
> legitimate files that could be accessed.  Any check has to be very
> fast to avoid becoming a bottleneck.
>
>> Hmmm... "milliseconds" to search a list of that length? First of all, the
>> "list" must be stored in memory by the search program (index by inode, if
>> you're using that as the definitive test for whether file system objects are
>> 'identical'). The problem is that you're not really going to search the
>> list--the search occurs within the filesystem:
>
> Agreed, but there are some mitigating factors here:
>
> 1.  I can cache the device/inode numbers for everything on the list
> I'm checking against, so I only pay that cost once (there's a rough
> sketch of this after point 3 below).  I don't know if device and inode
> IDs are guaranteed to be stable for any particular filesystem, but if
> they are I might even be able to cache them across many runs.  My list
> to search would be stored in memory.  Getting it into memory in the
> first place would require I/O, but that would only happen once, and I
> suspect it would be pretty fast (certainly fast compared to building a
> package).
>
> 2.  The only time I'd be checking a file is if the build system
> attempted to open it.  So, any effort loading inodes into memory for
> my check is going to save effort doing the same when the open attempt
> is allowed.  The file read is still going to require seeking to the
> actual extents so that will still take longer.
>
> 3.  All file accesses are going to be coming from lists elsewhere -
> none of this requires systematic searching of trees and such, so as
> long as dir_index is enabled (or the equivalent is native to the
> filesystem, as with btrfs), loading the inodes should be pretty fast
> (I think ext4 now uses btrees and such).
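>
> The caching from point 1 above would be something like this (untested
> sketch; where the list of paths comes from is hand-waved):
>
>     import os
>
>     def build_allowed_keys(paths):
>         """stat() each declared-dependency file once and keep its
>         (device, inode) pair in an in-memory set."""
>         keys = set()
>         for path in paths:
>             try:
>                 st = os.stat(path)
>             except OSError:
>                 continue  # listed file missing; nothing to allow
>             keys.add((st.st_dev, st.st_ino))
>         return keys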
>
> I guess the real performance constraint isn't that the check happens
> quickly so much as that it adds no significant overhead to opening the
> file itself.
>
> BTW, if anybody wants to learn how modern filesystems work, I learned
> quite a bit reading up on the btrfs documentation.  It is pretty
> impressive how much thought goes into cutting down the number of seeks
> needed to get to a file.  One of the design constraints for btrfs was
> that it handle VERY large filesystems, such that the metadata would
> not be expected to fit in RAM (think petabytes to exabytes of content,
> with gigabytes to terabytes just for the metadata).  On most desktop
> setups quite a bit of your filesystem data ends up being cached, which
> should make stat calls much faster.
>
> Rich



-- 
_________________________________
Pulchritudo splendor veritatis.
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug