Rich Freeman on 7 Aug 2012 10:34:29 -0700



Re: [PLUG] UNIX File Equivalence


On Tue, Aug 7, 2012 at 12:23 PM,  <bergman@merctech.com> wrote:
> I would argue very strongly that a file and a symlink to that file are
> NOT the same object. Most ways of referencing those objects produce
> the same content, but they are not the same.

Yup, and that has given me a bit of pause.  I might actually treat
them as different.  I realize there is a separate system call (lstat)
that doesn't dereference links.  I was actually more concerned with
links in the path than with the final component being a link.
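To illustrate the distinction (a Python sketch of the idea, not the sandbox code itself): stat() dereferences symlinks while lstat() does not, so a link and its target compare equal under the first and differ under the second.

```python
import os
import tempfile

def same_object(path_a, path_b, follow_symlinks=True):
    """Compare two paths by (device, inode).  With follow_symlinks=True
    (stat), a symlink is resolved to its target; with False (lstat),
    the link itself is compared."""
    st = os.stat if follow_symlinks else os.lstat
    sa, sb = st(path_a), st(path_b)
    return (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "real.txt")
    link = os.path.join(d, "link.txt")
    open(target, "w").close()
    os.symlink(target, link)

    # Dereferenced: the link resolves to the target's inode.
    deref_equal = same_object(link, target)
    # Not dereferenced: the link is its own object with its own inode.
    raw_equal = same_object(link, target, follow_symlinks=False)
```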

> I don't understand what you mean by "unique hash for a file" ... "without
> reading the file". Do you mean a hash of the path to a file? I'd suspect
> that that could get very messy. Ie., these may be equivalent in result,
> but not in hash:

The inode number works fine.  By hash I meant a unique ID for the
file - that is, some function I could pass a path/filename into and
quickly get a value back, such that I always get the same value no
matter how I navigate to the file.  Hashes are good for quickly
searching lists, since a good hash function distributes values
randomly, which leads to balanced trees and other good things.
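In Python terms (again just a sketch of the idea), the (st_dev, st_ino) pair from stat() already behaves like such an ID: cheap to compute, stable across different paths to the same file, and usable directly as a dict or set key, all without reading the file's contents.

```python
import os
import tempfile

def file_key(path):
    """A cheap unique ID for a file: the (device, inode) pair from
    stat().  Any path that reaches the same file yields the same key,
    without reading the file's contents."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "a.txt")
    open(path, "w").close()
    os.mkdir(os.path.join(d, "sub"))

    # Two different routes to the same file produce the same key.
    direct = file_key(path)
    roundabout = file_key(os.path.join(d, "sub", "..", "a.txt"))
    keys_match = direct == roundabout
```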

> => just the file being the same, and reading content is slow anyway).
> => I'd expect this function to be called EXTREMELY often so it should run
>
> Can you give us any more info on the broader use here? Perhaps there's a
> better way of doing this.

I figured I'd save everybody the details, but I'm sure some would be
curious.  Anybody following my posts recently on gentoo-dev would
probably have put two and two together.  I'm looking to implement a QA
check in the Gentoo package manager that detects when a package build
system accesses a file that is not part of a declared dependency.
These kinds of undeclared dependencies can cause problems down the
road and modern build systems tend to let them creep in (automagic
detection of dependencies and such - something binary distros handle
via tools like cowbuilder).

Gentoo already has a sandbox mechanism for package builds.  It
intercepts file-open system calls and blocks writes to files outside
a defined area.  I'd like to extend this to block reads of files a
package shouldn't be using.

The performance issues are obvious - when building a package there
will be thousands of file access attempts, and there are thousands of
legitimate files that could be accessed.  Any check has to be very
fast to avoid becoming a bottleneck.

> Hmmm... "milliseconds" to search a list of that length? First of all, the
> "list" must be stored in memory by the search program (index by inode, if
> you're using that as the definitive test for whether file system objects are
> 'identical'). The problem is that you're not really going to search the
> list--the search occurs within the filesystem:

Agreed, but there are some mitigating factors here:

1.  I can cache the device/inode numbers for everything on the list
I'm checking against, so I'm only stat-ing each entry once.  I don't
know if device and inode IDs are guaranteed to be stable for any
particular filesystem, but if they are I might even be able to cache
them across many runs.  The list to search would be stored in memory.
Getting it into memory in the first place would require I/O, but that
only happens once, and I suspect it would be pretty fast (certainly
fast compared to building a package).

2.  The only time I'd be checking a file is if the build system
attempted to open it.  So, any effort loading inodes into memory for
my check is going to save effort doing the same when the open attempt
is allowed.  The file read is still going to require seeking to the
actual extents so that will still take longer.

3.  All file accesses are going to be coming from lists elsewhere -
none of this requires systematic searching of trees, so as long as
dir_index is enabled (or native to the filesystem, as with btrfs),
loading the inodes should be pretty fast (I think ext4 now uses
btrees and such).

I guess the real performance constraint isn't that the check happens
quickly so much as that it adds no significant overhead to opening
the file itself.
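Putting the pieces together, the check amounts to one set lookup per open attempt.  Here's a rough Python sketch under the assumptions above (names like build_allowed and open_permitted are mine for illustration, not the sandbox's actual API):

```python
import os
import tempfile

def build_allowed(paths):
    """One-time pass: stat every legitimate file and remember its
    (device, inode) pair.  This is the in-memory list to search."""
    allowed = set()
    for p in paths:
        try:
            st = os.stat(p)
        except FileNotFoundError:
            continue  # declared but not installed; nothing to allow
        allowed.add((st.st_dev, st.st_ino))
    return allowed

def open_permitted(path, allowed):
    """Per-open check: one stat plus an O(1) set lookup.  The stat is
    largely work the kernel would do anyway to honor the open."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return False
    return (st.st_dev, st.st_ino) in allowed

with tempfile.TemporaryDirectory() as d:
    declared = os.path.join(d, "declared.h")
    stray = os.path.join(d, "undeclared.h")
    for p in (declared, stray):
        open(p, "w").close()

    allowed = build_allowed([declared])
    ok = open_permitted(declared, allowed)        # a declared dependency
    blocked = not open_permitted(stray, allowed)  # would be flagged
```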

BTW, if anybody wants to learn how modern filesystems work, I learned
quite a bit reading up on the btrfs documentation.  It is pretty
impressive how much thought goes into cutting down the number of seeks
needed to get to a file.  One of the design constraints for btrfs was
that it be designed to handle VERY large filesystems such that the
metadata would not be expected to fit in RAM (think petabytes to
exabytes of content, with gigabytes to terabytes just for the
metadata).  On most desktop setups quite a bit of your filesystem
data ends up being cached, which should make stat calls much faster.

Rich
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug