Rich Freeman on 3 Feb 2017 15:02:38 -0800


Re: [PLUG] /. Microsoft Introduces GVFS (Git Virtual File System)


On Fri, Feb 3, 2017 at 2:54 PM, Walt Mankowski <waltman@pobox.com> wrote:
>
> At my lab we use git on windows for some medium-sized projects and
> have never had any performance issues. The numbers he quotes are
> really insane. NTFS has a reputation as having good performance for
> most uses, so it's hard to imagine that's the problem.  One area where
> git doesn't scale on any platform is with binary files. Maybe that's
> really their problem? Visual Studio projects tend to have lots of
> binary files. Also, 3 million files (which I assume is versions of
> files) is an awful lot.
>

I've looked a bit at how git works as part of my work on the Gentoo
git migration.  Gentoo actually has a pretty high commit rate, since
all our package bumps go into one git repo (so stabilizing one package
on one arch, or bumping a package's version, causes a commit).  We
also have a lot of little files.  However, git works just fine at our
scale.

3 million files is a lot.  I suspect they also have a long commit
history.  Now, when you do a git clone you don't HAVE to clone the
entire history, and avoiding that saves you considerable disk space
and time.  What you do have to do, the way the standard client works,
is check out the entire tree at once, and I could see how that takes
a while.  Just doing a checkout of Linux on a hard drive isn't
terribly fast, and that is only about 56k files.  On an SSD, and with
a filesystem that handles small files well, the overhead would
obviously be less.
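
For example, to skip most of the history when cloning (the URL here
is just a placeholder, not their actual repo):

    $ git clone --depth=1 https://example.com/big-repo.git

Note that --depth only trims history; you still pay the full cost of
writing out every file in the working tree at checkout time.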

Git is very efficient overall, but having to check out everything in
the tree necessarily incurs the overhead of creating all those files.
(The stuff in the history, or in a repo without a checkout, is
typically packed, which improves performance and reduces space use.)
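
You can see the packing at work with stock commands (safe to run in
any repo):

    $ git count-objects -v   # loose objects vs. what's in packfiles
    $ git gc                 # repack loose objects into a packfile

A packed repo stores many objects delta-compressed in a handful of
files, which is far cheaper than millions of little loose files.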

The concept behind their solution isn't a bad one.  Reading only the
trees, without reading the blobs, significantly reduces the number of
IO operations needed to identify all the files in a commit, especially
if your trees are relatively large (i.e. a few levels of very large
directories as opposed to many levels of tiny directories).  The only
gotcha I see is that without actually reading the blobs you wouldn't
be able to display the correct file sizes, as these aren't stored in
the trees.
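
You can see exactly what a tree does and doesn't store with the
plumbing commands (run these in any repo):

    $ git ls-tree HEAD      # mode, type, blob hash, name -- no sizes
    $ git ls-tree -l HEAD   # -l adds sizes, but has to look up each blob

The tree only records each blob's hash, so reporting anything
size-related means touching the blob objects themselves.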

git itself is organized a bit like a filesystem, and indeed if you
understand git well you're well on your way to grokking COW
filesystems in particular.
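
The parallel is easy to see with cat-file: a commit points to a tree,
trees point to blobs and other trees (like directories and inodes),
and an unchanged subtree keeps the same hash across commits, so it is
shared rather than copied:

    $ git cat-file -p HEAD            # commit: tree hash + parent(s)
    $ git cat-file -p 'HEAD^{tree}'   # tree: a directory-style listing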

I'm sure you could also tweak the git client, or write your own
client, to do just partial checkouts, but I could see build systems
getting confused if they don't actually see the entirety of the
repository.  Their approach lets them see the whole repository without
much overhead, as long as none of the files are actually read.
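
For what it's worth, stock git already has a crude form of partial
checkout; something like this (the path is just an example) populates
only part of the working tree:

    $ git config core.sparseCheckout true
    $ echo 'src/' > .git/info/sparse-checkout
    $ git read-tree -mu HEAD

The catch is that everything outside the listed paths simply isn't
there, which is exactly what trips up build systems; GVFS instead
makes every file appear to be present.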

I wonder how it compares to gitfs on Linux?  I don't think gitfs
fetches blobs only on demand, though.

-- 
Rich
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug