Rich Freeman on 3 Feb 2017 15:02:38 -0800


Re: [PLUG] /. Microsoft Introduces GVFS (Git Virtual File System)

On Fri, Feb 3, 2017 at 2:54 PM, Walt Mankowski <> wrote:
> At my lab we use git on windows for some medium-sized projects and
> have never had any performance issues. The numbers he quotes are
> really insane. NTFS has a reputation for good performance for
> most uses, so it's hard to imagine that's the problem.  One area where
> git doesn't scale on any platform is with binary files. Maybe that's
> really their problem? Visual Studio projects tend to have lots of
> binary files. Also, 3 million files (which I assume is versions of
> files) is an awful lot.

I've looked a bit at how git works as part of my work on the Gentoo
git migration.  Gentoo actually has a pretty high commit rate, since
all our package bumps go into one git repo (so stabilizing one package
on one arch, or bumping the version of a package, each produces a
commit).  We also have a lot of little files.  However, git works just
fine at our scale.

3 million files is a lot.  I suspect they also have a long commit
history.  Now, when you do a git clone you don't HAVE to clone the
entire history, and avoiding that saves you considerable disk space
and time.  What you do have to do, the way the standard client works,
is check out the entire tree at once, and I could see how that takes a
while.  Just doing a checkout of Linux on a hard drive isn't terribly
fast, and that is only 56k files.  On an SSD, and with a filesystem
that handles small files well, the overhead would obviously be less.
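To illustrate the history part, a shallow clone skips all but the
newest commit; here's a minimal sketch against a throwaway local repo
(paths and names are made up for the example):

```shell
#!/bin/sh
# Sketch: shallow clone fetches only recent history, using a throwaway repo.
set -e
tmp=$(mktemp -d)
git init -q "$tmp/src"
git -C "$tmp/src" -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m "first"
git -C "$tmp/src" -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m "second"
# --depth 1 fetches only the newest commit, not the whole history.
git clone -q --depth 1 "file://$tmp/src" "$tmp/shallow"
git -C "$tmp/shallow" rev-list --count HEAD   # 1, not 2
```

Note that this trims history only; the full working tree still gets
checked out.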

Git is very efficient overall, but having to check out everything in
the tree necessarily incurs the overhead of creating all those files.
(The stuff in the history, or in a repo without a checkout, is
typically packed which improves performance and reduces space use.)
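You can see the loose-versus-packed distinction with
`git count-objects`; a small sketch in a throwaway repo:

```shell
#!/bin/sh
# Sketch: loose objects vs a pack, in a throwaway repo.
set -e
tmp=$(mktemp -d)
git init -q "$tmp/r"
cd "$tmp/r"
echo hello > f
git add f
git -c user.name=t -c user.email=t@example.com commit -q -m "add f"
# Freshly written objects are "loose": one zlib-compressed file each.
git count-objects -v | grep '^count:'     # 3 loose (blob, tree, commit)
# gc repacks them into a single delta-compressed .pack file.
git gc -q
git count-objects -v | grep '^in-pack:'
```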

The concept behind their solution isn't a bad one.  Reading the trees
only without reading the blobs significantly reduces the number of IO
operations to identify all the files in the commit, especially if your
trees are relatively large (i.e. a few levels of very large directories
as opposed to many levels of tiny directories).  The only gotcha I see
is that without actually reading the blobs you wouldn't be able to
display the correct file sizes, as these aren't stored in the trees.
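A quick way to see that tree entries carry no sizes is to compare
`git ls-tree` with `git cat-file -s`, which has to open the blob
itself; sketched in a throwaway repo:

```shell
#!/bin/sh
# Sketch: tree entries store no file sizes; throwaway repo.
set -e
tmp=$(mktemp -d)
git init -q "$tmp/r"
cd "$tmp/r"
printf 'twelve bytes' > f                  # exactly 12 bytes
git add f
git -c user.name=t -c user.email=t@example.com commit -q -m "add f"
# A tree entry is just mode, type, hash, and name -- no size field:
git ls-tree HEAD
# Getting the size means opening the blob object itself:
blob=$(git rev-parse HEAD:f)
git cat-file -s "$blob"                    # prints 12
```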

git itself is organized a bit like a filesystem, and indeed if you
understand git well you're well on your way to grokking copy-on-write
(COW) filesystems.
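For instance, the copy-on-write flavor shows up directly in the
object store: modifying one file produces a new blob, while unchanged
blobs keep the same hash and are shared between commits rather than
copied. A throwaway-repo sketch:

```shell
#!/bin/sh
# Sketch: unchanged blobs are shared between commits, COW-style.
set -e
tmp=$(mktemp -d)
git init -q "$tmp/r"
cd "$tmp/r"
echo a > a
echo b > b
git add .
git -c user.name=t -c user.email=t@example.com commit -q -m "one"
echo b2 > b
git -c user.name=t -c user.email=t@example.com commit -qam "two"
# The untouched file resolves to the same blob in both commits;
# only the modified file got a new object.
git rev-parse HEAD:a HEAD~1:a   # identical hashes
git rev-parse HEAD:b HEAD~1:b   # different hashes
```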

I'm sure you could also tweak the git client, or write your own, to
do partial checkouts, but I could see build systems getting confused
if they don't actually see the entirety of the repository.  This
approach lets them see the whole repository without much overhead, as
long as none of the files are actually read.
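For reference, stock git already has a partial-checkout mechanism of
sorts via `core.sparseCheckout`, though it only trims the working
tree; the full history and all blobs are still fetched. A sketch in a
throwaway repo:

```shell
#!/bin/sh
# Sketch: classic sparse checkout via core.sparseCheckout (throwaway repo).
set -e
tmp=$(mktemp -d)
git init -q "$tmp/r"
cd "$tmp/r"
mkdir -p src docs
echo code > src/main.c
echo text > docs/readme
git add .
git -c user.name=t -c user.email=t@example.com commit -q -m "two dirs"
# Limit the working tree to src/; the repo still holds everything.
git config core.sparseCheckout true
echo 'src/' > .git/info/sparse-checkout
git read-tree -mu HEAD
ls                               # only src/ is checked out now
```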

I wonder how it compares to gitfs on Linux?  I don't think gitfs
fetches blobs only on demand, though.

Philadelphia Linux Users Group