Rich Freeman on 28 Oct 2012 07:04:18 -0700



Re: Intro / Question


On Monday, October 8, 2012 7:56:38 PM UTC-4, Rich Freeman wrote:

If anybody wants to follow along I'm keeping everything here:
git://github.com/rich0/gitvalidate.git


I figured I'd give a little update on how this is going, focusing more on the functional aspects.  I ended up implementing this using Hadoop streaming with Python map and reduce scripts.  I'm using StarCluster with a custom AMI that includes some Python modules preinstalled.  I attach an EBS volume containing a copy of the git repo to each node.

Map takes in one or more csv rows and, for each, outputs either the same row if it is a blob or, if it is a tree, one row for each item in the next level of the tree (discarding the parent tree row).  I'm using pygit2 to read the git repository - spawning git was just wasting too much time.  My csv rows use base64 for anything containing line breaks to keep things simple.
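
A minimal sketch of the map step described above, not the actual script from the repository.  The row layout (path, timestamp, kind, hash) and the names encode_field, map_row and read_tree are assumptions for illustration.  read_tree stands in for whatever pygit2 lookup the real mapper does, and here it is just a dict so the example runs without a repository.

```python
import base64
import csv
import io


def encode_field(text):
    """Base64-encode a field so embedded line breaks cannot split a csv row."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_field(data):
    """Reverse encode_field."""
    return base64.b64decode(data.encode("ascii")).decode("utf-8")


def map_row(row, read_tree):
    """Yield the output rows for one input row.

    row is (encoded_path, timestamp, kind, object_hash); this layout is assumed.
    read_tree(object_hash) returns [(name, kind, object_hash), ...] for a tree.
    """
    enc_path, timestamp, kind, obj_hash = row
    if kind == "blob":
        # Blobs pass through unchanged.
        yield row
        return
    # Trees are replaced by their children; the parent row is discarded.
    parent = decode_field(enc_path)
    for name, child_kind, child_hash in read_tree(obj_hash):
        child_path = parent + "/" + name if parent else name
        yield (encode_field(child_path), timestamp, child_kind, child_hash)


def run_mapper(lines, read_tree):
    """Apply map_row to every csv line and return the csv output as a string."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(lines):
        for result in map_row(tuple(row), read_tree):
            writer.writerow(result)
    return out.getvalue()


# Hypothetical stand-in for the repository: tree hash -> entries.
FAKE_REPO = {"t1": [("a.txt", "blob", "b1"), ("sub", "tree", "t2")]}
```

Calling run_mapper([",100,tree,t1"], FAKE_REPO.__getitem__) expands the root tree into one row per entry.  Running the job again on that output expands "sub" one level further, which is how the traversal walks down the tree one level per pass.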

Reduce takes the rows for a single file/tree, sorts them by timestamp, and then drops consecutive duplicate hashes.  That means that I'm only traversing the next tree level for entries that change, which is typically only one per commit until you get to the bottom of the tree.  
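
A rough sketch of that reduce logic, under the assumption that rows arrive as (path, timestamp, hash) tuples already grouped by path, which is roughly what hadoop streaming provides after its sort phase.  The function names are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def drop_unchanged(history):
    """Sort (timestamp, hash) pairs by time and drop consecutive duplicate hashes."""
    kept = []
    previous_hash = None
    for timestamp, obj_hash in sorted(history, key=itemgetter(0)):
        if obj_hash != previous_hash:
            kept.append((timestamp, obj_hash))
        previous_hash = obj_hash
    return kept


def reduce_rows(rows):
    """Yield only the rows where a path's hash differs from its previous version.

    rows must already be grouped by path (the first field).
    """
    for path, group in groupby(rows, key=itemgetter(0)):
        history = [(timestamp, obj_hash) for _, timestamp, obj_hash in group]
        for timestamp, obj_hash in drop_unchanged(history):
            yield (path, timestamp, obj_hash)
```

Only consecutive duplicates are dropped, so a file that changes and later reverts to an earlier hash still produces a row for the revert.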

The only weakness I've detected in my actual algorithm is that it doesn't detect file deletions.  I capture all blobs/trees that are present, and then drop ones that haven't changed from the previous commit.  Since the commit that drops a file doesn't contain its blob to begin with, it never gets captured.  To detect deletions I'd probably need to do pairwise comparisons.  For now I plan to live with this.
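
One way the pairwise comparison could look, purely as a sketch and not something that is implemented: compare the set of paths in each commit against the set in the commit before it.  The snapshot structure and the name find_deletions are assumptions.

```python
def find_deletions(snapshots):
    """Return (commit_id, path) for each path present in one commit but not the next.

    snapshots is a list of (commit_id, set_of_paths) in commit order;
    this structure is assumed for illustration.
    """
    deletions = []
    for (_, before), (commit_id, after) in zip(snapshots, snapshots[1:]):
        for path in sorted(before - after):
            deletions.append((commit_id, path))
    return deletions
```

This needs the full path set of every commit, which is exactly the data the reduce step throws away when it drops unchanged entries, so it would have to run as a separate pass.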

I'm not parallelizing the cvs side currently, but that should be trivial to parallelize since each file has a completely independent history.  

As far as practical results go - I did actually spot a bug in the converter that is mangling file headers.  I also spotted some file revisions with odd rcs revision numbers getting dropped.  The amount of data transformation during cvs->git conversion is greater than I had expected, which makes actually comparing the data harder.

Oh, if anybody is aware of any decent visual diffing tools let me know. I've been using meld, but that tries to load entire files into RAM and they're way too big for that, and its character-level diffing doesn't work well if there is no whitespace in the files.

I post a bit more to the gentoo-scm list so feel free to follow that if you're interested.  Thanks for the suggestions that were provided here.

Rich