Aaron Feng on 30 Oct 2012 08:04:01 -0700



Re: Intro / Question


Nice job Rich!

StarCluster looks interesting; I should really check it out when I have time.
Just curious, how big is your cluster and how long does it take for
the job to finish?

Aaron

On Sun, Oct 28, 2012 at 10:04 AM, Rich Freeman <rich@thefreemanclan.net> wrote:
> On Monday, October 8, 2012 7:56:38 PM UTC-4, Rich Freeman wrote:
>>
>>
>> If anybody wants to follow along I'm keeping everything here:
>> git://github.com/rich0/gitvalidate.git
>>
>
> I figured I'd give a little update on how this is going, focusing more on
> the functional aspects.  I ended up implementing this using hadoop streaming
> with python map and reduce scripts.  I'm using starcluster with a custom ami
> that includes some python modules preinstalled.  I attach an ebs volume
> containing a copy of the git repo to each node.
>
> Map takes in one or more csv rows and for each outputs either the same row
> if it is a blob, or one row for each item in the next level of the tree if
> the input was a tree (discarding the parent tree row).  I'm using pygit2 to
> read the git repository - spawning git was just wasting too much time.  My
> csv rows use base64 for anything containing line breaks to keep things
> simple.
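
A minimal sketch of the map step as described above, not the actual
gitvalidate code. The column layout (kind, path, timestamp, hash) and the
list_tree callback are assumptions made for illustration; in the real
scripts the tree lookup would go through pygit2, which is replaced here by
a small in-memory table so the example runs on its own.

```python
def map_row(row, list_tree):
    """Expand one CSV row by one tree level.

    row       -- [kind, path, timestamp, sha]; this layout is assumed
    list_tree -- hypothetical callback: sha -> iterable of (kind, name, sha)
    """
    kind, path, timestamp, sha = row
    if kind == "blob":
        # Blobs pass through unchanged.
        yield list(row)
        return
    # Trees are replaced by their children; the parent row is discarded.
    for child_kind, name, child_sha in list_tree(sha):
        child_path = path + "/" + name if path else name
        yield [child_kind, child_path, timestamp, child_sha]


# Stand-in for the git repository (made-up hashes).
FAKE_TREES = {"t1": [("blob", "README", "b1"), ("tree", "src", "t2")]}


def fake_list_tree(sha):
    return FAKE_TREES[sha]


if __name__ == "__main__":
    for out in map_row(["tree", "", "100", "t1"], fake_list_tree):
        print(",".join(out))
```

Each child row inherits the timestamp of its parent row, so the reduce step
can still order the entries for a given path by commit time.
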
>
> Reduce takes the rows for a single file/tree, sorts them by timestamp, and
> then drops consecutive duplicate hashes.  That means that I'm only
> traversing the next tree level for entries that change, which is typically
> only one per commit until you get to the bottom of the tree.
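
The reduce step described above could look roughly like this. It is a
sketch under the assumption that each row is a (path, timestamp, hash)
tuple and that all rows passed in belong to the same path; the function
name is made up.

```python
from itertools import groupby
from operator import itemgetter


def drop_unchanged(rows):
    """Sort rows for one path by timestamp, then keep only the first row
    of each run of consecutive identical hashes.

    rows -- iterable of (path, timestamp, sha) tuples (assumed layout)
    """
    ordered = sorted(rows, key=itemgetter(1))
    # groupby only merges adjacent equal keys, so a hash that reappears
    # later (e.g. after a revert) is kept as a new change.
    return [next(run) for _, run in groupby(ordered, key=itemgetter(2))]


if __name__ == "__main__":
    sample = [
        ("src", 3, "h1"),
        ("src", 1, "h1"),
        ("src", 2, "h1"),
        ("src", 4, "h2"),
        ("src", 5, "h1"),
    ]
    for row in drop_unchanged(sample):
        print(row)
```

Only the surviving rows need to be expanded in the next map pass, which is
why most of the tree is never traversed again for a given commit.
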
>
> The only weakness I've detected in my actual algorithm is that it doesn't
> detect file deletions.  I capture all blobs/trees that are present, and then
> drop ones that haven't changed from the previous commit.  Since the commit
> that drops a file doesn't contain its blob to begin with, it never gets
> captured.  To detect deletions I'd probably need to do pairwise comparisons.
> For now I plan to live with this.
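
One way the pairwise comparison mentioned above might work, as a sketch
only: compare the full listing of each commit with the listing of the
commit before it and report paths that disappeared. The dict-of-paths
representation and the function name are assumptions, not part of the
existing scripts.

```python
def deleted_paths(prev_listing, curr_listing):
    """Return paths present in the previous commit but absent in the
    current one.

    Both arguments map path -> hash for a single commit (assumed shape).
    """
    return sorted(set(prev_listing) - set(curr_listing))


if __name__ == "__main__":
    before = {"README": "b1", "src/main.c": "b2", "src/old.c": "b3"}
    after = {"README": "b1", "src/main.c": "b4"}
    print(deleted_paths(before, after))
```

This needs the complete listing for both commits, so it would not benefit
from the pruning that the duplicate-dropping reduce step provides.
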
>
> I'm not parallelizing the cvs side currently, but that should be trivial to
> parallelize since each file has a completely independent history.
>
> As far as practical results go - I did actually spot a bug in the converter
> that is mangling file headers.  I also spotted some file revisions with odd
> rcs revision numbers getting dropped.  The amount of data transformation
> during cvs->git conversion is greater than I had expected, which makes
> actually comparing the data harder.
>
> Oh, if anybody is aware of any decent visual diffing tools let me know. I've
> been using meld, but that tries to load the entire files into RAM and
> they're way too big for that, and its character-level diffing doesn't work
> well if there is no whitespace in the files.
>
> I post a bit more to the gentoo-scm list, so feel free to follow that if
> you're interested.  Thanks for the suggestions that were provided here.
>
> Rich