Re: Intro / Question

Aaron Feng on 30 Oct 2012 08:04:01 -0700

From: Aaron Feng <aaron.feng@gmail.com>
To: philly-lambda@googlegroups.com
Subject: Re: Intro / Question
Date: Tue, 30 Oct 2012 11:03:09 -0400
List-archive: <http://groups.google.com/group/philly-lambda?hl=en_US>
Mailing-list: list philly-lambda@googlegroups.com; contact philly-lambda+owners@googlegroups.com
Reply-to: philly-lambda@googlegroups.com
Sender: philly-lambda@googlegroups.com

Nice job Rich!

StarCluster looks interesting; I should really check it out when I have time.
Just curious, how big is your cluster and how long does it take for
the job to finish?

Aaron

On Sun, Oct 28, 2012 at 10:04 AM, Rich Freeman <rich@thefreemanclan.net> wrote:
> On Monday, October 8, 2012 7:56:38 PM UTC-4, Rich Freeman wrote:
>>
>>
>> If anybody wants to follow along I'm keeping everything here:
>> git://github.com/rich0/gitvalidate.git
>>
>
> I figured I'd give a little update on how this is going, focusing more on
> the functional aspects.  I ended up implementing this using hadoop streaming
> with python map and reduce scripts.  I'm using starcluster with a custom ami
> that includes some python modules preinstalled.  I attach an ebs volume
> containing a copy of the git repo to each node.
>
> Map takes in one or more csv rows and for each outputs either the same row
> if it is a blob, or one row for each item in the next level of the tree if
> the input was a tree (discarding the parent tree row).  I'm using pygit2 to
> read the git repository - spawning git was just wasting too much time.  My
> csv rows use base64 for anything containing line breaks to keep things
> simple.
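
[A minimal sketch of the map step described above, not the actual script from the repository. The row layout (path, timestamp, kind, hash), the function names, and the in-memory FAKE_TREES table are all assumptions made for illustration; the original reads trees with pygit2, which is replaced here by a plain callable so the example runs on its own.]

```python
# Hypothetical row layout (an assumption): (path, timestamp, kind, hash)

def map_rows(rows, read_tree):
    """Pass blob rows through unchanged; expand tree rows by one level.

    read_tree stands in for the repository lookup (pygit2 in the post).
    It takes a tree hash and returns a list of (name, kind, hash) tuples.
    """
    for path, timestamp, kind, obj_hash in rows:
        if kind == "blob":
            yield (path, timestamp, kind, obj_hash)
        else:
            # The parent tree row is discarded; only its children are emitted.
            for name, child_kind, child_hash in read_tree(obj_hash):
                child_path = f"{path}/{name}" if path else name
                yield (child_path, timestamp, child_kind, child_hash)


# Toy in-memory "repository", used only for this illustration.
FAKE_TREES = {
    "t1": [("README", "blob", "b1"), ("src", "tree", "t2")],
    "t2": [("main.py", "blob", "b2")],
}


def fake_read_tree(tree_hash):
    return FAKE_TREES[tree_hash]


root = [("", "100", "tree", "t1")]
level1 = list(map_rows(root, fake_read_tree))    # root tree expanded once
level2 = list(map_rows(level1, fake_read_tree))  # "src" expanded, README kept
```

Each pass descends one level, so repeated map/reduce rounds would walk the tree from the root down to the blobs.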
>
> Reduce takes the rows for a single file/tree, sorts them by timestamp, and
> then drops consecutive duplicate hashes.  That means that I'm only
> traversing the next tree level for entries that change, which is typically
> only one per commit until you get to the bottom of the tree.
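
[A rough sketch of the reduce step described above, under the assumption that each row is a (path, timestamp, kind, hash) tuple. That layout and the function name are hypothetical. A real Hadoop streaming reducer would read key-sorted lines from stdin; this version sorts in memory to stay self-contained.]

```python
from itertools import groupby
from operator import itemgetter


def reduce_rows(rows):
    """For each path, sort by timestamp and drop consecutive duplicate hashes."""
    kept = []
    # Group by path (field 0), ordered by numeric timestamp (field 1).
    ordered = sorted(rows, key=lambda r: (r[0], int(r[1])))
    for _path, group in groupby(ordered, key=itemgetter(0)):
        previous_hash = None
        for row in group:
            if row[3] != previous_hash:
                # The hash changed relative to the previous commit: keep it.
                kept.append(row)
                previous_hash = row[3]
    return kept


example_rows = [
    ("src", "300", "tree", "t2"),
    ("src", "100", "tree", "t2"),
    ("src", "200", "tree", "t3"),
    ("README", "100", "blob", "b1"),
    ("README", "200", "blob", "b1"),
    ("src", "400", "tree", "t2"),
]
changed = reduce_rows(example_rows)
```

Note that only consecutive duplicates are dropped: a hash that changes and later reverts (t2, t3, t2 here) is kept each time it reappears.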
>
> The only weakness I've detected in my actual algorithm is that it doesn't
> detect file deletions.  I capture all blobs/trees that are present, and then
> drop ones that haven't changed from the previous commit.  Since the commit
> that drops a file doesn't contain its blob to begin with, it never gets
> captured.  To detect deletions I'd probably need to do pairwise comparisons.
> For now I plan to live with this.
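
[One possible reading of the pairwise comparison mentioned above, offered only as a sketch: compare the entries of the same tree in two consecutive commits and report names that disappeared. The dict-of-name-to-hash representation and the function name are assumptions, not anything from the original scripts.]

```python
def deleted_entries(previous_entries, current_entries):
    """Return the names present in the previous tree but missing from the current one.

    Both arguments map entry name -> object hash for the same tree path
    in two consecutive commits (a hypothetical representation).
    """
    return sorted(set(previous_entries) - set(current_entries))


before = {"README": "b1", "old.txt": "b7", "src": "t2"}
after = {"README": "b1", "src": "t3"}
removed = deleted_entries(before, after)
```

A changed hash (src here) is not a deletion; only names absent from the later commit are reported.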
>
> I'm not parallelizing the cvs side currently, but that should be trivial to
> parallelize since each file has a completely independent history.
>
> As far as practical results go - I did actually spot a bug in the converter
> that is mangling file headers.  I also spotted some file revisions with odd
> rcs revision numbers getting dropped.  The amount of data transformation
> during cvs->git conversion is greater than I had expected, which makes
> actually comparing the data harder.
>
> Oh, if anybody is aware of any decent visual diffing tools let me know. I've
> been using meld, but that tries to load the entire files into RAM and
> they're way too big for that, and its character-level diffing doesn't work
> well if there is no whitespace in the files.
>
> I post a bit more to the gentoo-scm list so feel free to follow that if you're
> interested.  Thanks for the suggestions that were provided here.
>
> Rich
