Re: Intro / Question

Aaron Mansheim on 29 Oct 2012 21:20:20 -0700

From: Aaron Mansheim <minopret@gmail.com>
To: philly-lambda@googlegroups.com
Subject: Re: Intro / Question
Date: Tue, 30 Oct 2012 00:20:13 -0400
List-archive: <http://groups.google.com/group/philly-lambda?hl=en_US>
Mailing-list: list philly-lambda@googlegroups.com; contact philly-lambda+owners@googlegroups.com
Reply-to: philly-lambda@googlegroups.com
Sender: philly-lambda@googlegroups.com

I wonder whether vimdiff works better than meld on large files.
It displays side-by-side, coloring the lines and words, and
you probably have it already.

Aaron Mansheim


On Sun, 2012-10-28 at 07:04 -0700, Rich Freeman wrote:
> On Monday, October 8, 2012 7:56:38 PM UTC-4, Rich Freeman wrote:
>         
>         If anybody wants to follow along I'm keeping everything here: 
>         git://github.com/rich0/gitvalidate.git 
>         
> 
> 
> I figured I'd give a little update on how this is going, focusing more
> on the functional aspects.  I ended up implementing this using hadoop
> streaming with python map and reduce scripts.  I'm using starcluster
> with a custom ami that includes some python modules preinstalled.  I
> attach an ebs volume containing a copy of the git repo to each node.
> 
> 
> Map takes in one or more csv rows and, for each, outputs either the
> same row if it is a blob or one row for each item in the next level of
> the tree if the input is a tree (discarding the parent tree row).
> I'm using pygit2 to read the git repository - spawning git was just
> wasting too much time.  My csv rows use base64 for anything containing
> line breaks to keep things simple.
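
A minimal sketch of the map step as described above, for anyone following along. The row layout (timestamp, path, kind, hash) and the `list_tree` callback are assumptions made for illustration, not the layout used in gitvalidate; in the real script the lookup would presumably be backed by pygit2, while here a toy dict stands in for the repository.

```python
def map_rows(rows, list_tree):
    """Pass blob rows through; replace each tree row with its children.

    rows      -- iterable of (timestamp, path, kind, obj_hash) tuples
                 (hypothetical layout, kind is "blob" or "tree")
    list_tree -- hypothetical callback: tree hash -> iterable of
                 (name, kind, obj_hash) for the next level down
    """
    for timestamp, path, kind, obj_hash in rows:
        if kind == "blob":
            # Blobs are already leaves, so the row is emitted unchanged.
            yield (timestamp, path, kind, obj_hash)
        else:
            # The parent tree row is discarded; only its entries are emitted.
            for name, child_kind, child_hash in list_tree(obj_hash):
                child_path = f"{path}/{name}" if path else name
                yield (timestamp, child_path, child_kind, child_hash)


# Toy stand-in for the repository: tree hash -> entries one level down.
TOY_TREES = {
    "t1": [("README", "blob", "b1"), ("src", "tree", "t2")],
    "t2": [("main.c", "blob", "b2")],
}


def toy_list_tree(tree_hash):
    return TOY_TREES[tree_hash]
```

Each pass only descends one level, so a tree row such as `src` comes back out as a tree row and is expanded on the next pass.
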
> 
> 
> Reduce takes the rows for a single file/tree, sorts them by timestamp,
> and then drops consecutive duplicate hashes.  That means that I'm only
> traversing the next tree level for entries that change, which is
> typically only one per commit until you get to the bottom of the
> tree.  
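
The reduce step described above can be sketched roughly as follows. This assumes rows are (timestamp, path, kind, hash) tuples that all belong to one path, with integer-like timestamps; that layout is a guess for illustration, not taken from the gitvalidate code.

```python
from itertools import groupby


def reduce_rows(rows):
    """Sort one path's rows by timestamp, then drop consecutive repeats
    of the same hash so only the rows where the content changed remain.

    rows -- iterable of (timestamp, path, kind, obj_hash) tuples for a
            single path (hypothetical layout)
    """
    ordered = sorted(rows, key=lambda row: int(row[0]))
    # groupby() groups *consecutive* rows with equal hashes; keeping the
    # first row of each run keeps the earliest commit with that content.
    return [next(run) for _, run in groupby(ordered, key=lambda row: row[3])]
```

Note that a hash which reappears later, after a different hash in between, is kept again, since only consecutive duplicates are dropped.
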
> 
> 
> The only weakness I've detected in my actual algorithm is that it
> doesn't detect file deletions.  I capture all blobs/trees that are
> present, and then drop ones that haven't changed from the previous commit.
> Since the commit that drops a file doesn't contain its blob to begin
> with, it never gets captured.  To detect deletions I'd probably need
> to do pairwise comparisons.  For now I plan to live with this.
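
One possible reading of "pairwise comparisons", as a rough sketch only: compare the entry names of consecutive versions of the same tree and report names that disappear. The `versions` input shape (commit id plus a set of entry names, in commit order) is an assumption made for illustration; the message does not say how this would actually be done.

```python
def deleted_entries(versions):
    """Return (commit_id, name) for every entry that was present in one
    version of a tree and absent from the next.

    versions -- list of (commit_id, set_of_entry_names) in commit order
                (hypothetical input shape)
    """
    deletions = []
    # zip() pairs each version with its successor: (v0, v1), (v1, v2), ...
    for (_, before), (commit_id, after) in zip(versions, versions[1:]):
        for name in sorted(before - after):
            deletions.append((commit_id, name))
    return deletions
```

The deletion is attributed to the later commit of each pair, which is the commit that no longer contains the entry.
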
> 
> 
> I'm not parallelizing the cvs side currently, but that should be
> trivial to parallelize since each file has a completely independent
> history.  
> 
> 
> As far as practical results go - I did actually spot a bug in the
> converter that is mangling file headers.  I also spotted some file
> revisions with odd rcs revision numbers getting dropped.  The amount
> of data transformation during cvs->git conversion is greater than I
> had expected, which makes actually comparing the data harder.
> 
> 
> Oh, if anybody is aware of any decent visual diffing tools let me
> know. I've been using meld, but that tries to load the entire files
> into RAM and they're way too big for that, and its character-level
> diffing doesn't work well if there is no whitespace in the files.
> 
> 
> I post a bit more to the gentoo-scm list so feel free to follow that if
> you're interested.  Thanks for the suggestions that were provided
> here.
> 
> 
> Rich
