Aaron Mansheim on 29 Oct 2012 21:20:20 -0700



Re: Intro / Question


I wonder whether vimdiff works better than meld on large files.
It displays side-by-side, coloring the lines and words, and
you probably have it already.

Aaron Mansheim


On Sun, 2012-10-28 at 07:04 -0700, Rich Freeman wrote:
> On Monday, October 8, 2012 7:56:38 PM UTC-4, Rich Freeman wrote:
>         
>         If anybody wants to follow along I'm keeping everything here: 
>         git://github.com/rich0/gitvalidate.git 
>         
> 
> 
> I figured I'd give a little update on how this is going, focusing more
> on the functional aspects.  I ended up implementing this using hadoop
> streaming with python map and reduce scripts.  I'm using starcluster
> with a custom ami that includes some python modules preinstalled.  I
> attach an ebs volume containing a copy of the git repo to each node.
> 
> 
> Map takes in one or more csv rows and for each outputs either the same
> row if it is a blob, or one row for each item in the next level of
> the tree if the input was a tree (discarding the parent tree row).
> I'm using pygit2 to read the git repository - spawning git was just
> wasting too much time.  My csv rows use base64 for anything containing
> line breaks to keep things simple.
> 
> 
> Reduce takes the rows for a single file/tree, sorts them by timestamp,
> and then drops consecutive duplicate hashes.  That means that I'm only
> traversing the next tree level for entries that change, which is
> typically only one per commit until you get to the bottom of the
> tree.  
> 
> 
> The only weakness I've detected in my actual algorithm is that it doesn't
> detect file deletions.  I capture all blobs/trees that are present,
> and then drop ones that haven't changed from the previous commit.
> Since the commit that drops a file doesn't contain its blob to begin
> with, it never gets captured.  To detect deletions I'd probably need
> to do pairwise comparisons.  For now I plan to live with this.
> 
> 
> I'm not parallelizing the cvs side currently, but that should be
> trivial to parallelize since each file has a completely independent
> history.  
> 
> 
> As far as practical results go - I did actually spot a bug in the
> converter that is mangling file headers.  I also spotted some file
> revisions with odd rcs revision numbers getting dropped.  The amount
> of data transformation during cvs->git conversion is greater than I
> had expected, which makes actually comparing the data harder.
> 
> 
> Oh, if anybody is aware of any decent visual diffing tools let me
> know. I've been using meld, but that tries to load the entire files
> into RAM and they're way too big for that, and its character-level
> diffing doesn't work well if there is no whitespace in the files.
> 
> 
> I post a bit more to the gentoo-scm list so feel free to follow that if
> you're interested.  Thanks for the suggestions that were provided
> here.
> 
> 
> Rich
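
For anyone following along, the reduce step described above (sort each
file's rows by timestamp, then drop consecutive duplicate hashes) might
look roughly like the sketch below. This is not taken from gitvalidate:
the three-column path,timestamp,hash layout and the function names are
assumptions made for illustration, and the real rows presumably carry
more fields.

```python
import csv
from itertools import groupby
from operator import itemgetter

# Hypothetical column layout; the real gitvalidate rows may differ.
FIELDS = ["path", "timestamp", "hash"]


def drop_consecutive_duplicates(rows):
    """Sort one path's rows by timestamp and keep only rows whose hash changed."""
    kept = []
    previous_hash = None
    for row in sorted(rows, key=lambda r: int(r["timestamp"])):
        if row["hash"] != previous_hash:
            kept.append(row)
        previous_hash = row["hash"]
    return kept


def reduce_rows(lines):
    """Group CSV lines by path and de-duplicate each group independently."""
    reader = csv.DictReader(lines, fieldnames=FIELDS)
    # Hadoop streaming would normally deliver rows already grouped by key;
    # sorting here just keeps the sketch self-contained.
    rows = sorted(reader, key=itemgetter("path"))
    result = []
    for _, group in groupby(rows, key=itemgetter("path")):
        result.extend(drop_consecutive_duplicates(group))
    return result
```

Note that a hash which reappears after an intervening change is kept,
since only consecutive duplicates are dropped. As the quoted message
says, a deleted file simply stops producing rows, so this step alone
cannot see deletions.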