Aaron Mansheim on 29 Oct 2012 21:20:20 -0700 |
Re: Intro / Question |
I wonder whether vimdiff works better than meld on large files. It
displays the files side by side, coloring the changed lines and words,
and you probably have it already.

Aaron Mansheim

On Sun, 2012-10-28 at 07:04 -0700, Rich Freeman wrote:
> On Monday, October 8, 2012 7:56:38 PM UTC-4, Rich Freeman wrote:
> > If anybody wants to follow along I'm keeping everything here:
> > git://github.com/rich0/gitvalidate.git
>
> I figured I'd give a little update on how this is going, focusing more
> on the functional aspects. I ended up implementing this using hadoop
> streaming with python map and reduce scripts. I'm using starcluster
> with a custom ami that includes some python modules preinstalled. I
> attach an ebs volume containing a copy of the git repo to each node.
>
> Map takes in one or more csv rows and for each outputs either the same
> row if it is a blob, or one row for each item in the next level of
> the tree if the input was a tree (discarding the parent tree row).
> I'm using pygit2 to read the git repository - spawning git was just
> wasting too much time. My csv rows use base64 for anything containing
> line breaks to keep things simple.
>
> Reduce takes the rows for a single file/tree, sorts them by timestamp,
> and then drops consecutive duplicate hashes. That means that I'm only
> traversing the next tree level for entries that change, which is
> typically only one per commit until you get to the bottom of the
> tree.
>
> The only weakness I've detected in my actual algorithm is that it
> doesn't detect file deletions. I capture all blobs/trees that are
> present, and then drop ones that haven't changed from the previous
> commit. Since the commit that drops a file doesn't contain its blob to
> begin with, it never gets captured. To detect deletions I'd probably
> need to do pairwise comparisons. For now I plan to live with this.
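
For what it's worth, here is a minimal sketch of how I read the reduce
step and the pairwise comparison mentioned above. It is not taken from
gitvalidate: the function names, the (timestamp, hash) row layout, and
the path-to-hash mappings are all my own assumptions for illustration.

```python
from itertools import groupby
from operator import itemgetter


def drop_consecutive_duplicates(rows):
    """Reduce-style pass over the rows for one path (hypothetical layout).

    rows: iterable of (timestamp, object_hash) pairs.
    Returns the rows sorted by timestamp, keeping only the first row of
    each run of identical hashes.
    """
    ordered = sorted(rows, key=itemgetter(0))
    # groupby only merges *adjacent* equal keys, so a hash that comes
    # back after an intervening change is kept as a new entry.
    return [next(run) for _, run in groupby(ordered, key=itemgetter(1))]


def deleted_paths(previous, current):
    """Pairwise comparison of two commits (hypothetical helper).

    previous, current: mappings of path -> object_hash for one tree level.
    Returns the paths present in the earlier commit but not the later one.
    """
    return set(previous) - set(current)
```

The second function is the part the current approach skips: a deleted
file shows up only as an absence, so it takes two adjacent snapshots to
see it.
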
>
> I'm not parallelizing the cvs side currently, but that should be
> trivial to parallelize since each file has a completely independent
> history.
>
> As far as practical results go - I did actually spot a bug in the
> converter that is mangling file headers. I also spotted some file
> revisions with odd rcs revision numbers getting dropped. The amount
> of data transformation during cvs->git conversion is greater than I
> had expected, which makes actually comparing the data harder.
>
> Oh, if anybody is aware of any decent visual diffing tools let me
> know. I've been using meld, but that tries to load the entire files
> into RAM and they're way too big for that, and its character-level
> diffing doesn't work well if there is no whitespace in the files.
>
> I post a bit more to the gentoo-scm list so feel free to follow that
> if you're interested. Thanks for the suggestions that were provided
> here.
>
> Rich