Aaron Feng on 30 Oct 2012 08:04:01 -0700 |
Re: Intro / Question |
Nice job, Rich! StarCluster looks interesting; I should really check it out
when I have time. Just curious, how big is your cluster, and how long does it
take for the job to finish?

Aaron

On Sun, Oct 28, 2012 at 10:04 AM, Rich Freeman <rich@thefreemanclan.net> wrote:
> On Monday, October 8, 2012 7:56:38 PM UTC-4, Rich Freeman wrote:
>>
>> If anybody wants to follow along I'm keeping everything here:
>> git://github.com/rich0/gitvalidate.git
>>
>
> I figured I'd give a little update on how this is going, focusing more on
> the functional aspects. I ended up implementing this using hadoop streaming
> with python map and reduce scripts. I'm using starcluster with a custom ami
> that includes some python modules preinstalled. I attach an ebs volume
> containing a copy of the git repo to each node.
>
> Map takes in one or more csv rows and for each outputs either the same row
> if it is a blob, or one row for each item in the next level of the tree if
> the input was a tree (discarding the parent tree row). I'm using pygit2 to
> read the git repository - spawning git was just wasting too much time. My
> csv rows use base64 for anything containing line breaks to keep things
> simple.
>
> Reduce takes the rows for a single file/tree, sorts them by timestamp, and
> then drops consecutive duplicate hashes. That means that I'm only
> traversing the next tree level for entries that change, which is typically
> only one per commit until you get to the bottom of the tree.
>
> The only weakness I've detected in my actual algorithm is that it doesn't
> detect file deletions. I capture all blobs/trees that are present, and then
> drop ones that haven't changed from the previous commit. Since the commit
> that drops a file doesn't contain its blob to begin with, it never gets
> captured. To detect deletions I'd probably need to do pairwise comparisons.
> For now I plan to live with this.
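
The pairwise comparison mentioned for deletions could look roughly like the
sketch below. It is only an illustration, not code from gitvalidate: it
assumes each commit has already been flattened into a dict of path to hash,
and the names deleted_paths and deletions_over_history are made up here.

```python
def deleted_paths(previous, current):
    """Return paths present in the previous snapshot but missing from the current one."""
    return sorted(set(previous) - set(current))


def deletions_over_history(snapshots):
    """Compare each snapshot with its predecessor.

    snapshots is a list of (commit_id, {path: hash}) pairs, oldest first.
    This input shape is an assumption made for the example.
    """
    found = []
    for (_, before), (commit_id, after) in zip(snapshots, snapshots[1:]):
        for path in deleted_paths(before, after):
            # The deletion is attributed to the commit where the path vanished.
            found.append((commit_id, path))
    return found
```

Holding a full path-to-hash dict per commit is the simple way to express this;
it does not try to fit into the map/reduce row format described above.
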
>
> I'm not parallelizing the cvs side currently, but that should be trivial to
> parallelize since each file has a completely independent history.
>
> As far as practical results go - I did actually spot a bug in the converter
> that is mangling file headers. I also spotted some file revisions with odd
> rcs revision numbers getting dropped. The amount of data transformation
> during cvs->git conversion is greater than I had expected, which makes
> actually comparing the data harder.
>
> Oh, if anybody is aware of any decent visual diffing tools let me know. I've
> been using meld, but that tries to load entire files into RAM and
> they're way too big for that, and its character-level diffing doesn't work
> well if there is no whitespace in the files.
>
> I post a bit more to the gentoo-scm list so feel free to follow that if
> you're interested. Thanks for the suggestions that were provided here.
>
> Rich
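
For anyone following along, the reduce step described in the quoted message
(sort the rows for one file/tree by timestamp, then drop consecutive duplicate
hashes) can be sketched in a few lines of Python. This is a guess at the shape
of it, not the actual gitvalidate reducer: the (path, timestamp, hash) tuple
layout and the name drop_unchanged are assumptions, and the real scripts read
csv rows from hadoop streaming rather than an in-memory list.

```python
from itertools import groupby
from operator import itemgetter


def drop_unchanged(rows):
    """Keep only the rows where a path's hash differs from its previous row.

    rows is an iterable of (path, timestamp, object_hash) tuples; this layout
    is assumed for the example.
    """
    kept = []
    # Sort by path, then timestamp, so each path's history is contiguous.
    ordered = sorted(rows, key=itemgetter(0, 1))
    for path, history in groupby(ordered, key=itemgetter(0)):
        previous_hash = None
        for _, timestamp, object_hash in history:
            if object_hash != previous_hash:
                kept.append((path, timestamp, object_hash))
            previous_hash = object_hash
    return kept
```

Only consecutive repeats are dropped, so a file that changes and later reverts
to an earlier hash still shows up as two separate changes.
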