Rich Freeman on 8 Oct 2012 16:56:42 -0700



Re: Intro / Question


On Mon, Oct 8, 2012 at 7:32 PM, Dustin Getz <dustin.getz@gmail.com> wrote:
>> independently validate that a git repository and a cvs repository are
>> “identical.”
>
> is it sufficient to validate the end goal, and not the intermediate history?
> (which means you can just export the final state, hash the exports and
> compare.) I ask because...

Nope.  Validating the end state would indeed be easy - just check out
CVS, check out git, and do a recursive compare - but I want the
intermediate history validated too.
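
Roughly this, say, in Python (the paths here are made up, and note
that dircmp compares by stat by default - a byte-for-byte pass would
use filecmp.cmp with shallow=False):

    import filecmp

    # Hypothetical checkouts of the same tree from each system.
    top = filecmp.dircmp('cvs_tree', 'git_tree', ignore=['CVS', '.git'])

    def report(dc):
        # Recursively flag files that differ or exist on one side only.
        for name in dc.diff_files:
            print('differs:', dc.left, name)
        for name in dc.left_only + dc.right_only:
            print('only on one side:', dc.left, dc.right, name)
        for sub in dc.subdirs.values():
            report(sub)

    report(top)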

>
>> Cvs basically stores per-file history anyway.  I’d like to use the git and
>> cvs executables to do any reading of the repositories to ensure that there
>> are no errors in “interpretation.”
>
> this is the step which may give you trouble, due to the impedance mismatch
> you mention: CVS commits are per-file, git commits are per repo. I'm not
> sure how history import tools handle this, but I think you will need to
> understand it in order to validate it, because it seems to me that there are
> several perfectly valid ways to convert cvs history into git history.

Definitely the case.  I believe the conversion tool basically just
creates a commit for every file change it finds (as if you had
committed one file at a time).  It might have some logic to try to
find identical timestamps/authors and match them together.  That's why
I want to break apart the commits in git so that we're just looking at
a per-file history.  What does help is that Gentoo has a purely
linear history with no branches (granted, I don't think most of the
really convoluted graphs you can create with git are possible in CVS
anyway).
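
To illustrate the matching side of it, I'd guess the logic is morally
equivalent to this (my sketch, not the converter's actual code):

    from collections import defaultdict

    # Per-file CVS records: (path, timestamp, author, message).
    records = [
        ('foo/ChangeLog', 1349739162, 'rich0', 'fix bug'),
        ('foo/foo-1.0.ebuild', 1349739162, 'rich0', 'fix bug'),
    ]

    # File changes sharing timestamp/author/message presumably get
    # merged into one git-style commit; everything else stays alone.
    changesets = defaultdict(list)
    for path, ts, author, msg in records:
        changesets[(ts, author, msg)].append(path)

    for (ts, author, msg), paths in changesets.items():
        print(ts, author, paths)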

I've made some progress with this, but am running into some
limitations with my first design.  The obvious one is CPU, but my
original design, which just does everything in Python lists, is also
gobbling quite a bit of RAM - it will never scale to the 1.2M commits
in the full repository.  However, if I just pick the most recent 1000
commits it works just fine and gives me a dump.  At one point I even
used lambda expressions, though I had to ditch those when the
multiprocessing module complained that functions defined within
functions are not picklable.  All of this has been a great intro to
Python, though, and I find myself fighting the language less as I go
along.
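
For anyone who trips over the same error: multiprocessing has to
pickle whatever function it ships to the workers, and only
module-level functions pickle - lambdas and nested functions don't:

    import multiprocessing

    def double(x):
        # Module-level, so pickle can find it by name.
        return 2 * x

    if __name__ == '__main__':
        pool = multiprocessing.Pool(4)
        print(pool.map(double, range(8)))
        # pool.map(lambda x: 2 * x, range(8)) would instead die
        # with a PicklingError, for the reason above.
        pool.close()
        pool.join()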

So, now my strategy is to move more in the direction that MBL
proposed - though I'll probably try Dumbo before I just go with
straight Hadoop.  If I run it using StarCluster it seems like it
should be easy to install packages on the workers and run a shell
script on the cluster to mount EBS volumes from a snapshot containing
the git repository - with vanilla EC2 that might be more of a pain.
I have a Python script that will walk the commit history in 5 minutes
and dump it to a CSV file (with some fields in base64 to get around
newline issues), which Hadoop should be able to work with easily.
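
The dump itself is trivial - something along these lines (the field
layout here is illustrative, not exactly what the script emits):

    import base64
    import csv

    def b64(s):
        # base64 keeps embedded newlines out of the line-oriented
        # format that Hadoop streaming wants.
        return base64.b64encode(s.encode('utf-8')).decode('ascii')

    # One row per file change: sha, unix time, author, path, message.
    row = ['d1f2e3', 1349739162, b64('rich0'),
           'app-misc/foo/ChangeLog', b64('fix bug\nlonger notes')]

    with open('commits.csv', 'w') as f:
        csv.writer(f).writerow(row)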

I also gave up on directly invoking git and instead am using pygit2.
Invoking git worked (aside from some Unicode issues that I didn't
bother to solve), but there was way too much process overhead, and
all the screen-scraping of its output was a pain and bound to fail at
some point.  I'll just cross my fingers that pygit2 has no bugs
(though we'd need an identical bug in the converter and the validator
to cause a real problem).
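
The pygit2 walk that feeds the dump looks more or less like this
(untested simplification, assuming a reasonably current pygit2; root
commits are skipped for brevity):

    import pygit2

    repo = pygit2.Repository('/path/to/gentoo.git')  # path made up

    # Walk the linear history, emitting one record per file touched,
    # which turns repo-wide git commits into CVS-style per-file ones.
    for commit in repo.walk(repo.head.target,
                            pygit2.GIT_SORT_TOPOLOGICAL):
        if not commit.parents:
            continue  # skip the root commit
        diff = commit.parents[0].tree.diff_to_tree(commit.tree)
        for patch in diff:
            print(commit.commit_time, commit.author.name,
                  patch.delta.new_file.path)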

My reduce algorithm definitely works well - the first pass drops
probably 99%+ of the records before they enter the next one.
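
In spirit that first pass is just this (simplified - the real record
format is in the repo below):

    def first_pass(key, sides):
        # key identifies one file revision; sides holds what CVS and
        # git each said about it.  Agreement validates the record and
        # drops it; anything else survives to the next stage.
        if len(sides) == 2 and sides[0] == sides[1]:
            return []
        return [(key, sides)]

    print(first_pass('ChangeLog:1.5', ['abc123', 'abc123']))  # -> []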

If anybody wants to follow along I'm keeping everything here:
git://github.com/rich0/gitvalidate.git

Rich