Rich Freeman on 5 Oct 2012 03:25:07 -0700



Intro / Question


Hi, first post to the group and I've yet to attend a meeting, so I'll start with an intro.  My name is Rich Freeman and I've been attending PLUG for a while now and learned of this group from there.  I contribute to a few FOSS projects, most notably the Gentoo distribution, where I'm a trustee.

I've been working with computers since about 2nd grade, but my formal education is in the sciences. I don't code that much these days, and have never written in a functional programming language, but have been on the lookout for a project to start, and I think I might have found one.

Having some background in student journalism I outlined this email in what I think is going to be a reader-friendly format...

How You Can Help Me
I would like guidance on which functional language to use, and on how I might run it on a readily available cluster solution such as EC2.  I'm pretty sure I can figure this all out on my own, but starting off in the right direction will save a lot of rework later.  I think the problem can be addressed with functional programming, but it has a bit of a "gotcha" that might create problems with some tools.

The End Goal
The problem to be solved is this:  Gentoo Linux is converting from cvs to git, and has had problems with things like text encoding in existing conversion tools, so would like a way to independently validate that a git repository and a cvs repository are "identical."  The problem is greatly simplified by the fact that these repositories do not use branching and have a completely linear history.

Solution Approach
My proposed solution is to create a pair of programs.  They will take either a git or cvs repository and produce a sorted list containing these fields:
path/file
timestamp
hash
committer
 
After running, the files will be compared; if they're identical, the repositories are (insofar as we care).  CVS basically stores per-file history anyway.  I'd like to use the git and cvs executables to do any reading of the repositories, to ensure that there are no errors in "interpretation."
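To make the comparison step concrete, here is a minimal sketch in Python (I haven't picked a language yet, so treat this as pseudocode; the record format, tab separator, and all names/hashes are just placeholder assumptions):

```python
# Hypothetical final comparison step: each exporter writes one record per line
# ("path<TAB>timestamp<TAB>hash<TAB>committer").  After sorting, the two
# outputs should match exactly.  All data below is made up for illustration.

def records_match(git_lines, cvs_lines):
    """Compare two iterables of record lines, ignoring input order."""
    return sorted(git_lines) == sorted(cvs_lines)

git_records = [
    "src/main.c\t1349430307\tabc123\trich",
    "README\t1349430100\tdef456\trich",
]
cvs_records = [
    "README\t1349430100\tdef456\trich",
    "src/main.c\t1349430307\tabc123\trich",
]

print(records_match(git_records, cvs_records))
```

In practice this would just be a sort-and-diff of two flat files, which is why I want both programs to emit exactly the same record format.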

Functional Implementation
This problem appears to me to be completely parallelizable, and a functional approach might be the easiest way to implement it with existing tools (that is a big maybe).  I've only thought this through in detail for the git case (I'm more familiar with the inner workings there), but cvs with its per-file histories seems quite amenable as well.
Keep in mind I know enough about functional programming to be dangerous.  Here is how I'd do it with a Map/Reduce approach:

Pre-Map: Walk through the git commits from head to tail and output a list containing the following fields:
type (will be tree in all cases here)
path/file (will be null in all cases here)
timestamp
hash
committer
 
This step cannot be run in parallel - we're just traversing a linked list.
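Roughly, the pre-Map walk would shell out to git log, which can emit the tree hash, timestamp, and committer directly.  A Python sketch (helper names are placeholders; the format string uses git log's %T/%ct/%cn placeholders and %x09 for a tab):

```python
# Sequential pre-Map walk, assuming the git executable is on PATH.  Each
# output record is (type, path, timestamp, hash, committer), with type fixed
# to "tree" and path empty, per the description above.

LOG_FORMAT = "%T%x09%ct%x09%cn"  # root tree hash, commit unix time, committer

def premap_command(repo_dir):
    """Build the git invocation; --reverse walks from the initial commit."""
    return ["git", "-C", repo_dir, "log", "--reverse", "--format=" + LOG_FORMAT]

def parse_log_line(line):
    """Turn one tab-separated git log line into a pre-Map record."""
    tree_hash, timestamp, committer = line.split("\t")
    return ("tree", "", int(timestamp), tree_hash, committer)

# One would run this via subprocess.check_output(premap_command(path)) and
# feed each line through parse_log_line.  Example with a made-up line:
print(parse_log_line("9fceb02\t1349430307\tRich Freeman"))
```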

Map: For each entry in the list, if the type is "blob" then output the entry unmodified.  If the type is "tree", output an entry for each element of the tree (append the element name to the path, the hash is that of the tree element, and the timestamp/committer are whatever was in the input entry).

Reduce: For each path/file, sort all list entries by timestamp, then eliminate all but the first element when consecutive elements have the same hash.  The output could have duplicate hashes, but not on consecutive lines.

Then take the output and feed it back into the Map/Reduce steps until the output contains only blob records.  The logic is to traverse the entire tree one level at a time.  In any commit most likely only one file changed, so getting rid of duplicates at each level of the tree will avoid having to walk the entire tree for every commit. 
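To convince myself the loop terminates with the right answer, here is a toy run in Python, with an in-memory object store standing in for the git repository (all hashes and names are invented; a real Map step would call the git executable instead):

```python
# Toy end-to-end run of the Map/Reduce loop described above.  Two commits
# share the same README blob but differ in src/main.c.

OBJECTS = {
    "T1":  {"type": "tree", "entries": {"README": "B1", "src": "T2"}},
    "T1b": {"type": "tree", "entries": {"README": "B1", "src": "T3"}},
    "T2":  {"type": "tree", "entries": {"main.c": "B2"}},
    "T3":  {"type": "tree", "entries": {"main.c": "B3"}},
    "B1":  {"type": "blob"}, "B2": {"type": "blob"}, "B3": {"type": "blob"},
}

def map_step(entry):
    """Pass blobs through unmodified; expand a tree into one entry per child."""
    etype, path, ts, ohash, committer = entry
    if etype == "blob":
        return [entry]
    out = []
    for name, child in sorted(OBJECTS[ohash]["entries"].items()):
        child_path = path + "/" + name if path else name
        out.append((OBJECTS[child]["type"], child_path, ts, child, committer))
    return out

def reduce_step(entries):
    """Per path, sort by timestamp and drop consecutive duplicate hashes."""
    by_path = {}
    for e in entries:
        by_path.setdefault(e[1], []).append(e)
    out = []
    for path in sorted(by_path):
        prev_hash = None
        for e in sorted(by_path[path], key=lambda e: e[2]):
            if e[3] != prev_hash:
                out.append(e)
            prev_hash = e[3]
    return out

# Pre-Map output for the two commits:
records = [("tree", "", 100, "T1", "rich"), ("tree", "", 200, "T1b", "rich")]
while any(e[0] == "tree" for e in records):
    mapped = [m for e in records for m in map_step(e)]
    records = reduce_step(mapped)

for r in records:
    print(r)
```

Note how the unchanged README collapses to a single record after the first pass, so its subtree is never walked twice - which is the whole point of deduplicating at each level.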

So, the "gotcha" I was talking about is that the Map function needs to call the git executable, and it needs access to the git repository (a few GB).  That is a read-only operation, so there are no side effects, but most Map/Reduce examples I've seen assume that the only data involved is the list you're operating on.  You could view the git repository as just a really big lookup table.  So conceptually the need to access the repository shouldn't be a problem, but in terms of actual implementations for clusters/etc. it might be problematic.

If I were implementing this in a cluster, I'd think the simplest solution would be to have the git/cvs executable on all nodes, and a copy of the repository accessible to each node.  On something like EC2 I'd just install git/cvs on the cluster worker images, and I'd stick the repository in an EBS snapshot that each could mount and reference.

So, how's that for a first post?  Anybody have any pointers they'd like to offer?  I'm perfectly happy with building EC2 images and all that - most likely I'd like them to run Gentoo, but if some platform-as-a-service solution will do the job I don't think that is a must, and I'm sure I can get anything FOSS running on Gentoo anyway if it isn't already packaged.