Rich Freeman on 20 Mar 2013 09:56:37 -0700



Re: Intro / Question


On Tue, Oct 30, 2012 at 11:17 AM, Rich Freeman <rich@thefreemanclan.net> wrote:
> On Tue, Oct 30, 2012 at 11:03 AM, Aaron Feng <aaron.feng@gmail.com> wrote:
>> StarCluster looks interesting I should really check it out when I have time.
>> Just curious, how big is your cluster and how long does it take for
>> the job to finish?
>
> I can run 12 iterations on an 8-node m1.large cluster on EC2 on a
> repository with 600k commits that is about 2GB in size.  That is with
> standard EBS volumes - not the IO-optimized ones.  After a while I'd
> think the entire git repository would be cached since this is
> read-only access, so that shouldn't matter too much.  The first 2-3
> iterations take most of the time.
>

Since it has been a while since I last posted about this, I figured
I'd share a quick update on where it ended up.

For the most part Hadoop was working fine, but I found that the
overhead of transferring all the data to EC2 and getting the cluster
running wasn't really worth it when the whole processing job only took
maybe two hours.  Obviously Hadoop has the advantage that I could
scale this problem arbitrarily, but it was just overkill in this case.

I was also getting some records in my final output which should not
have been there.  This is probably the result of the sort phase of
Hadoop not getting all the records with the same key into the same
reduce node together, which would seem like a bug.  However, I didn't
pursue this far enough to confirm what is going on.  What I did
confirm is that if I just piped the map script into sort and then into
the reduce script, I got the correct output.
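In other words, something along these lines (the script and file names
here are made up - the real ones are in the repo below) gave the right
answer when run locally:

  # Sanity check outside of Hadoop: the same map and reduce scripts,
  # with a plain sort doing the grouping by key.
  ./map.sh < commits.txt | sort -k1,1 | ./reduce.sh > expected.out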

In the end, because of this issue and the performance issues, I ended
up running my map/reduce programs on top of GNU parallel instead.
They all worked via stdin/stdout anyway, so as long as I didn't mind
being limited to a single node I could use GNU parallel to run the map
scripts in parallel.  It actually has one performance advantage over
Hadoop - since this is an iterative process, the map step of the next
stage can run in parallel with the reduce step of the previous stage.
With simple shell piping I couldn't get reduce itself to run in
parallel, but since it runs alongside the next stage's map it doesn't
really slow anything down.  If you look at the script, basically all
it is doing is grabbing the input and piping it into a map | sort |
reduce | map | sort | reduce command (repeated 13 times).  The sort
steps of course prevent the pipeline from streaming completely
start-to-finish, but it keeps the CPU quite warm all the same (oh, and
it is important to ensure you have plenty of TMPDIR space for the
sorts).
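For anyone curious, a rough sketch of that kind of pipeline (the
script names and --block size here are made up - see the repo below
for the real thing) looks something like:

  #!/bin/bash
  # Give sort plenty of scratch space for its temp files.
  export TMPDIR=/var/tmp/bigsort
  mkdir -p "$TMPDIR"

  # Each stage is map | sort | reduce; GNU parallel fans the map script
  # out over chunks of its stdin, and because it is all one pipeline
  # the next stage's map runs while the previous stage's reduce is
  # still producing output.
  cat input.txt \
    | parallel --pipe --block 32M ./map1.sh | sort -k1,1 | ./reduce1.sh \
    | parallel --pipe --block 32M ./map2.sh | sort -k1,1 | ./reduce2.sh \
    > final.out
  # ...with the map | sort | reduce segment repeated once per iteration.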

I probably won't be tweaking this much more - the project has just
about served its purpose.  However, the sources are all on github:
git://github.com/rich0/gitvalidate.git

Oh, and if you are looking for some scripts to set up basic
clustering tools on EC2, definitely check out starcluster.  It isn't
100% there feature-wise (removing Hadoop nodes isn't really
well-supported, aside from just taking them down and making the
cluster deal with it).  However, it does do a fair bit of automation
to make things easier (NFS shares across the cluster, shortcuts to ssh
into various nodes, key management, etc).  If you need to tweak the
node configuration it is pretty easy to just spawn a node, reconfigure
it (install packages/etc), reimage it, and use that as the node image
for a cluster.
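That workflow is roughly the following (from memory, so check the
starcluster docs for the exact syntax - the instance and AMI IDs are
placeholders):

  # Spin up a one-node cluster just to customize it.
  starcluster start -s 1 imgbuilder
  starcluster sshmaster imgbuilder     # ssh in, install packages, etc.

  # Snapshot the customized instance as a new AMI, then shut it down.
  starcluster ebsimage i-xxxxxxxx my-node-image
  starcluster terminate imgbuilder

  # Point the cluster template at the new AMI in ~/.starcluster/config:
  #   NODE_IMAGE_ID = ami-xxxxxxxx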

Rich
