Rich Freeman on 20 Mar 2013 09:56:37 -0700
Re: Intro / Question
On Tue, Oct 30, 2012 at 11:17 AM, Rich Freeman <rich@thefreemanclan.net> wrote:
> On Tue, Oct 30, 2012 at 11:03 AM, Aaron Feng <aaron.feng@gmail.com> wrote:
>> StarCluster looks interesting. I should really check it out when I have time.
>> Just curious, how big is your cluster and how long does it take for
>> the job to finish?
>
> I can run 12 iterations on an 8-node m1.large cluster on EC2 on a
> repository with 600k commits that is about 2GB in size. That is with
> standard EBS volumes - not the IO-optimized ones. After a while I'd
> think the entire git repository would be cached since this is
> read-only access, so that shouldn't matter too much. The first 2-3
> iterations take most of the time.

Since it has been a while since I last wrote about this, I figured I'd share a quick update on where it ended up.

For the most part Hadoop was working fine, but I found that the overhead of transferring all the data to EC2 and getting the cluster running wasn't worth it when the whole processing job only took about two hours. Obviously Hadoop has the advantage that I could scale this problem arbitrarily, but it was just overkill in this case.

I was also getting some records in my final output which should not have been there. This is probably the result of the sort phase of Hadoop failing to deliver all the records with the same key to the same reduce node together, which would seem like a bug. However, I didn't pursue it far enough to confirm what was going on. What I did confirm is that if I just piped the map script into sort and into the reduce script, I got the correct output.

In the end, because of this issue and the performance issues, I ran my map/reduce programs on top of GNU parallel instead. They all worked via stdin/stdout anyway, so as long as I don't mind being limited to a single node, I can use GNU parallel to run the map scripts in parallel.
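To illustrate the contract involved, here is a minimal sketch of a single map | sort | reduce stage. The map and reduce bodies are hypothetical awk stand-ins (a word count), not the actual gitvalidate scripts, but they honor the same stdin/stdout convention Hadoop streaming assumes: map emits key<TAB>value lines, sort groups them by key, and reduce folds each key's group.

```shell
# map | sort | reduce with awk stand-ins (hypothetical) for the real scripts
printf 'a b a\nb a\n' |
  awk '{ for (i = 1; i <= NF; i++) print $i "\t1" }' |  # map: emit key<TAB>1
  sort |                                                # group records by key
  awk -F'\t' '
    $1 != key { if (key != "") print key, sum; key = $1; sum = 0 }
              { sum += $2 }
    END       { if (key != "") print key, sum }'        # reduce: sum per key
```

This prints "a 3" and "b 2". Because the map half is a pure stdin/stdout filter, it is also the part GNU parallel can fan out across cores (e.g. with `parallel --pipe`, which splits stdin into chunks), as long as everything is funneled back through a single sort before reduce.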
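Since the job is iterative, the stages can simply be chained into one long pipeline. A sketch, again with hypothetical stand-in stage bodies (increment each record, then deduplicate) rather than the real scripts:

```shell
# Each stage is map | sort | reduce; chaining stages into one pipeline
# lets the next stage's map run while the previous reduce is still emitting.
# Stage bodies below are hypothetical stand-ins for the real scripts.
export TMPDIR=${TMPDIR:-/var/tmp}  # sort spills temp files here; needs room

stage() {
  awk '{ print $1 + 1 }' |  # "map": transform each record
    sort -n |               # grouping step between map and reduce
    uniq                    # "reduce": collapse duplicate keys
}

# Three chained iterations here; the real job chained more.
printf '1\n1\n2\n' | stage | stage | stage
```

Each `stage` blocks on its internal sort, so the pipeline cannot stream end-to-end, but adjacent stages still overlap, which is the performance win described below the fold.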
This approach actually has one performance advantage over Hadoop: the process is iterative, and the map step of the next stage can run in parallel with the reduce step of the previous stage. With some simple shell piping I couldn't get reduce itself to run in parallel, but since it runs concurrently with the next stage's map, it doesn't really slow anything down.

If you look at the script, basically all it is doing is grabbing input and piping it into a map | sort | reduce | map | sort | reduce command (repeated 13 times). The sort steps of course prevent the pipeline from streaming completely start-to-finish, but it keeps the CPU quite warm all the same. (Oh, and it is important to ensure you have plenty of TMPDIR space for the sorts.)

I probably won't be tweaking this much more - the project has just about served its purpose. However, the sources are all on github:

git://github.com/rich0/gitvalidate.git

Oh, and if you are looking for some scripts to set up basic clustering tools on EC2, definitely check out StarCluster. It isn't 100% there feature-wise (removing Hadoop nodes isn't really well-supported, aside from just taking them down and making the cluster deal with it). However, it does do a fair bit of automation to make things easier (NFS shares across the cluster, shortcuts to ssh into various nodes, key management, etc.). If you need to tweak the node configuration, it is pretty easy to just spawn a node, reconfigure it (install packages, etc.), reimage it, and use that as the node image for a cluster.

Rich

--
You received this message because you are subscribed to the Google Groups "Philly Lambda" group.
To unsubscribe from this group and stop receiving emails from it, send an email to philly-lambda+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.