| bergman on 10 Mar 2016 18:02:37 -0800 |
| Re: [PLUG] Command line tools faster than Hadoop cluster |
In the message dated: Thu, 10 Mar 2016 18:42:23 -0500,
The pithy ruminations from Eric Lucas on
<[PLUG] Command line tools faster than Hadoop cluster> were:
=> I just throw this out for your perusal - I stumbled across it this morning
=> and found it interesting.
Thanks for sending that around.
I found it interesting, and was thinking of showing it to some of the
data analysts here at $WORK ...and then the author lost [almost] all
credibility at only the 2nd example due to the useless use of cat[1]
when discussing optimization:
    cat *.pgn | grep "Result"
That pipe is not in any way "parallel"...if it were, then
    cat *.pgn | cat | cat | cat | grep "Result"
would be even faster!
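A quick sketch of the same search without cat, using two throwaway sample
files (the scratch directory, file names, and contents are made up for
illustration). grep can open the files itself; -h, available in GNU and BSD
grep, suppresses the file name prefix so the output matches the cat version.
This only shows the two forms are equivalent; it makes no claim about timing.

```shell
# Hypothetical sample data in a scratch directory.
tmp=$(mktemp -d)
printf '[Event "A"]\n[Result "1-0"]\n' > "$tmp/a.pgn"
printf '[Event "B"]\n[Result "0-1"]\n' > "$tmp/b.pgn"

# The form from the article: cat feeds grep through a pipe.
with_cat=$(cat "$tmp"/*.pgn | grep "Result")

# grep reads the files directly; -h drops the "file:" prefix on each match.
without_cat=$(grep -h "Result" "$tmp"/*.pgn)

printf '%s\n' "$without_cat"
rm -rf "$tmp"
```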
Next is the useless use of grep, where:
    cat *.pgn | grep "Result" | awk '{split($0, a, "-"); ....
could be replaced with
    awk '{ if ( $0 ~ /Result/ ) { split($0, a, "-"); .... *.pgn
(hmmm... I'd need to check whether awk buffers all input if it's reading from
files instead of stdin...maybe this would be slower).
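A minimal sketch of folding the grep into awk, assuming lines of the form
[Result "1-0"] (the sample file and the print statement are illustrative,
not the article's full script). An awk pattern such as /Result/ selects the
lines that contain the string, so a single process does both the filtering
and the splitting.

```shell
# Hypothetical sample data in a scratch directory.
tmp=$(mktemp -d)
printf '[Event "A"]\n[Result "1-0"]\n[Result "1/2-1/2"]\n' > "$tmp/a.pgn"

# Two processes: grep filters the lines, awk splits them on "-".
two_procs=$(grep -h "Result" "$tmp"/*.pgn |
    awk '{ split($0, a, "-"); print a[1] }')

# One process: the /Result/ pattern filters, the action splits.
one_proc=$(awk '/Result/ { split($0, a, "-"); print a[1] }' "$tmp"/*.pgn)

printf '%s\n' "$one_proc"
rm -rf "$tmp"
```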
Then we've got all those substring matches & stuff in awk. I wonder if
it'd be faster to use the data as a numerical value rather than working
with strings...
Remember, the 3 fields are:
    value        chess game outcome
    =========    ==================
    1/2 - 1/2    draw
    0 - 1        black won
    1 - 0        white won
those look suspiciously like, um, math to me. It'd take a bit of fiddling
with awk (or probably perl, or python, or something else), but I bet
that treating those lines as equations rather than strings would be
more efficient.
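One way the "math" idea might look in awk, as a sketch only: it assumes
lines of the form [Result "1-0"], and the sample data and the names r, w, b,
and tally are made up for illustration. awk converts a string to a number by
reading its leading numeric prefix, so "1/2" becomes 1 on both sides of a
draw, and the outcome falls out of a numeric comparison. Whether this is
actually faster than the substring approach would need measuring.

```shell
# Hypothetical sample data: two white wins, one black win, one draw.
tmp=$(mktemp -d)
printf '[Result "1-0"]\n[Result "0-1"]\n[Result "1/2-1/2"]\n[Result "1-0"]\n' \
    > "$tmp/a.pgn"

# -F'"' makes $2 the score (e.g. 1-0); split it on "-" and compare numbers.
tally=$(awk -F'"' '
/Result/ {
    split($2, r, "-")
    w = r[1] + 0              # "1/2" + 0 evaluates to 1 (numeric prefix)
    b = r[2] + 0
    if (w > b) white++
    else if (w < b) black++
    else draw++               # equal scores on both sides: a draw
}
END { print white + 0, black + 0, draw + 0 }' "$tmp"/*.pgn)

printf 'white black draw: %s\n' "$tally"
rm -rf "$tmp"
```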
After all that nitpicking, I really, really did like the trick of
parallelizing through find & xargs.
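A minimal sketch of that find & xargs pattern, with made-up sample files and
a simple per-file count standing in for the article's awk script. The -P
option is a GNU/BSD xargs extension rather than POSIX, and the choice of 4
workers here is arbitrary.

```shell
# Hypothetical sample data in a scratch directory.
tmp=$(mktemp -d)
printf '[Result "1-0"]\n[Result "0-1"]\n' > "$tmp/a.pgn"
printf '[Result "1-0"]\n[Result "1/2-1/2"]\n' > "$tmp/b.pgn"

# -print0 / -0 keep odd file names intact; -n1 hands each grep one file;
# -P4 lets up to four greps run at once. Each grep -c prints one count,
# and the final awk sums the per-file counts.
total=$(find "$tmp" -type f -name '*.pgn' -print0 |
    xargs -0 -n1 -P4 grep -c "Result" |
    awk '{ n += $1 } END { print n + 0 }')

printf 'Result lines: %s\n' "$total"
rm -rf "$tmp"
```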
For me, the big take-away is to use the right tool (or class of tools)
for the job, and even "big data" tools can be slow when used improperly.
Mark
[1] http://porkmail.org/era/unix/award.html
=>
=> http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
=>
=> I knew about the shell behavior but never really thought of it as
=> 'parallel' processing. DOH!
=>
=> Even the implementer of the Hadoop solution admitted it was not the right
=> solution but I empathize with his desire to learn while solving problems.
=>
=> Eric
=>
=> BTW it was linked from this article:
=> http://idlewords.com/talks/website_obesity.htm
=> which is amazing
=>
=> ___________________________________________________________________________
=> Philadelphia Linux Users Group -- http://www.phillylinux.org
=> Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
=> General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug
=>
--
Mark Bergman    Biker, Rock Climber, SCUBA Diver, Unix mechanic, IATSE #1 Stagehand
'94 Yamaha GTS1000A^2
bergman@panix.com    https://www.flickr.com/photos/rmsppu
http://wwwkeys.pgp.net:11371/pks/lookup?op=get&search=bergman%40panix.com
I want a newsgroup with a infinite S/N ratio! Now taking CFV on:
rec.motorcycles.stagehands.pet-bird-owners.pinballers.unix-supporters
15+ So Far--Want to join? Check out: http://www.panix.com/~bergman