bergman on 10 Mar 2016 18:02:37 -0800
Re: [PLUG] Command line tools faster than Hadoop cluster
In the message dated: Thu, 10 Mar 2016 18:42:23 -0500,
The pithy ruminations from Eric Lucas on
<[PLUG] Command line tools faster than Hadoop cluster> were:
=> I just throw this out for your perusal - I stumbled across it this morning
=> and found it interesting.

Thanks for sending that around. I found it interesting, and was thinking of
showing it to some of the data analysts here at $WORK

...and then the author lost [almost] all credibility at only the 2nd example
due to the useless use of cat[1] when discussing optimization:

	cat *.pgn | grep "Result"

That pipe is not in any way "parallel"...if it was, then

	cat *.pgn | cat | cat | cat | grep "Result"

would be even faster!

Next is the useless use of grep, where:

	cat *.pgn | grep "Result" | awk '{split($0, a, "-"); ....

could be replaced with

	awk '{ if ( $0 ~ /Result/ ) { split($0, a, "-"); .... *.pgn

(hmmm... I'd need to check whether awk buffers all input if it's reading from
files instead of stdin...maybe this would be slower).

Then we've got all those substring matches & stuff in awk. I wonder if it'd be
faster to use the data as a numerical value rather than working with strings...

Remember, the 3 possible results are:

	value		chess game outcome
	=====		==================
	1/2 - 1/2	draw
	0 - 1		black won
	1 - 0		white won

those look suspiciously like, um, math to me. It'd take a bit of fiddling with
awk (or probably perl, or python, or something else), but I bet that treating
those lines as equations rather than strings would be more efficient.

After all that nitpicking, I really, really did like the trick of parallelizing
through find & xargs.

For me, the big take-away is to use the right tool (or class of tools) for the
job, and even "big data" tools can be slow when used improperly.

Mark

[1] http://porkmail.org/era/unix/award.html

=>
=> http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
=>
=> I knew about the shell behavior but never really thought of it as
=> 'parallel' processing. DOH!
=>
=> Even the implementer of the Hadoop solution admitted it was not the right
=> solution but I empathize with his desire to learn while solving problems.
=>
=> Eric
=>
=> BTW it was linked from this article:
=> http://idlewords.com/talks/website_obesity.htm
=> which is amazing
=>
=> ___________________________________________________________________________
=> Philadelphia Linux Users Group -- http://www.phillylinux.org
=> Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
=> General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug
=>

--
Mark Bergman    Biker, Rock Climber, SCUBA Diver, Unix mechanic, IATSE #1 Stage hand
'94 Yamaha GTS1000A^2
bergman@panix.com
https://www.flickr.com/photos/r msppu
http://wwwkeys.pgp.net:11371/pks/lookup?op=get&search=bergman%40panix.com

I want a newsgroup with a infinite S/N ratio! Now taking CFV on:
rec.motorcycles.stagehands.pet-bird-owners.pinballers.unix-supporters
15+ So Far--Want to join? Check out: http://www.panix.com/~bergman
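
For anyone who wants to try the two suggestions from the message above (letting awk read the files itself instead of going through cat and grep, and fanning the work out with find & xargs), here is a rough sketch. It is not the pipeline from the linked article. The sample files, the `tally` variable, and the choice of 2 parallel jobs are all made up for illustration, and `-print0`, `-0` and `-P` are assumed to be available (they are in GNU and BSD find/xargs, but are not required by POSIX).

```shell
#!/bin/sh
# Hypothetical sample data: two tiny PGN-like files in a scratch directory.
tmpdir=$(mktemp -d)
printf '[Event "A"]\n[Result "1-0"]\n[Result "1/2-1/2"]\n' > "$tmpdir/a.pgn"
printf '[Result "0-1"]\n[Result "1-0"]\n' > "$tmpdir/b.pgn"

# awk program (illustrative): with '"' as the field separator, $2 of a
# Result line is the score, e.g. 1-0. Prints "white black draw" counts.
tally='/^\[Result/ {
    split($2, a, "-")
    if (a[1] == "1") { white++ }
    else if (a[1] == "0") { black++ }
    else { draw++ }
}
END { print white + 0, black + 0, draw + 0 }'

# Serial: awk opens the files itself, so neither cat nor grep is needed.
serial=$(awk -F'"' "$tally" "$tmpdir"/*.pgn)

# Parallel: one awk per file, up to 2 at a time, then sum the partial counts.
parallel=$(find "$tmpdir" -type f -name '*.pgn' -print0 |
    xargs -0 -n 1 -P 2 awk -F'"' "$tally" |
    awk '{ w += $1; b += $2; d += $3 } END { print w + 0, b + 0, d + 0 }')

echo "serial:   $serial"
echo "parallel: $parallel"

rm -rf "$tmpdir"
```

Both variables should hold the same three counts (white wins, black wins, draws). Whether the parallel form is actually faster depends on file sizes and core count, so it is worth timing both on real data before drawing conclusions.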