Re: [PLUG] Command line tools faster than Hadoop cluster

bergman on 10 Mar 2016 18:02:37 -0800


From: bergman@merctech.com
To: "Philadelphia Linux User's Group Discussion List" <plug@lists.phillylinux.org>
Subject: Re: [PLUG] Command line tools faster than Hadoop cluster
Date: Thu, 10 Mar 2016 21:02:32 -0500
Reply-to: bergman@merctech.com, Philadelphia Linux User's Group Discussion List <plug@lists.phillylinux.org>
Sender: "plug" <plug-bounces@lists.phillylinux.org>

In the message dated: Thu, 10 Mar 2016 18:42:23 -0500,
The pithy ruminations from Eric Lucas on 
<[PLUG] Command line tools faster than Hadoop cluster> were:
=> I just throw this out for your perusal - I stumbled across it this morning
=> and found it interesting.

Thanks for sending that around.

I found it interesting, and was thinking of showing it to some of the
data analysts here at $WORK ...and then the author lost [almost] all
credibility at only the 2nd example due to the useless use of cat[1]
when discussing optimization:
	cat *.pgn | grep "Result"

That pipe is not in any way "parallel"...if it were, then 
	cat *.pgn | cat | cat | cat | grep "Result"
would be even faster!
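
For comparison, here is a minimal sketch of the same match with grep
reading the files itself. The temp directory and the two tiny .pgn files
are made up so the snippet runs on its own, and -h (GNU and BSD grep) is
only there to drop the file-name prefix, so the output is the same as
the cat version:

```shell
# Hypothetical sample data, only so the sketch is self-contained.
demo=$(mktemp -d)
printf '[Event "A"]\n[Result "1-0"]\n' > "$demo/a.pgn"
printf '[Event "B"]\n[Result "0-1"]\n' > "$demo/b.pgn"

# The pipeline from the article, for reference.
with_cat=$(cat "$demo"/*.pgn | grep "Result")

# Same lines, one process fewer: grep opens the files directly.
# -h suppresses the "file:" prefix grep prints for multiple files.
without_cat=$(grep -h "Result" "$demo"/*.pgn)

printf '%s\n' "$without_cat"
```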

Next is the useless use of grep, where:
	cat *.pgn | grep "Result" | awk '{split($0, a, "-"); ....

could be replaced with
	awk '/Result/ { split($0, a, "-"); .... }' *.pgn
(hmmm... I'd need to check whether awk buffers all input if it's reading from
files instead of stdin...maybe this would be slower).
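
A rough sketch of the awk-only version, to make the idea concrete. The
counting logic is my own guess at what belongs in the elided part, not
the article's code, and it assumes the result lines look like
[Result "1-0"]. The sample files are invented. As far as I know, awk
reads named files a record at a time, the same as it does stdin.

```shell
# Hypothetical sample data, only so the sketch is self-contained.
demo=$(mktemp -d)
printf '[Result "1-0"]\n[Result "1/2-1/2"]\n' > "$demo/a.pgn"
printf '[Result "0-1"]\n[Result "1-0"]\n'     > "$demo/b.pgn"

# The /Result/ pattern replaces grep; awk opens the files itself.
tally=$(awk '/Result/ {
    split($0, a, "-")                     # a[1] ends in 1, 0, or 2 (from 1/2)
    res = substr(a[1], length(a[1]), 1)   # last character before the "-"
    if (res == "1") white++
    else if (res == "0") black++
    else if (res == "2") draw++
} END { print white + black + draw, white + 0, black + 0, draw + 0 }' "$demo"/*.pgn)

echo "$tally"    # games white black draw
```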

Then we've got all those substring matches & stuff in awk. I wonder if
it'd be faster to use the data as a numerical value rather than working
with strings...

Remember, the 3 possible values are:

	value		chess game outcome
	=====		==================
	1/2 - 1/2	draw
	0 - 1		black won
	1 - 0		white won

Those look suspiciously like, um, math to me. It'd take a bit of fiddling
with awk (or probably perl, or python, or something else), but I bet
that treating those lines as equations rather than strings would be
more efficient.
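
One way the "equation" idea could look in awk, purely as a sketch: split
the quoted result on "-", turn each side into a number, and subtract.
That gives 1 for a white win, -1 for a black win, and 0 for a draw. The
value() helper, the sample file, and the [Result "1-0"] line format are
all assumptions of mine, and whether this beats the substring version
would have to be measured.

```shell
# Hypothetical sample data, only so the sketch is self-contained.
demo=$(mktemp -d)
printf '[Result "1-0"]\n[Result "1/2-1/2"]\n[Result "0-1"]\n[Result "1-0"]\n' \
    > "$demo/a.pgn"

# With a double quote as field separator, $2 is the bare result, e.g. 1-0.
# Lines whose result has no "-" (such as unfinished games) are skipped.
tally=$(awk -F'"' '
function value(s,   p) {
    # "1/2" becomes 0.5; "1" and "0" are used as they are.
    if (split(s, p, "/") == 2) return p[1] / p[2]
    return s + 0
}
/Result/ && $2 ~ /-/ {
    split($2, side, "-")
    score = value(side[1]) - value(side[2])   # 1, -1, or 0
    count[score]++
}
END { print count[1] + 0, count[-1] + 0, count[0] + 0 }' "$demo"/*.pgn)

echo "$tally"    # white black draw
```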

After all that nitpicking, I really, really did like the trick of
parallelizing through find & xargs.
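
For anyone who hasn't seen the trick, a small sketch of the general
shape. The sample files and the choice of 4 workers are arbitrary
assumptions; -print0, -0 and -P are supported by GNU and BSD find and
xargs, though -P is not in POSIX.

```shell
# Hypothetical sample data, only so the sketch is self-contained.
demo=$(mktemp -d)
for i in 1 2 3 4; do
    printf '[Result "1-0"]\n[Result "0-1"]\n' > "$demo/game$i.pgn"
done

# -print0 / -0 keep unusual file names intact, -n1 gives each grep one
# file, and -P4 lets up to four greps run at the same time.
matches=$(find "$demo" -type f -name '*.pgn' -print0 |
    xargs -0 -n1 -P4 grep -h "Result" |
    awk 'END { print NR }')

echo "$matches"    # number of Result lines found across all files
```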

For me, the big take-away is to use the right tool (or class of tools)
for the job, and even "big data" tools can be slow when used improperly.

Mark

[1] http://porkmail.org/era/unix/award.html

=> 
=> http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
=> 
=> I knew about the shell behavior but never really thought of it as
=> 'parallel' processing.   DOH!
=> 
=> Even the implementer of the Hadoop solution admitted it was not the right
=> solution but I empathize with his desire to learn while solving problems.
=> 
=> Eric
=> 
=> BTW it was linked from this article:
=> http://idlewords.com/talks/website_obesity.htm
=> which is amazing
=> 
=> ___________________________________________________________________________
=> Philadelphia Linux Users Group         --        http://www.phillylinux.org
=> Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
=> General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug
=> 
-- 
Mark Bergman    Biker, Rock Climber, SCUBA Diver, Unix mechanic, IATSE #1 Stagehand
'94 Yamaha GTS1000A^2
bergman@panix.com 				https://www.flickr.com/photos/rmsppu

http://wwwkeys.pgp.net:11371/pks/lookup?op=get&search=bergman%40panix.com

I want a newsgroup with an infinite S/N ratio! Now taking CFV on:
rec.motorcycles.stagehands.pet-bird-owners.pinballers.unix-supporters
15+ So Far--Want to join? Check out: http://www.panix.com/~bergman 

Follow-Ups:
- Re: [PLUG] Command line tools faster than Hadoop cluster
  - From: brent timothy saner <brent.saner@gmail.com>

References:
- [PLUG] Command line tools faster than Hadoop cluster
  - From: Eric Lucas <eric@lucii.org>
