bergman on 10 Mar 2016 18:02:37 -0800


Re: [PLUG] Command line tools faster than Hadoop cluster

In the message dated: Thu, 10 Mar 2016 18:42:23 -0500,
The pithy ruminations from Eric Lucas on 
<[PLUG] Command line tools faster than Hadoop cluster> were:
=> I just throw this out for your perusal - I stumbled across it this morning
=> and found it interesting.

Thanks for sending that around.

I found it interesting, and was thinking of showing it to some of the
data analysts here at $WORK ...and then the author lost [almost] all
credibility at only the 2nd example due to the useless use of cat[1]
when discussing optimization:
	cat *.pgn | grep "Result"

That pipe is not in any way "parallel"...if it were, then 
	cat *.pgn | cat | cat | cat | grep "Result"
would be even faster!
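
A minimal sketch of the cat-free version, assuming all the .pgn files sit
in one directory. The function name and the directory argument are made up
for illustration; they are not from the article.

```shell
# Print every line containing "Result" without the extra cat process.
# -h suppresses the "filename:" prefix that grep adds when it is given
# several files, so the output matches `cat *.pgn | grep "Result"`.
result_lines() {
    # "$1" is a directory holding .pgn files (hypothetical layout)
    grep -h "Result" "$1"/*.pgn
}
```

Usage would be something like `result_lines ./games | wc -l`.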

Next is the useless use of grep, where:
	cat *.pgn | grep "Result" | awk '{split($0, a, "-"); ....

could be replaced with
	awk '/Result/ { split($0, a, "-"); ....   *.pgn
(hmmm... I'd need to check whether awk buffers all input if it's reading from
files instead of stdin...maybe this would be slower).
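
Here is one way a grep-free awk could look, as a sketch only. It is not the
article's awk program: it assumes PGN tag lines of the form [Result "1-0"]
and simply counts each distinct score string. The function name is
hypothetical.

```shell
# Let awk do the filtering that grep was doing, reading the files directly.
# Assumes tag lines like: [Result "1-0"]
tally_by_string() {
    # With -F'"' the score between the double quotes lands in $2.
    awk -F'"' '
        /^\[Result/ { n[$2]++ }                  # pattern replaces grep
        END { for (r in n) print r, n[r] }       # one line per score seen
    ' "$@" | LC_ALL=C sort
}
```

Usage would be something like `tally_by_string *.pgn`.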

Then we've got all those substring matches & stuff in awk. I wonder if
it'd be faster to use the data as a numerical value rather than working
with strings...

Remember, the 3 possible results are:

	value		chess game outcome
	=====		==================
	1/2 - 1/2	draw
	0 - 1		black won
	1 - 0		white won

those look suspiciously like, um, math to me. It'd take a bit of fiddling
with awk (or probably perl, or python, or something else), but I bet
that treating those lines as equations rather than strings would be
more efficient.
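
A rough sketch of the "treat it as math" idea in awk, assuming tag lines of
the form [Result "1-0"] (no spaces around the dash). It leans on awk's
string-to-number conversion, which reads "1/2" as 1, so the sign of
left minus right gives the outcome. Whether this is actually faster than the
substring version is untested here; the function name is hypothetical.

```shell
# Tally outcomes by arithmetic on the score instead of substring tests.
# Prints three numbers: white wins, black wins, draws.
tally_numeric() {
    awk -F'"' '
        /^\[Result/ {
            # $2 is the quoted score, e.g. 1-0, 0-1 or 1/2-1/2
            if (split($2, s, "-") != 2) next   # skip "*" (unfinished game)
            # "1/2" converts to 1, so a draw gives 1 - 1 = 0
            d = s[1] - s[2]
            if (d > 0) white++; else if (d < 0) black++; else draw++
        }
        END { print white + 0, black + 0, draw + 0 }
    ' "$@"
}
```

Usage would be something like `tally_numeric *.pgn`.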

After all that nitpicking, I really, really did like the trick of
parallelizing through find & xargs.
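
For reference, the general shape of that trick, as a sketch rather than the
article's exact command. The batch size and process count are arbitrary
illustration values, the function name is made up, and -print0, -0 and -P
are GNU/BSD extensions rather than POSIX.

```shell
# Run one grep per batch of files, several batches at a time.
parallel_result_lines() {
    # "$1" is the directory to search (hypothetical layout).
    # -print0 / -0 keep filenames with spaces intact.
    # -n 4: at most 4 files per grep; -P 4: up to 4 greps at once.
    find "$1" -type f -name '*.pgn' -print0 |
        xargs -0 -n 4 -P 4 grep -h "Result"
}
```

Output order is not deterministic when the greps run in parallel, so pipe
it into something order-independent (a count or a tally).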

For me, the big take-away is to use the right tool (or class of tools)
for the job, and even "big data" tools can be slow when used improperly.



=> I knew about the shell behavior but never really thought of it as
=> 'parallel' processing.   DOH!
=> Even the implementer of the Hadoop solution admitted it was not the right
=> solution but I empathize with his desire to learn while solving problems.
=> Eric
=> BTW it was linked from this article:
=> which is amazing
=> ___________________________________________________________________________
=> Philadelphia Linux Users Group         --
=> Announcements -
=> General Discussion  --
Mark Bergman    Biker, Rock Climber, SCUBA Diver, Unix mechanic, IATSE #1 Stage
'94 Yamaha GTS1000A^2

I want a newsgroup with an infinite S/N ratio! Now taking CFV on:
15+ So Far--Want to join? Check out: 