Re: [PhillyOnRails] jruby + hadoop?

Toby DiPasquale on 6 Sep 2007 13:47:14 -0000

From: Toby DiPasquale <toby@cbcg.net>

To: talk@phillyonrails.org

Subject: Re: [PhillyOnRails] jruby + hadoop?

Date: Thu, 6 Sep 2007 09:46:57 -0400

List-archive: <http://lists.phillyonrails.org/pipermail/talk>

Reply-to: talk@phillyonrails.org

Sender: talk-bounces@phillyonrails.org

On Wed, Sep 05, 2007 at 08:54:57PM -0700, Aaron Blohowiak wrote:
> That's not quite fair, Toby. I agree that Casey was wrong to suggest
> that RDBMs' text search hacks are adequate means to handle
> unstructured data, but I didn't care for your response. Casey raised a
> much larger issue that you did not address. I usually respect your
> opinion (if not your demeanor,) but I was disappointed to see you give
> in to resorting to straw-man argument instead of continuing (mostly)
> reasoned discussion.

You're right. I apologize.

> Please address what he mentions when he said:
> "BerkeleyDBHA does exactly what pgpool/pgcluster/oracle/etc all do -- it
> intercepts the sql log and splits up and sends parts out to the nodes
> for processing. What does that have to do with 'statefulness'? As far as
> I can imagine, there are two possible interpretations of 'stateless
> database': either a database that has no knowledge of which agent is
> initiating the transaction, or a database that returns the same result
> for any transaction at any time. In the former case, every database
> would be considered stateless. In the latter case, only a readonly
> database is stateless, and all others are stateful, including
> BerkeleyDBHA. Either way, judging a database as 'stateful' or
> 'stateless' seems nonsensical to me. I think the author of the link
> above would have done better with the term 'balanced' or 'load balanced'
> or 'concurrent'. The comparison between RDBMS and HTTP therefore doesn't
> make sense to me in the context of 'statelessness'."
>
> State vs statelessness on the DB level is tricky for me to grasp. Can
> you shed some light on the issue?

The trick to understanding the state here is that it's everywhere in a
traditional RDBMS, not just in the data. The schema and the query both
have state. The schema cannot be changed without big pain (the bigger
the data and the more tables, the more pain). As a result, the SQL
queries themselves also carry state, and that limits their
parallelizability. Plus, the very power of SQL limits it in other ways,
too.

Perhaps a better way to think about it is this: SQL gives you the full
range of power to query structured data in any way you choose.
MapReduce gives you a handful of operations, chosen very carefully, to
query and filter semi-structured data. MapReduce programmers appear to
be more constrained than SQL programmers. However, those who implement
these systems are in the reverse positions. SQL implementers have to
implement lots and lots of stuff, and this constrains what they can
really do. MapReduce chooses a few operations, and those are chosen to
allow a range of implementations (the most important of which is, of
course, the distributed one from the paper and Hadoop).

I guess the underlying point here is that RDBMSs handle structured data
and are very good at that, so if you have structured data, you should
use one. However, if you don't, you should look at something different.
A lot of the overhead and complexity that comes with RDBMSs is there to
enforce the structure of the data; that is, after all, their raison
d'etre. RDBMSs will never scale to the size of a Google cluster because
of how they work, not because of what they contain.

So, in the end, when we refer to stateful versus stateless, we are
talking about the database itself, not the data inside of it.

If you'd like a pretty good book on how SQL works under the hood and
the math behind it, check out The Art of SQL by Stephane Faroult.

-- 
Toby DiPasquale
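
To make the "handful of operations" point concrete, here is a minimal single-process sketch of the map / group / reduce pattern, written as a word count. It is only an illustration under stated assumptions: the function names (map_words, shuffle, reduce_counts, map_reduce) are made up for this example and are not Hadoop's API, and everything runs in memory on one machine. What it tries to show is that the map and reduce steps carry no shared state, which is what leaves an implementer free to spread them across many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map step (hypothetical name): emit a (word, 1) pair per word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key. A distributed framework would do this over the network."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce step (hypothetical name): combine all values seen for one key."""
    return word, sum(counts)


def map_reduce(lines):
    # Each map_words call depends only on its own line, so the input could be
    # partitioned across workers without any coordination between them.
    mapped = chain.from_iterable(map_words(line) for line in lines)
    grouped = shuffle(mapped)
    # Each reduce_counts call depends only on one key's values, so keys could
    # likewise be handled by different workers.
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(map_reduce(["the quick brown fox", "the lazy dog", "The end"]))
```

The programmer only supplies the two small functions; how the grouping and scheduling happen is left entirely to the implementation, which is the freedom the message describes.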
