Toby DiPasquale on 6 Sep 2007 13:47:14 -0000


Re: [PhillyOnRails] jruby + hadoop?

On Wed, Sep 05, 2007 at 08:54:57PM -0700, Aaron Blohowiak wrote:
> That's not quite fair, Toby. I agree that Casey was wrong to suggest
> that RDBMs' text search hacks are adequate means to handle
> unstructured data, but I didn't care for your response. Casey raised a
> much larger issue that you did not address. I usually respect your
> opinion (if not your demeanor,) but I was disappointed to see you give
> in to resorting to straw-man argument instead of continuing (mostly)
> reasoned discussion.

You're right. I apologize.

> Please address what he mentions when he said:
> "BerkeleyDBHA does exactly what pgpool/pgcluster/oracle/etc all do -- it
> intercepts the sql log and splits up and sends parts out to the nodes
> for processing. What does that have to do with 'statefulness'? As far as
> I can imagine, there are two possible interpretations of 'stateless
> database': either a database that has no knowledge of which agent is
> initiating the transaction, or a database that returns the same result
> for any transaction at any time. In the former case, every database
> would be considered stateless. In the latter case, only a readonly
> database is stateless, and all others are stateful, including
> BerkeleyDBHA. Either way, judging a database as 'stateful' or
> 'stateless' seems nonsensical to me. I think the author of the link
> above would have done better with the term 'balanced' or 'load balanced'
> or 'concurrent'. The comparison between RDBMS and HTTP therefore doesn't
> make sense to me in the context of 'statelessness'."
> State vs statelessness on the DB level is tricky for me to grasp. Can
> you shed some light on the issue?

The trick to understanding the state here is that it's everywhere in a
traditional RDBMS, not just in the data. The schema and the query both
have state. The schema cannot be changed without big pain (the bigger the
data and the more tables, the more pain). As a result, the SQL queries
themselves then also carry state that limits their parallelizability. Plus,
the very power of SQL limits it in other ways, too.

Perhaps a better way to think about it is this: SQL gives you the full
range of power to query structured data in any way you choose. MapReduce
gives you a handful of operations, chosen very carefully, to query and 
filter semi-structured data. MapReduce programmers appear to be more
constrained than SQL programmers. However, those who implement these
systems are in reverse positions. SQL implementers have to implement lots
and lots of stuff and this constrains what they can really do. MapReduce
chooses a few operations and those are chosen to allow a range of
implementations (the most important of which is, of course, the
distributed one from the paper and Hadoop).
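
To make the "handful of operations" concrete, here is a minimal,
single-process sketch of the map/shuffle/reduce shape in Python, using
word counting as the job. This is only an illustration under my own
assumptions: the names map_phase, shuffle, reduce_phase and run_job are
made up for this sketch and are not Hadoop's API or the paper's code.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit (key, value) pairs for one input record; here, (word, 1)."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key; here, sum the counts."""
    return key, sum(values)


def run_job(records):
    # Each map call sees one record and nothing else, so the calls could
    # run on different machines in any order.
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    # Each reduce call sees one key and its values, so the reduce calls
    # are independent of each other as well.
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    print(run_job(["the quick brown fox", "the lazy dog"]))
```

The point of the sketch is that the programmer only writes map_phase
and reduce_phase; because neither depends on anything outside its own
arguments, whoever implements run_job is free to swap this local loop
for a distributed one without changing the job.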

I guess the underlying point here is that RDBMSs handle structured data
and are very good at that, so if you have structured data, you should use
one. However, if you don't, you should look at something different. A lot
of the overhead and complexity that comes with RDBMSs exists to enforce
the structure of the data; that is, after all, their raison d'etre.
RDBMSs will never scale to the size of a Google cluster because of how
they work, not because of what they contain. So, in the end, when we refer
to stateful versus stateless, we are talking about the database itself,
not the data inside of it.

If you'd like a pretty good book on how SQL works under the hood and
the math behind it, check out The Art of SQL by Stephane Faroult.

Toby DiPasquale