Cassius Rosenthal on 6 Sep 2007 00:12:46 -0000


Re: [PhillyOnRails] jruby + hadoop?

In my view pretty much anything beats having all your data locked up
in a schema'd but not versioned, highly stateful, monolithic DB.
I'm not sure why you would object to the stateful aspect of a DB -- either an app needs stateful data, or it does not, right?

State == bad. Whenever you can avoid state, you're a winner. Notice all
the pain of RDBMSs versus HTTP. Which of those two is stateless?

This is one thing that I'm having a hard time staying open-minded about: "all the pain of RDBMSs versus HTTP", implying that RDBMSs are a pain in the arse while HTTP is cake to work with. Procedural functions are one of the most enjoyable areas for me to code. If you find RDBMSs painful, then I'm inclined to question your prejudice.

Furthermore, I strongly object to the use of 'stateless' as a modifier for 'database'. I saw that term used here as well:

BerkeleyDBHA does exactly what pgpool/pgcluster/Oracle/etc. all do -- it intercepts the SQL log, splits it up, and sends parts out to the nodes for processing. What does that have to do with 'statefulness'?

As far as I can imagine, there are two possible interpretations of 'stateless database': either a database that has no knowledge of which agent is initiating the transaction, or a database that returns the same result for any transaction at any time. In the former case, every database would be considered stateless. In the latter case, only a read-only database is stateless, and all others are stateful, including BerkeleyDBHA. Either way, judging a database as 'stateful' or 'stateless' seems nonsensical to me. I think the author of the link above would have done better with the term 'balanced', 'load-balanced', or 'concurrent'. The comparison between RDBMS and HTTP therefore doesn't make sense to me in the context of 'statelessness'.
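To make that second interpretation concrete, here's a toy Ruby sketch (the class names are mine, purely illustrative -- not anybody's real API): a read-only store answers queries like a pure function, while a writable store's answers depend on what happened before.

```ruby
# Second reading of "stateless": same query, same answer, always.
class ReadOnlyStore
  def initialize(data)
    @data = data.freeze
  end

  def get(key)
    @data[key]
  end
end

# Any writable store is "stateful" in that reading: answers depend
# on prior transactions.
class WritableStore
  def initialize
    @data = {}
  end

  def get(key)
    @data[key]
  end

  def put(key, value)
    @data[key] = value
  end
end

frozen = ReadOnlyStore.new("a" => 1)
frozen.get("a")   # => 1, no matter what came before

mutable = WritableStore.new
mutable.get("a")  # => nil
mutable.put("a", 99)
mutable.get("a")  # => 99 -- the answer changed because of a prior call
```

By that definition essentially every database you'd bother running is stateful, which is exactly why the label seems useless to me.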

Clearly there are solutions for versioning and clustering. I've only gone through the map/reduce slides once, but it seems to me that pgpool/pgcluster take a very similar approach when they break up queries from the log, send them out to multiple servers to run, and then bring the partial results back together.

It only seems that way. The power of map/reduce is not in the parallelism,
but that it *CAN* be parallelised, every time. Mathematically, the map()
operation is both associative and commutative, which puts it squarely in
the realm of "embarrassingly parallel" operations. reduce() is not, but
using the clever key trick Google devised, you can get a reasonably
data-parallel implementation of it and scale up very well just by buying
more machines.

OK -- mathematically associative and commutative -- that is interesting. I didn't get that from the slides.
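For what it's worth, the whole pattern fits in a few lines of Ruby. This is a single-process word-count sketch (the method names are mine, not Hadoop's API), but it shows why the map phase parallelises trivially -- each record is processed independently -- and how grouping by key (the "key trick") lets each key's reduce run independently too:

```ruby
# map: emit (key, value) pairs for each input record, independently.
# Because no record's output depends on any other, this phase can run
# on as many nodes as you have records.
def map_phase(records)
  records.flat_map { |line| line.split.map { |word| [word, 1] } }
end

# shuffle: group emitted pairs by key. After this, each key's values
# can be reduced on a separate node.
def shuffle(pairs)
  pairs.group_by(&:first).transform_values { |vs| vs.map(&:last) }
end

# reduce: combine each key's values into a final result.
def reduce_phase(grouped)
  grouped.transform_values { |counts| counts.sum }
end

lines  = ["the cat sat", "the cat ran"]
counts = reduce_phase(shuffle(map_phase(lines)))
# counts => {"the"=>2, "cat"=>2, "sat"=>1, "ran"=>1}
```

The per-key reductions here (summing counts) are associative and commutative, which is what makes partial reduction on each node safe before merging.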

It's only "application agnostic" because all your
applications already are committed to SQL.
This is true -- but Rails itself is a convention-based framework. We have already surrendered to that principle, and SQL is an open standard.

Being an open standard has nothing to do with it, really. ODF is an open
standard, but I don't see too many databases trying to use that as their
on-disk format ;-)
SQL is a convention *and* an open standard.
( !sql.convention? || !sql.open_standard? ) ? seek_alternative : stay_put

Hadoop and CouchDB are not "databases" so much as they are datastores.
They are not normalized and not bound by relational set algebra
constraints. They are fundamentally geared towards semi-structured data,
whereas RDBMSs are designed for structured data. This is the big
difference. And, to your earlier point, I can't think of a "serious" app
that doesn't have semi-structured data, whereas I can think of some that
have no structured data (e.g. a search engine, Wikipedia, a blog, etc.).
I'm not really sure what you mean by that last line. Every RDBMS that I've ever worked with had string pattern matching of some sort.
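A toy Ruby sketch of the structured/semi-structured distinction being drawn above (the data is invented for illustration and this is not CouchDB's actual API): rows share one fixed schema, while documents each carry whatever fields they happen to have.

```ruby
# Structured: every record has exactly the same columns, enforced by
# the schema.
Row = Struct.new(:id, :title, :author)
posts = [
  Row.new(1, "Hello", "casey"),
  Row.new(2, "JRuby + Hadoop?", "aria")
]

# Semi-structured: each document is free-form; fields vary per record.
docs = [
  { "title" => "Hello", "tags" => %w[intro meta] },
  { "title" => "JRuby + Hadoop?",
    "comments" => [{ "by" => "casey", "text" => "..." }] }
]

# Querying structured data can rely on the shared schema:
posts.select { |p| p.author == "casey" }.map(&:title)
# => ["Hello"]

# Querying documents has to tolerate fields that may be absent:
docs.select { |d| (d["tags"] || []).include?("intro") }
    .map { |d| d["title"] }
# => ["Hello"]
```

The point being made, as I read it, isn't about string matching at all -- it's that the datastore doesn't force every record into one shape up front.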

Thanks! -Casey