Toby DiPasquale on 5 Sep 2007 21:16:22 -0000


Re: [PhillyOnRails] jruby + hadoop?

On Wed, Sep 05, 2007 at 03:15:39PM -0400, Cassius Rosenthal wrote:
> >Just what Oracle wants you to think: "This is serious app. It needs
> >serious database."

Hear, hear!!! +1.

> Well . . . yeah.  But I use postgresql.  (^_^)
> >In my view pretty much anything beats having all your data locked up
> >in a schema'd but not versioned, highly stateful, monolithic DB
> >process. 
> I'm not sure why you would object to the stateful aspect of a DB -- 
> either an app needs stateful data, or it does not, right?  

State == bad. Whenever you can avoid state, you're a winner. Notice all
the pain of RDBMSs versus HTTP. Which of those two is stateless?

> Clearly there 
> are solutions for versioning and clustering.  I've only gone through the 
> map/reduce slides once, but it seems to me that pgpool/pgcluster take 
> very similar approaches when they break up queries from the log and 
> send them out to multiple servers to run, then bring the results back 
> together for a result.

It only seems that way. The power of map/reduce is not in the parallelism,
but that it *CAN* be parallelised, every time. Each map() call depends only
on its own input, so the work can be regrouped and reordered freely (in
effect, associative and commutative), which puts it squarely in the realm
of "embarrassingly parallel" operations. reduce() is not, but
using the clever key trick Google devised, you can get a reasonably
data-parallel implementation of it and scale up very well just by buying
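
To make the key trick concrete, here is a minimal single-process sketch in
Python. It is not Hadoop's API or Google's implementation; every name in it
(map_phase, shuffle, reduce_phase, word_mapper, count_reducer, word_count)
is made up for illustration. The point is only that map() output is tagged
with a key, values are grouped by that key, and each key's group can then be
reduced independently of the others.

```python
from collections import defaultdict


def map_phase(records, mapper):
    # Each record is mapped on its own, so these calls could run
    # on different machines and in any order.
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    # Group the intermediate values by key. After this step, every
    # key's list of values can be handed to a separate reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    # One reducer call per key; the calls do not depend on each other.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    # Hypothetical mapper: emit (word, 1) for every word in a line.
    return [(word.lower(), 1) for word in line.split()]


def count_reducer(key, values):
    # Hypothetical reducer: add up the counts emitted for one word.
    return sum(values)


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines, word_mapper)), count_reducer)


if __name__ == "__main__":
    print(word_count(["state is bad", "avoid state", "bad state"]))
```

Feeding the lines in a different order gives the same counts, which is the
property that lets the map side be spread over as many machines as you like.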

> >It's only "application agnostic" because all your
> >applications already are committed to SQL.
> This is true -- but Rails itself is a convention-based framework.  We 
> have already surrendered to that principle, and SQL is an open standard.

Being an open standard has nothing to do with it, really. ODF is an open
standard, but I don't see too many databases trying to use that as their
on-disk format ;-)

Rails is based on code conventions and tries very hard to mask over the
SQL parts. As such, it should be fairly easy to replace that backend with
something else, provided that backend replicates what SQL provides.

However, and this is the key point: CouchDB and Hadoop do not replicate
what RDBMSs provide. They give you something different. Your earlier
suggestion of benchmarking map/reduce against optimized PSQL statements
is a bit off, in that you'd never actually run that test: the two are
built for different jobs.

Hadoop and CouchDB are not "databases" so much as they are datastores.
They are not normalized and not bound by relational set algebra
constraints. They are fundamentally geared towards semi-structured data,
whereas RDBMSs are designed for structured data. This is the big
difference. And, to your earlier point, I can't think of a "serious" app
that doesn't have semi-structured data, whereas I can think of some that
have no structured data (e.g. search engine, Wikipedia, a blog, etc).
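
As a rough illustration of what "semi-structured" means here, the Python
sketch below keeps records of different shapes in one collection and builds
an index over whichever records happen to carry a given field. This is not
CouchDB's or Hadoop's API; the field names and the titles_by_tag helper are
assumptions made up for the example.

```python
# Records in one collection need not share a schema: some have tags,
# some have links, some have neither.
documents = [
    {"type": "blog_post", "title": "Hello", "tags": ["ruby", "rails"]},
    {"type": "wiki_page", "title": "MapReduce", "links": ["Hadoop"]},
    {"type": "blog_post", "title": "No tags yet"},
]


def titles_by_tag(docs):
    # Index titles under each tag, silently skipping records that
    # have no "tags" field instead of requiring a fixed column.
    index = {}
    for doc in docs:
        for tag in doc.get("tags", []):
            index.setdefault(tag, []).append(doc["title"])
    return index


if __name__ == "__main__":
    print(titles_by_tag(documents))
```

In a relational schema the missing fields would have to be NULL columns or
separate join tables; here a record simply omits what it does not have.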

> XUL embraces RDF as datasources for its template engine, and I'm pretty 
> sure that I'm not alone when I opine that RDF is the biggest obstacle 
> for XUL developers.  It is as awkward as the spelling of 'awkward'.  I 
> admit that I have little experience with cases that would be best served 
> by RDF, but wouldn't those cases be contrary to MVC anyway?  When 
> inference logic is in the data tree, is the controller on a cigarette break?

I have only this to say: MVC is not the only design pattern available and
Web applications as they currently exist are not the only things one might
wish to build. This latter category includes Web applications of tomorrow.

> I'm sure there are many other grammars that would be non-trivial to 
> extract from SQL as well, but I don't see that any of them would be 
> superior for common/general use.  On the flip side, I don't see any 
> argument proving that SQL is a best-fit for common/general use either, 
> but since it is an open standard, that argument doesn't need to be 
> made.  At the very least, I can say that it is just as awkward to 
> extract SQL-like tables from RDF as it is to go in the opposite direction.
> I can say this for Oracle and Postgres: on both I have seen extremely 
> impressive number crunching, data integrity, and great flexibility in 
> data presentation.  On Oracle, the performance boosts that you can get 
> by optimizing query statistics are fascinating, to say the least.  I 
> don't see how map/reduce could even theoretically provide the same 
> degree of optimization, because when all of the parts are sent out to 
> nodes to be solved, it depends on the slowest node, not on the 
> intelligence of the master to preselect the order in which the 
> conditions should be resolved.  Good for performing a general 
> computation, but not good for retrieving data that we know something 
> about.  Of course, Google is using map/reduce -- so I'd appreciate it if 
> someone could tell me why I am wrong.

I hope to do that at my talk on the subject.

> I would like to see more examples of CouchDB being used where it clearly 
> makes more sense than an RDBMS.  Right now, I think I would want to use 
> it for quick-and-dirty Rails apps (which could be the majority of web 
> applications), but not for complex apps.

It actually wouldn't be all that great for what I think you have in mind
here. Please make sure you come to my talk on MapReduce and Hadoop.

> Would CouchDB make sense as a filesystem?


Toby DiPasquale