Mnesia Distribution Questions

Wed Oct 21 16:18:28 CEST 2009

Hi there folks,
Very quickly about me:
- I'm doing a Masters in Software Engineering
- I'm looking closely at Mnesia to involve it in my study

OK, so here's where I'm at. It's quite early on in my report, but I'm
getting a feel for what could be a useful investigation. I am briefly
discussing cloud computing, and then looking in more detail at distributed
computation and data storage (along with fault tolerance, high availability,
etc.). My supervisor and I have agreed that, to illustrate the use of a
distributed computation system, I am going to perform some large computation
on a large dataset, probably using Pig (a dataflow language on top of
Hadoop). I have been given the university cluster to deploy Hadoop on.

(Bear with me....)

So... the big question for me right now is where to find a useful
competitor (or rather, a solution with similar goals). The easy option
would be to compare Pig to another Hadoop interface, e.g. Hive, but those
results would be pretty uninteresting. So instead, I'm looking into the
realm of distributed databases. As far as I can tell, the way in which
Mnesia distributes the availability of data across nodes is similar to how
Hadoop distributes data across nodes in HDFS (the Hadoop Distributed File
System). My issue here is my lack of understanding of how a data query
computation is distributed over a network of Mnesia nodes. I have a good
understanding of how this is achieved with Hadoop (if there are 10
datanodes, then each will get a tenth of the work), but is there such a
thing as parallel query processing with Mnesia? Or is Mnesia just a way
to very quickly replicate the availability of data?

I hope that you gurus can shed some light on this for me. I'm not aware of
exactly how Mnesia would deal with a data query where the Mnesia network
consists of, say, 10 nodes. Does a user query just one of the 10, or does a
user query the network? I'm really trying to think of a fair and interesting
way to compare the concept of a distributed database (Mnesia) against a
distributed processing engine (Hadoop).

There are other things I want to delve into as well. For instance, I really
need to know more about the difference between CouchDB and Mnesia. So far, I
can only establish that CouchDB is more useful for networks where nodes are
likely to go offline at various times. (Not much knowledge!!)

If, however, comparing a distributed database engine against a distributed
processing engine is a non-starter, let me know that too!

Many thanks, I would really appreciate some feedback.

Rob Stewart