[erlang-questions] MNesia Distribution Questions

Wed Oct 21 19:00:08 CEST 2009

Hi Roberto,

Very useful links, thanks.

From that, I really just have two key questions to which I have not yet
found a clear answer.

1. An example: I have a complex query to apply to a dataset of, say,
10,000,000 rows in a Mnesia database. This database is replicated over 10
nodes in a network. Will the query be split into equal shares of computation
across the 10 nodes, or will it be executed on a single Mnesia node, either
chosen at random or explicitly selected?

2. How could a distributed database like Mnesia be compared, in detail and
in an interesting way, to a distributed processing engine like Hadoop or
Dryad? Taking the example from part 1, I know that Hadoop will map/reduce
the job across the 10 nodes for parallel processing, and Pig, Hive, or
HBase provide a database-like interface to the Hadoop data. So does
Mnesia share any common goals with Hadoop and the like?
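
To make the question in part 1 concrete, the "split the query across the
nodes" behaviour can be sketched as below. This is only an illustration in
Python, not Mnesia or Hadoop code: the function names, the row-count query,
and the use of local threads standing in for the 10 nodes are all assumptions
made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_parts):
    """Split rows into n_parts contiguous chunks of nearly equal size."""
    size, extra = divmod(len(rows), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def map_count(chunk, predicate):
    """'Map' step: count the rows in one chunk that satisfy the predicate."""
    return sum(1 for row in chunk if predicate(row))


def split_query_count(rows, predicate, n_workers=10):
    """Option A: every worker (hypothetical node) handles one share of the rows.

    Threads are only a stand-in for separate machines in this sketch.
    """
    chunks = partition(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(lambda c: map_count(c, predicate), chunks))
    # 'Reduce' step: combine the partial counts into one answer.
    return sum(partials)


def single_node_count(rows, predicate):
    """Option B: one node scans its full replica of the data by itself."""
    return map_count(rows, predicate)
```

Both functions return the same answer; the question is which of the two
shapes a replicated Mnesia table actually gives you.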

Regards,

Rob Stewart

2009/10/21 Roberto Aloi <roberto.aloi@REDACTED>

> Hi Rob,
>
> I guess the best I can do is point you to some useful resources:
>
> - http://www.erlang.se/publications/mnesia_overview.pdf
> - http://www.erlang.org/doc/apps/mnesia/Mnesia_chap1.html#1
> - http://couchdb.apache.org/
> - http://oreilly.com/catalog/9780596158163
> - http://wiki.apache.org/couchdb/Frequently_asked_questions#why_no_mnesia
>
> I advise you to take a quick look at these references and, if you are still
> in doubt, ask the mailing list.
> Hope this is useful.
>
> Best regards,
>
> Roberto Aloi
> roberto.aloi@REDACTED
> http://www.erlang-consulting.com
> ---
>
>
>
>
>
> On Oct 21, 2009, at 3:18 PM, Rob Stewart wrote:
>
>> Hi there folks,
>> Very quickly about me:
>> - I'm doing a Masters in Software Engineering
>> - I'm looking closely at Mnesia to involve it in my study
>>
>>
>> OK, so here's where I'm at. It's quite early on in my report, but I'm
>> getting a feel for what could be a useful investigation. I am briefly
>> discussing cloud computing, and then looking in more detail at distributed
>> computation and data storage (along with fault tolerance, high
>> availability, etc.). My supervisor and I have agreed that, to illustrate
>> the use of a distributed computation system, I am going to perform some
>> large computation on a large dataset, probably using Pig (a dataflow
>> language on top of Hadoop). I have been given the university cluster to
>> deploy Hadoop.
>>
>> (Bear with me....)
>>
>> So... now the big question for me right now is where to find a useful
>> competitor (or rather, a solution with similar goals). The easy option
>> would be to compare Pig to another Hadoop interface, e.g. Hive, but those
>> results would be pretty uninteresting. So instead, I'm looking into the
>> realm of distributed databases. Now... as far as I'm concerned, the way in
>> which Mnesia distributes the availability of data across nodes is similar
>> to how Hadoop distributes data across nodes in HDFS (the Hadoop file
>> system). My issue here is my lack of understanding of how a data query
>> computation is distributed over a network of Mnesia nodes. I have a good
>> understanding of how this is achieved with Hadoop (if there are 10
>> datanodes, then each will get a tenth of the work), but is there such a
>> thing as parallel query processing with Mnesia? Or... is Mnesia just a way
>> to very quickly replicate the availability of data?
>>
>> I hope that you gurus can shed some light on this for me. I'm not aware of
>> exactly how Mnesia would deal with a data query where the Mnesia network
>> consists of, say, 10 nodes. Does a user query just one of the 10, or does a
>> user query the network? I'm really trying to think of a fair and
>> interesting way to compare the concept of a distributed database (Mnesia)
>> against a distributed processing engine (Hadoop).
>>
>> There are other things I want to delve into also... For instance, I really
>> need to know more about the difference between CouchDB and Mnesia. So far,
>> I can only establish that CouchDB is more useful for networks where nodes
>> are likely to go offline at various times. (Not much knowledge!!)
>>
>> If, however, comparing a distributed database engine against a distributed
>> processing engine is a non-starter, let me know that too!!
>>
>> Many thanks, I would really appreciate some feedback.
>>
>>
>> Rob Stewart
>>
>
>