Erlang, Strings, Mnesia?

Tue Apr 13 16:39:58 CEST 2004

On Tue, 13 Apr 2004, Joe Armstrong wrote:

> On Thu, 8 Apr 2004, Jimmie Houchin wrote:
> >
> > The database will initially have 4-5 million rows/objects.
> > 
> > This data in turn will be searchable. A prime purpose of the data.
> > 
[...]

> > Would Erlang be suitable for a Google type site?
> > An Amazon? An eBay?
> 
>   Based on what you have said so far there is no way of answering your
> question.
> 
>   Before doing *any*  major bit of programming I always  try to do the
> following:
> 
> 	1) Identify the most difficult problem(s) to be solved
> 	2) Prototype solutions to the problems raised in 1)
> 	3) Measure performance of prototype
> 
>   When this has converged you can go ahead and write the program.

Well, as a general rule, I agree with Joe, but the fact remains
that Mnesia is not terribly good at handling huge volumes of
persistent data, nor at string searching, nor at storing huge
amounts of string data in memory. Mnesia's disk-based storage is
also weak compared to many of its competitors (partly because of
the wish to be able to handle variable-size records, something
most DBMSes gladly refuse to do).

For one thing, mnesia lacks support for building word indices.
This is something that you have to hack yourself if needed.
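
As a rough illustration of the kind of hack meant here, the sketch
below keeps a second table of type bag that maps each word to the ids
of the records containing it. The module, table and function names
(word_index, doc, add_doc/2, lookup/1) are made up for the example,
the tokenizer is deliberately naive, and string:lowercase/1 assumes
OTP 20 or later. It is one possible approach, not something Mnesia
provides.

```erlang
-module(word_index).
-export([init/0, add_doc/2, lookup/1]).

%% Hypothetical tables: the documents themselves, plus a bag table
%% holding one {Word, DocId} entry per distinct word in each document.
-record(doc, {id, text}).
-record(word_index, {word, doc_id}).

%% Start Mnesia and create both tables as RAM copies on this node.
init() ->
    ok = mnesia:start(),
    {atomic, ok} = mnesia:create_table(doc,
        [{attributes, record_info(fields, doc)}]),
    {atomic, ok} = mnesia:create_table(word_index,
        [{type, bag},
         {attributes, record_info(fields, word_index)}]),
    ok.

%% Store a document and its index entries in one transaction.
add_doc(Id, Text) when is_binary(Text) ->
    Words = lists:usort(words(Text)),
    F = fun() ->
            ok = mnesia:write(#doc{id = Id, text = Text}),
            lists:foreach(
              fun(W) ->
                  ok = mnesia:write(#word_index{word = W, doc_id = Id})
              end, Words)
        end,
    {atomic, ok} = mnesia:transaction(F),
    ok.

%% Return the ids of all documents containing Word (case-insensitive).
lookup(Word) when is_binary(Word) ->
    Key = string:lowercase(Word),
    [Id || #word_index{doc_id = Id} <- mnesia:dirty_read(word_index, Key)].

%% Naive tokenizer: lowercase, then split on whitespace and punctuation.
words(Text) ->
    Seps = [<<" ">>, <<"\t">>, <<"\r">>, <<"\n">>, <<".">>,
            <<",">>, <<";">>, <<":">>, <<"!">>, <<"?">>],
    binary:split(string:lowercase(Text), Seps, [global, trim_all]).
```

With this in place, word_index:lookup(<<"mnesia">>) is a key lookup in
the bag table rather than a scan over every stored string. The price
is one extra record per distinct word per document, which matters at
the volumes discussed here.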

Mnesia is very good at:

- handling many tables; AXD 301 has > 500 tables in mnesia
- fast access (from Erlang code) to memory-resident data
- distributed processing and fault tolerance
- scalability and low latency in transactions on 
  horizontally fragmented data
- on-line reconfiguration, including in-service upgrade

I'm sure I've forgotten some things.

I know of no one who's actually run a mnesia database containing
6 GB of data. If you were to try it, you should try to partition
the data into multiple tables, and if you still have very large
tables, try to spread them across multiple processors in smaller
fragments (but only if you can devise a smart access scheme to
avoid whole-table searches.)
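
For the fragmentation part, a minimal sketch using Mnesia's table
fragmentation support (the mnesia_frag access module) might look like
the following. The module name frag_demo, the item table and the
store/2 and fetch/1 helpers are invented for the example, and the node
pool is just the local node. A real setup would list several nodes and
choose the number of fragments after measuring.

```erlang
-module(frag_demo).
-export([init/1, store/2, fetch/1]).

%% Hypothetical record; the key is hashed to pick a fragment.
-record(item, {key, value}).

%% Create the table split into NFrags fragments, each kept as one
%% RAM copy somewhere in the node pool (here only the local node).
init(NFrags) ->
    ok = mnesia:start(),
    {atomic, ok} = mnesia:create_table(item,
        [{attributes, record_info(fields, item)},
         {frag_properties, [{n_fragments, NFrags},
                            {node_pool, [node()]},
                            {n_ram_copies, 1}]}]),
    ok.

%% All access goes through mnesia:activity/4 with the mnesia_frag
%% access module, which routes each key to its fragment.
store(Key, Value) ->
    mnesia:activity(transaction,
        fun() -> mnesia:write(#item{key = Key, value = Value}) end,
        [], mnesia_frag).

fetch(Key) ->
    mnesia:activity(transaction,
        fun() -> mnesia:read({item, Key}) end,
        [], mnesia_frag).
```

Reads and writes by key touch only one fragment. Anything that cannot
be expressed as a key lookup still has to visit every fragment, which
is why the access scheme matters more than the fragmentation itself.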

My own personal record is a mnesia database totalling > 750 MB,
split into 175 tables, where the largest table contains 1,750,000
records, for a total of 5.8 million records. This seems to work
well, but I've done several rewrites to optimize the access
patterns, and I try to store as much string data as possible as
atoms (in some cases as binaries) rather than as strings. This
works, since I mostly match on whole strings. The Erlang VM plods
along rather happily at 1.3 GB of RAM usage.  (:
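
To give a feel for why the representation matters, here is a small
sketch that reports the footprint of the same text as a list string,
a binary and an atom. The module name str_size and the function
compare/1 are made up, erts_debug:flat_size/1 is an internal debugging
helper rather than a supported API, and the input is assumed to be
Latin-1 text of at most 255 characters so that it fits in an atom.

```erlang
-module(str_size).
-export([compare/1]).

%% Footprint of one piece of text in three representations.
compare(Text) when is_list(Text) ->
    [%% A list string costs one cons cell (two words) per character.
     {list_words, erts_debug:flat_size(Text)},
     %% A binary holds one byte per Latin-1 character, plus a header.
     {binary_bytes, byte_size(list_to_binary(Text))},
     %% An atom is an immediate value: no heap words per reference.
     {atom_words, erts_debug:flat_size(list_to_atom(Text))}].
```

For an 11-character string this reports 22 heap words for the list
(176 bytes on a 64-bit VM), 11 bytes of payload for the binary and 0
heap words for the atom. The catch with atoms is that they are never
garbage collected and the atom table has a fixed size limit (1,048,576
entries by default, adjustable with the +t emulator flag), so this
only works when the set of distinct strings is bounded.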

(For a while, I went with a file system database, since I doubted
mnesia's ability to cope with the volumes, but ran out of inodes.
After that, I switched to mnesia, and it has worked well so far.)

Finding a good way to cram the amount of data you're talking 
about into mnesia is surely, to cite Joe, one of the HARD bits.

Regarding multi-component environments, well, Joe's argument
is certainly not an argument for dropping Erlang out of the 
equation simply because you might have to go with another 
DBMS. The mainstream way of doing this is certainly to go
with a multi-tool combo - say PHP, Perl and MySQL
(shudder).

/Uffe