Erlang, Strings, Mnesia?

Ulf Wiger ulf.wiger@REDACTED
Thu Apr 8 23:43:20 CEST 2004


On Thu, 08 Apr 2004 09:53:02 -0500, Jimmie Houchin <jhouchin@REDACTED> 
wrote:

> Hello,
>
> I am trying to understand whether or not Erlang is right or a good 
> option for a website I am building.
>
> Currently I am processing 422,000 files with 6.6gb of text to place in a 
> database. I will have to regularly parse and process large volumes of 
> textual data.

I guess that MySQL would do a better job in this case.


> The database will initially have 4-5 million rows/objects.

I've had mnesia databases with >10 million objects. The number isn't
really the issue, but the memory requirements are. If the database can't
fit in primary memory (which in practice means no more than, say, 3 GB
of data, since Erlang currently only addresses up to 4 GB), then you
face the problem that mnesia's disk storage was not designed to handle
such large volumes.

Also, text expands 8x if you represent it as a list of integers
(the normal "string representation" in Erlang), so if you want to
squeeze the data, you'd have to store the strings as

- atoms, might work if you know that there's a finite string space
   (the atom table is not garbage collected.) Also, you must convert
   the atoms to strings (atom_to_list()) before you can analyse them.
- binaries, which means no space explosion, and you can still
   process the binaries directly using the bit syntax, but debugging
   becomes a bit more awkward; I also have doubts about the GC
   characteristics if you really go overboard using gigabytes in
   millions of binaries.
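
A minimal sketch of the size difference described above. The module
name string_size and its function names are made up for illustration.
erts_debug:flat_size/1 reports the heap size of a term in machine
words; a list cell takes two words, which gives the 8x figure on a
32-bit emulator (and 16x on a 64-bit one). The last function shows one
way to walk a binary with the bit syntax without first converting it
to a list.

```erlang
-module(string_size).
-export([list_bytes/1, binary_bytes/1, count_byte/2]).

%% Heap bytes used by a string stored as a list of integers.
%% Each character costs one list cell, i.e. two machine words.
list_bytes(Str) when is_list(Str) ->
    erts_debug:flat_size(Str) * erlang:system_info(wordsize).

%% Payload bytes of the same text stored as a binary
%% (one byte per character for Latin-1 text, header not counted).
binary_bytes(Str) when is_list(Str) ->
    byte_size(list_to_binary(Str)).

%% Count occurrences of one byte by matching directly on the binary.
count_byte(Bin, Byte) when is_binary(Bin) ->
    count_byte(Bin, Byte, 0).

count_byte(<<B, Rest/binary>>, Byte, N) when B =:= Byte ->
    count_byte(Rest, Byte, N + 1);
count_byte(<<_, Rest/binary>>, Byte, N) ->
    count_byte(Rest, Byte, N);
count_byte(<<>>, _Byte, N) ->
    N.
```

For example, string_size:list_bytes("hello") is 10 words times the
word size, while string_size:binary_bytes("hello") is 5.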


> This data in turn will be searchable. A prime purpose of the data.

Assuming the data could fit in ets tables, the ets:select() feature
is powerful, but mainly for structured, symbolic data. It gives
little support for string processing. This means that for partial
string matching, you generally need to read every object and scan
it using a regular Erlang program.
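
To illustrate the contrast, here is a small sketch. The table layout
{Id, Author, Text} and the module name ets_scan are assumptions made
up for this example. A lookup on a whole field can be written as a
match specification for ets:select/2, whereas a substring search has
to visit every object, here with ets:foldl/3 and binary:match/2.

```erlang
-module(ets_scan).
-export([by_author/2, containing/2, demo/0]).

%% Structured lookup: the match specification selects on a whole field.
by_author(Tab, Author) ->
    ets:select(Tab, [{{'$1', Author, '_'}, [], ['$1']}]).

%% Partial string match: read every object and scan its text.
%% Needle must be a non-empty binary.
containing(Tab, Needle) ->
    ets:foldl(fun({Id, _Author, Text}, Acc) ->
                      case binary:match(Text, Needle) of
                          nomatch -> Acc;
                          _Found -> [Id | Acc]
                      end
              end, [], Tab).

%% Build a tiny table and run both kinds of query against it.
demo() ->
    Tab = ets:new(docs, [set]),
    true = ets:insert(Tab, [{1, ulf, <<"mnesia and ets tables">>},
                            {2, jimmie, <<"parsing large text files">>},
                            {3, ulf, <<"binaries and the bit syntax">>}]),
    ByAuthor = lists:sort(by_author(Tab, ulf)),
    Hits = lists:sort(containing(Tab, <<"text">>)),
    true = ets:delete(Tab),
    {ByAuthor, Hits}.
```

The cost of containing/2 grows with the total amount of text in the
table, which is the concern for a multi-gigabyte data set.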

Not sure which type of database manager would be best. There are some
tools that build indices on words. One database manager that does
this is FileMaker, which also has some nice web support, and some
(fairly limited) relational functionality. The feature that it
indexes on every word in every table field is sometimes annoying, but
it makes for some lightning-fast keyword searches.
The newest version of FileMaker supports 2GB per field and 8TB per
database. http://www.pcmag.com/article2/0,1759,1548051,00.asp
Note, though, that it's not free. (I have no personal interest
in FileMaker, except that I used it in a previous life, and
liked it very much.)
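
The idea of an index on words can be sketched in a few lines of
Erlang. This is a toy, not a description of how FileMaker or any other
product works: the module name word_index is made up, words are split
on single spaces only, and there is no case folding or punctuation
handling. Each {Word, Id} pair goes into an ets table of type bag, so
a keyword search becomes a key lookup instead of a scan.

```erlang
-module(word_index).
-export([new/0, add/3, lookup/2, demo/0]).

%% A bag table allows many {Word, Id} entries under the same word.
new() ->
    ets:new(word_index, [bag]).

%% Index every distinct word of Text under the document Id.
add(Index, Id, Text) when is_binary(Text) ->
    Words = binary:split(Text, <<" ">>, [global, trim_all]),
    true = ets:insert(Index, [{W, Id} || W <- lists:usort(Words)]),
    ok.

%% Return the sorted ids of all documents containing Word.
lookup(Index, Word) when is_binary(Word) ->
    lists:sort([Id || {_W, Id} <- ets:lookup(Index, Word)]).

%% Index two short documents and look up three keywords.
demo() ->
    Index = new(),
    ok = add(Index, 1, <<"erlang strings are lists">>),
    ok = add(Index, 2, <<"mnesia stores erlang terms">>),
    Result = {lookup(Index, <<"erlang">>),
              lookup(Index, <<"mnesia">>),
              lookup(Index, <<"mysql">>)},
    true = ets:delete(Index),
    Result.
```

The index itself takes memory on top of the text, so the sizing
concerns above apply to it as well.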


> Would Erlang be suitable for a Google type site?
> An Amazon? An eBay?

Probably not.

/Uffe
-- 
Ulf Wiger



