Handling huge amounts of data

Jay Nelson jay@REDACTED
Thu Jun 5 14:00:08 CEST 2003


At 08:40 AM 6/5/03 +0200, you wrote:
>Well, the data is a list (between 500,000 and 1,500,000 items) of chunks
>of data that may be lists of bytes or binaries, 27 bytes each. For each
>of them, I need to compare it with all the others in a non-trivial way
>(some 1,000 similar tests), and select only those that pass the tests.

Hmm, interesting that it is always the same size.  Can you say more
about the test?  Is there any similarity in the data content?  Any way
to compress the data that is related to the search mechanism?  Does
the problem map to other similar problems?  How many search requests
do you get and how often?  How long do you have to respond (latency
and thruput)?

>I tried several ways of storing the data in memory (as a list of tuples,
>a list of binaries, or an ets table), and in all cases the VM becomes
>slower and slower until it breaks.

Lists and tuples should be roughly equivalent.  64-bit mode will only make
the data set larger, since every term word doubles from 4 to 8 bytes;
if you use binaries it makes no difference, because binary payloads are
stored as raw bytes either way.
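
That representation choice matters here: as a list of bytes, a 27-byte
chunk costs 27 two-word cons cells (216 bytes on a 32-bit VM, twice that
on 64-bit), while a binary costs its 27 bytes plus a small header.  A
minimal sketch, assuming the data can be read in as one large binary
(an exact multiple of 27 bytes) and cut into sub-binaries that share
the parent's storage:

    -module(chunker).
    -export([chunks/1]).

    %% Cut a big binary into 27-byte sub-binaries.  split_binary/2
    %% returns sub-binaries that reference the original storage, so
    %% the per-chunk cost is a few header words, not 2 words per byte.
    chunks(Bin) when size(Bin) >= 27 ->
        {Chunk, Rest} = split_binary(Bin, 27),
        [Chunk | chunks(Rest)];
    chunks(<<>>) ->
        [].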

You may need to find a way to do one or more of the following:

1) Reduce the data set size (through compression, functional simulation,
        process partitioning, etc.)
2) Reduce the number of compares (have you looked at "dynamic
        programming"?)
3) Translate the data / problem to a different representation
4) Use an external app and connect via a port (see the sketch below)
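
On (4), a minimal sketch of talking to an external filter over a port,
assuming a hypothetical executable ./filter that reads and writes
2-byte-length-prefixed packets:

    -module(extfilter).
    -export([start/0, check/2]).

    %% Spawn the external program.  {packet, 2} makes the VM frame
    %% every message with a 2-byte length prefix in both directions.
    start() ->
        open_port({spawn, "./filter"}, [{packet, 2}, binary]).

    %% Send one chunk and wait for the program's verdict.
    check(Port, Chunk) when is_binary(Chunk) ->
        Port ! {self(), {command, Chunk}},
        receive
            {Port, {data, Reply}} -> Reply
        after 5000 ->
            timeout
        end.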

Erlang is good for development efficiency and code clarity, but not for
memory efficiency.



