[erlang-questions] ets vs list

Tue Sep 14 02:40:38 CEST 2010

On 13 September 2010 21:05, Morten Krogh <mk@REDACTED> wrote:
>
>> No, there are several other overheads. For instance, you must somehow
>> implement a hash table from immutable dynamically typed data structures,
>> whereas in ETS (in C), you can use mutable arrays. The dict module,
>> for example, uses a tree structure of tuples. Once you've located
>> the leaf tuple where the object is to reside, you have to create
>> a new copy of that tuple, which means you must also update (copy)
>> the parent tuple, etc.
>
> But it is to some extent self imposed that you want a functional data
> structure.
> You could have used the process dictionary for this or some foreign
> interface.
> Nowadays nif, but that was not available to you then.
>
> Why didn't you use the process dictionary? It is made for this situation.

The problem comes back to memory management and garbage collection. A
process heap is not collected incrementally but in one go. During that
time the cpu/core/thread running that process will stop. This is no
problem as most process heaps are not large and there is no noticeable
pause. Also the collector is optimised to keep these times down. It
has to be done in this way as there no other way to detect when terms
are free except by scanning the heap.*

In an ets table heap however, as you do explicit insert and delete,
collection takes part every time you do a delete so there are no long
pauses at all. This is irrespective of how big the ets table is, which
means is to possible to have LARGE tables without any pause times for
collection. But this also means that the data in an ets table cannot
exist in a normal process heap but must be separate from it. This is
why you get the copying of data between processes and ets tables. This
is why you have ets:match and ets:select which allow you to do more
checking of data in the ets pseudo-process before copying it over to a
process heap, which is a big win when scanning a table.

Now using the process dictionary would allow you to use the same
algorithms for managing the table itself as with ets tables but it
doesn't solve the big problem of where to store the actual data.
Either you keep it in the process heap (which is done today) but then
incur the all problems with having LARGE process heaps, or you keep it
in a separate heap but then incur all the problems with copying. In
either case there is not a clear benefit in using the process
dictionary.

You get the same problem if you were to try to use NIFs. Where do you
put the data? The process heap should only contain valid erlang terms
and does not support mutable data, which limits how much smarter you
can actually make it by not using plain erlang. Though you could
probably improve dict by adding some NIFs, definitely something to
consider for the future.

There is another point as well and that is that using the process
dictionary gives the code side effects and makes it non-functional.
Ah, you might say, but does not using ets do the same thing? Yes, but
ets tables behave as processes (and can be emulated using processes)
which confines all side effects to inter-process communication. Which
is a Good Thing. If this worries you is another thing but it is nice
to be able to follow the flow of state through your code as much as
possible.

As usual it all boils down to which trade-offs you are willing to
make. Ets tables and dict/gb_trees give you two different choices with
different properties.

Sorry this became a bit longer than I had originally intended.

Robert

* Yes, there are garbage collection algorithms that incrementally
collect data which would allow you to have LARGE process heaps without
incurring long pause for collection. These are, however, more complex
and costly and it is not clear how much you actually gain by using
them. Multi-core environments don't help either in this respect.