[erlang-patches] Improving ETS performances

Sverker Eriksson <>
Fri Nov 29 16:22:30 CET 2013


This patch has now been discussed by OTP technical board.

It's a desirable feature but the patch is not acceptable as-is,
mainly due to too large memory overhead per object.

We would need

* No extra overhead for table type 'set'.
* Minimized overhead per object for 'bag' and 'duplicate_bag',
   especially when the nested hash feature is not used.

Another weakness is the lack of yielding when long key chains are
traversed. BIFs should not execute too long but rather yield by
returning back to the emulator loop to be continued later by the
scheduler (as is done by the ets:select operations).
The patch removes this problem for unrelated keys but it still arise
at lookup/delete on a lot of objects with same key.
If an otherwise accepted patch is supplied we can undertake the job
to fix the yielding. It can be good to have in mind though while doing
the design.

Some tips and tricks to minimize memory overhead:
(some of them are probably mutually exclusive)

* Use lower bits in pointers as flags.
* Combine next and kprevious as only one of them is used.
* Move nkitems and lht into a separate root struct that is allocated
   when the first duplicate key is inserted.
* Move knext and kprevious to NestedDbTerm to only allocate them when 
* Use really dirty trick to allocate a mutilated struct with an unused 

typedef struct {
     HashDbTerm* previous;  /* NOT ACCESSABLE FOR SET */
     HashDbTerm* next;
     HashValue  hvalue;
     DbTerm dbterm;

/* For bags */
ptr = malloc(sizeof(HashDbTerm));

/* For sets */
ptr = malloc(sizeof(HashDbTerm) - offsetof(HashDbTerm,next));
ptr = (char*)ptr - offsetof(HashDbTerm,next);

ptr->next = ...;
ptr->hvalue = ...;
ptr->dbterm = ...;

ptr->previous = NULL; /* MEMORY CORRUPTION!!! */

free((char*)ptr + offsetof(HashDbTerm,next));

Maybe there are more sensible ways to accomplish the same thing.

Erlang/OTP @ Ericsson

On 09/10/2013 12:22 PM, Francesco Lattanzio wrote:
> G'day people,
>      I would like to share with you an "improvement" patch, in the hope
> that some day it will be included in the official releases.
> For this is going to be a long e-mail, I have split it into two section,
> a brief one for those in a hurry and a detailed one (an EEPlight) for
> those who want/need to know more.
> NOTE: This e-mail contain some ASCII graphics, please read it using a
>        constant width font.
> The patch targets ETS tables of the 'bag' and 'duplicate_bag' types,
> improving performances of deletion by objects (i.e., ets:delete_object/2
> not ets:delete/2) and insertion when working with many object sharing
> the same key.
> The patch is available at:
> git fetch https://github.com/fltt/otp.git ets_nested_lht
> https://github.com/fltt/otp/compare/erlang:maint...ets_nested_lht
> https://github.com/fltt/otp/compare/erlang:maint...ets_nested_lht.patch
> To get a feel of the performance gain you can run the following snippet
> of code both before and after applying the patch:
> -module(ets_hash_test).
> -export([test2/1]).
> write_loop(N, Tab)
>    when N > 0 ->
>      ok=mnesia:activity(sync_dirty,
>                         fun() ->
>                                 mnesia:write(Tab, {Tab, N, erlang:integer_to_list(N), N rem 10}, write)
>                         end,
>                         mnesia_frag),
>      write_loop(N-1, Tab);
> write_loop(_, _) ->
>      ok.
> verify_loop(N, Tab)
>    when N > 0 ->
>      Expected={Tab, N, erlang:integer_to_list(N), N rem 10},
>      [Expected]=
>          mnesia:activity(sync_dirty,
>                          fun() ->
>                                  mnesia:read(Tab, N, read)
>                          end,
>                          mnesia_frag),
>      verify_loop(N-1, Tab);
> verify_loop(_, _) ->
>      ok.
> test2(TabSize) ->
>      TabName=frag_test,
>      ThisNode=erlang:node(),
>      ok=mnesia:start(),
>      {atomic, ok}=
>          mnesia:create_table(TabName, [{type, set},
>                                        {attributes, [key, datum1, datum2]},
>                                        {frag_properties, [{n_fragments, 1},
>                                                           {node_pool, [ThisNode]}]}]),
>      ok=write_loop(TabSize, TabName),
>      {Time, {atomic, ok}}=
>          timer:tc(fun() ->
>                           mnesia:change_table_frag(TabName, {add_frag, [ThisNode]})
>                   end),
>      ok=verify_loop(TabSize, TabName),
>      stopped=mnesia:stop(),
>      {ok, Time}.
> On my own PC I obtain the following results:
> Before:
>      test2(10000) -> {ok, 1691416}
>      test2(100000) -> {ok, 538015238}
> After:
>      test2(10000) -> {ok, 55239}
>      test2(100000) -> {ok, 745721}
> In other words, this patch is still a work in progress and not
> production ready.
> For more details read on.
> Why this new feature?
> First of all this is not a new feature -- from the point of view of the
> developer there are no new modules, functions, language constructs nor
> options.
> It all started by the need to dynamically add fragments to a NON-EMPTY
> fragmented table (using the mnesia_frag access module).
> As you know, a fragment with as little as 100000 takes far more than
> tollerable to be split.
> Analyzing mnesia's source code, it came out that the culprit is the
> temporary storage of all the write operations under the same key in an
> ETS table of the bag type.
> ETS of the set, bag and duplicate_bag types use linear hashing to store
> objects, thus inserting objects _with different keys_ in an ETS has O(1)
> time complexity.
> However, if all the objects inserted have the same key then time
> complexity changes to O(k), where k is the number of objects sharing the
> same key.
> This is so for, in the current implementation, objects with the same
> keys are all put in a simply linked list and when looking for duplicates
> the whole list must be traversed.
> Risks/uncertain artifacts
> None, as far I can tell _now_.
> However this patch is still experimental and I have tested it only under
> Linux (ArchLinux).
> Moreover, it should be optimized to reduce memory consuption, and some
> code clean-up wouldn't harm.
> How did I solve it?
> Nested linear hash tables. But first some internal data structures must
> be reorganized.
> Reorganization is achieved by use of "bidimensional lists", i.e.,
> something like this:
>                 NULL            NULL            NULL
>                  ^     kprevious ^     kprevious ^
>        kprevious |               |               |
> ----------+   +------+  next  +------+  next  +------+ next
> bucket[i] |-->| Obj1 |------->| Obj3 |------->| Obj4 |----->NULL
> ----------+   +------+        +------+        +------+
>                  ^  | knext   knext |            ^  | knext
>        kprevious |  V               V  kprevious |  V
>                +------+            NULL        +------+
>                | Obj2 |                        | Obj5 |
>                +------+                        +------+
>                     | knext                         | knext
>                     V                               V
>                    NULL                            NULL
> All the ObjXs use the same data structure (a modified HashDbTerm).
> Obj1 and Obj2 share the same key, as well as Obj4 and Obj5 do.
> Obj2 and Obj5's next pointers are unused.
> In the code Obj1, Obj3 and Obj4 are also referenced as "root nodes".
> Why the doubly linked list? As the "next chain" is traversed
> sequentially, adding and removing objects is easy thing.
> Whereas, to achieve the O(1) complexity, the "kprevious/knext chain"
> must be accessible randomly (we'll see how in a moment).
> Also, the order of insertion must be preserved, thus we cannot throw
> away the linked list and substitute it with a hash table.
> Now we are ready to introduce the nested linear hash table.
> When the kprevious/knext chain reach some minimum length
> (DB_BAG_LHT_LEN, which I arbitrarily set at 128), an empty linear hash
> table is created and a pointer to it is installed in the root node (Obj1
> or Obj4, not Obj2 nor Obj5).
> Then it is filled with references to the objects of the kprevious/knext
> chain, keyed with the hash value of the whole objects (Obj1 and Obj2 in
> Obj1's nested LHT or Obj4 and Obj5 in Obj4's nested LHT).
> By now it should be clear how things work: whenever an object is looked
> for, first the next chain is traversed to find the key, then, if a
> nested LHT is installed in the root node found, a lookup is performed on
> the hash value of the object.
> If there's no nested LHT, a sequential traversal is performed.
> Sorry, I exceeded the 200 words limit, but I thing that that is the
> minimum required to understand the whys and hows of this patch.
> There is more to be said, however the source code is more appropriate to
> describe all the minute details of the implementation.
> Also, note that the patch introduces some specially marked comments, like
> this:
> /* ??? some question ??? */
> These are questions and doubts I couldn't answer by myself and that I
> ask you to find an appropriate answer.
> Comments are welcome, especially on how to optimize for speed and
> memory consumption.
> Please, don't hesitate to contact me for further details.
> Have a nice day.

More information about the erlang-patches mailing list