[erlang-patches] Improving ETS performances

Wed Sep 11 09:24:37 CEST 2013

On 09/10/2013 12:22 PM, Francesco Lattanzio wrote:
> G'day people,
>      I would like to share with you an "improvement" patch, in the hope
> that some day it will be included in the official releases.
> For this is going to be a long e-mail, I have split it into two section,
> a brief one for those in a hurry and a detailed one (an EEPlight) for
> those who want/need to know more.
>
> NOTE: This e-mail contain some ASCII graphics, please read it using a
>        constant width font.
>
> BRIEF SECTION
>
> The patch targets ETS tables of the 'bag' and 'duplicate_bag' types,
> improving performances of deletion by objects (i.e., ets:delete_object/2
> not ets:delete/2) and insertion when working with many object sharing
> the same key.
>
> The patch is available at:
>
> git fetch https://github.com/fltt/otp.git ets_nested_lht
>
> https://github.com/fltt/otp/compare/erlang:maint...ets_nested_lht
> https://github.com/fltt/otp/compare/erlang:maint...ets_nested_lht.patch
>
> To get a feel of the performance gain you can run the following snippet
> of code both before and after applying the patch:
>
>
> -module(ets_hash_test).
> -export([test2/1]).
>
> write_loop(N, Tab)
>    when N>  0 ->
>      ok=mnesia:activity(sync_dirty,
>                         fun() ->
>                                 mnesia:write(Tab, {Tab, N, erlang:integer_to_list(N), N rem 10}, write)
>                         end,
>                         mnesia_frag),
>      write_loop(N-1, Tab);
> write_loop(_, _) ->
>      ok.
>
> verify_loop(N, Tab)
>    when N>  0 ->
>      Expected={Tab, N, erlang:integer_to_list(N), N rem 10},
>      [Expected]=
>          mnesia:activity(sync_dirty,
>                          fun() ->
>                                  mnesia:read(Tab, N, read)
>                          end,
>                          mnesia_frag),
>      verify_loop(N-1, Tab);
> verify_loop(_, _) ->
>      ok.
>
> test2(TabSize) ->
>      TabName=frag_test,
>      ThisNode=erlang:node(),
>      ok=mnesia:start(),
>      {atomic, ok}=
>          mnesia:create_table(TabName, [{type, set},
>                                        {attributes, [key, datum1, datum2]},
>                                        {frag_properties, [{n_fragments, 1},
>                                                           {node_pool, [ThisNode]}]}]),
>      ok=write_loop(TabSize, TabName),
>      {Time, {atomic, ok}}=
>          timer:tc(fun() ->
>                           mnesia:change_table_frag(TabName, {add_frag, [ThisNode]})
>                   end),
>      ok=verify_loop(TabSize, TabName),
>      stopped=mnesia:stop(),
>      {ok, Time}.
>
>
> On my own PC I obtain the following results:
>
> Before:
>      test2(10000) ->  {ok, 1691416}
>      test2(100000) ->  {ok, 538015238}
>
> After:
>      test2(10000) ->  {ok, 55239}
>      test2(100000) ->  {ok, 745721}
>
> A last note: THIS IS EXPERIMENTAL CODE, USE IT AT YOUR OWN RISK.
> In other words, this patch is still a work in progress and not
> production ready.
> For more details read on.
>
> DETAILED SECTION (EPPlight)
>
> Why this new feature?
>
> First of all this is not a new feature -- from the point of view of the
> developer there are no new modules, functions, language constructs nor
> options.
>
> It all started by the need to dynamically add fragments to a NON-EMPTY
> fragmented table (using the mnesia_frag access module).
> As you know, a fragment with as little as 100000 takes far more than
> tollerable to be split.
>
> Analyzing mnesia's source code, it came out that the culprit is the
> temporary storage of all the write operations under the same key in an
> ETS table of the bag type.
> ETS of the set, bag and duplicate_bag types use linear hashing to store
> objects, thus inserting objects _with different keys_ in an ETS has O(1)
> time complexity.
> However, if all the objects inserted have the same key then time
> complexity changes to O(k), where k is the number of objects sharing the
> same key.
> This is so for, in the current implementation, objects with the same
> keys are all put in a simply linked list and when looking for duplicates
> the whole list must be traversed.
>
> Risks/uncertain artifacts
>
> None, as far I can tell _now_.
> However this patch is still experimental and I have tested it only under
> Linux (ArchLinux).
> Moreover, it should be optimized to reduce memory consuption, and some
> code clean-up wouldn't harm.
>
> How did I solve it?
>
> Nested linear hash tables. But first some internal data structures must
> be reorganized.
> Reorganization is achieved by use of "bidimensional lists", i.e.,
> something like this:
>
>                 NULL            NULL            NULL
>                  ^     kprevious ^     kprevious ^
>        kprevious |               |               |
> ----------+   +------+  next  +------+  next  +------+ next
> bucket[i] |-->| Obj1 |------->| Obj3 |------->| Obj4 |----->NULL
> ----------+   +------+        +------+        +------+
>                  ^  | knext   knext |            ^  | knext
>        kprevious |  V               V  kprevious |  V
>                +------+            NULL        +------+
>                | Obj2 |                        | Obj5 |
>                +------+                        +------+
>                     | knext                         | knext
>                     V                               V
>                    NULL                            NULL
>
> All the ObjXs use the same data structure (a modified HashDbTerm).
> Obj1 and Obj2 share the same key, as well as Obj4 and Obj5 do.
> Obj2 and Obj5's next pointers are unused.
> In the code Obj1, Obj3 and Obj4 are also referenced as "root nodes".
>
> Why the doubly linked list? As the "next chain" is traversed
> sequentially, adding and removing objects is easy thing.
> Whereas, to achieve the O(1) complexity, the "kprevious/knext chain"
> must be accessible randomly (we'll see how in a moment).
> Also, the order of insertion must be preserved, thus we cannot throw
> away the linked list and substitute it with a hash table.
>
> Now we are ready to introduce the nested linear hash table.
> When the kprevious/knext chain reach some minimum length
> (DB_BAG_LHT_LEN, which I arbitrarily set at 128), an empty linear hash
> table is created and a pointer to it is installed in the root node (Obj1
> or Obj4, not Obj2 nor Obj5).
> Then it is filled with references to the objects of the kprevious/knext
> chain, keyed with the hash value of the whole objects (Obj1 and Obj2 in
> Obj1's nested LHT or Obj4 and Obj5 in Obj4's nested LHT).
>
> By now it should be clear how things work: whenever an object is looked
> for, first the next chain is traversed to find the key, then, if a
> nested LHT is installed in the root node found, a lookup is performed on
> the hash value of the object.
> If there's no nested LHT, a sequential traversal is performed.
>
> Sorry, I exceeded the 200 words limit, but I thing that that is the
> minimum required to understand the whys and hows of this patch.
>
> There is more to be said, however the source code is more appropriate to
> describe all the minute details of the implementation.
> Also, note that the patch introduces some specially marked comments, like
> this:
>
> /* ??? some question ??? */
>
> These are questions and doubts I couldn't answer by myself and that I
> ask you to find an appropriate answer.
>
> Comments are welcome, especially on how to optimize for speed and
> memory consumption.
> Please, don't hesitate to contact me for further details.
>
> Have a nice day.
>
Hello Francesco,
I've fetched your patch and assigned it to be reviewed by the 
responsible team.
Thanks,

-- 

BR Fredrik Gustafsson
Erlang OTP Team