[erlang-patches] Improving ETS performances

Tue Sep 10 12:22:35 CEST 2013

G'day people,
    I would like to share with you an "improvement" patch, in the hope
that some day it will be included in the official releases.
For this is going to be a long e-mail, I have split it into two section,
a brief one for those in a hurry and a detailed one (an EEPlight) for
those who want/need to know more.

NOTE: This e-mail contain some ASCII graphics, please read it using a
      constant width font.

BRIEF SECTION

The patch targets ETS tables of the 'bag' and 'duplicate_bag' types,
improving performances of deletion by objects (i.e., ets:delete_object/2
not ets:delete/2) and insertion when working with many object sharing
the same key.

The patch is available at:

git fetch https://github.com/fltt/otp.git ets_nested_lht

https://github.com/fltt/otp/compare/erlang:maint...ets_nested_lht
https://github.com/fltt/otp/compare/erlang:maint...ets_nested_lht.patch

To get a feel of the performance gain you can run the following snippet
of code both before and after applying the patch:

-module(ets_hash_test).
-export([test2/1]).

write_loop(N, Tab)
  when N > 0 ->
    ok=mnesia:activity(sync_dirty,
                       fun() ->
                               mnesia:write(Tab, {Tab, N, erlang:integer_to_list(N), N rem 10}, write)
                       end,
                       mnesia_frag),
    write_loop(N-1, Tab);
write_loop(_, _) ->
    ok.

verify_loop(N, Tab)
  when N > 0 ->
    Expected={Tab, N, erlang:integer_to_list(N), N rem 10},
    [Expected]=
        mnesia:activity(sync_dirty,
                        fun() ->
                                mnesia:read(Tab, N, read)
                        end,
                        mnesia_frag),
    verify_loop(N-1, Tab);
verify_loop(_, _) ->
    ok.

test2(TabSize) ->
    TabName=frag_test,
    ThisNode=erlang:node(),
    ok=mnesia:start(),
    {atomic, ok}=
        mnesia:create_table(TabName, [{type, set},
                                      {attributes, [key, datum1, datum2]},
                                      {frag_properties, [{n_fragments, 1},
                                                         {node_pool, [ThisNode]}]}]),
    ok=write_loop(TabSize, TabName),
    {Time, {atomic, ok}}=
        timer:tc(fun() ->
                         mnesia:change_table_frag(TabName, {add_frag, [ThisNode]})
                 end),
    ok=verify_loop(TabSize, TabName),
    stopped=mnesia:stop(),
    {ok, Time}.

On my own PC I obtain the following results:

Before:
    test2(10000) -> {ok, 1691416}
    test2(100000) -> {ok, 538015238}

After:
    test2(10000) -> {ok, 55239}
    test2(100000) -> {ok, 745721}

A last note: THIS IS EXPERIMENTAL CODE, USE IT AT YOUR OWN RISK.
In other words, this patch is still a work in progress and not
production ready.
For more details read on.

DETAILED SECTION (EPPlight)

Why this new feature?

First of all this is not a new feature -- from the point of view of the
developer there are no new modules, functions, language constructs nor
options.

It all started by the need to dynamically add fragments to a NON-EMPTY
fragmented table (using the mnesia_frag access module).
As you know, a fragment with as little as 100000 takes far more than
tollerable to be split.

Analyzing mnesia's source code, it came out that the culprit is the
temporary storage of all the write operations under the same key in an
ETS table of the bag type.
ETS of the set, bag and duplicate_bag types use linear hashing to store
objects, thus inserting objects _with different keys_ in an ETS has O(1)
time complexity.
However, if all the objects inserted have the same key then time
complexity changes to O(k), where k is the number of objects sharing the
same key.
This is so for, in the current implementation, objects with the same
keys are all put in a simply linked list and when looking for duplicates
the whole list must be traversed.

Risks/uncertain artifacts

None, as far I can tell _now_.
However this patch is still experimental and I have tested it only under
Linux (ArchLinux).
Moreover, it should be optimized to reduce memory consuption, and some
code clean-up wouldn't harm.

How did I solve it?

Nested linear hash tables. But first some internal data structures must
be reorganized.
Reorganization is achieved by use of "bidimensional lists", i.e.,
something like this:

               NULL            NULL            NULL
                ^     kprevious ^     kprevious ^
      kprevious |               |               |
----------+   +------+  next  +------+  next  +------+ next
bucket[i] |-->| Obj1 |------->| Obj3 |------->| Obj4 |----->NULL
----------+   +------+        +------+        +------+
                ^  | knext   knext |            ^  | knext
      kprevious |  V               V  kprevious |  V
              +------+            NULL        +------+
              | Obj2 |                        | Obj5 |
              +------+                        +------+
                   | knext                         | knext
                   V                               V
                  NULL                            NULL

All the ObjXs use the same data structure (a modified HashDbTerm).
Obj1 and Obj2 share the same key, as well as Obj4 and Obj5 do.
Obj2 and Obj5's next pointers are unused.
In the code Obj1, Obj3 and Obj4 are also referenced as "root nodes".

Why the doubly linked list? As the "next chain" is traversed
sequentially, adding and removing objects is easy thing.
Whereas, to achieve the O(1) complexity, the "kprevious/knext chain"
must be accessible randomly (we'll see how in a moment).
Also, the order of insertion must be preserved, thus we cannot throw
away the linked list and substitute it with a hash table.

Now we are ready to introduce the nested linear hash table.
When the kprevious/knext chain reach some minimum length
(DB_BAG_LHT_LEN, which I arbitrarily set at 128), an empty linear hash
table is created and a pointer to it is installed in the root node (Obj1
or Obj4, not Obj2 nor Obj5).
Then it is filled with references to the objects of the kprevious/knext
chain, keyed with the hash value of the whole objects (Obj1 and Obj2 in
Obj1's nested LHT or Obj4 and Obj5 in Obj4's nested LHT).

By now it should be clear how things work: whenever an object is looked
for, first the next chain is traversed to find the key, then, if a
nested LHT is installed in the root node found, a lookup is performed on
the hash value of the object.
If there's no nested LHT, a sequential traversal is performed.

Sorry, I exceeded the 200 words limit, but I thing that that is the
minimum required to understand the whys and hows of this patch.

There is more to be said, however the source code is more appropriate to
describe all the minute details of the implementation.
Also, note that the patch introduces some specially marked comments, like
this:

/* ??? some question ??? */

These are questions and doubts I couldn't answer by myself and that I
ask you to find an appropriate answer.

Comments are welcome, especially on how to optimize for speed and
memory consumption.
Please, don't hesitate to contact me for further details.

Have a nice day.

-- 
FRANCESCO LATTANZIO : SYSTEM & SOFTWARE
A-TONO TECHNOLOGY : VIA DEL CHIESINO, 10 - 56025 PONTEDERA (PI) : T +39 0587 59221 : F +39 0587 59221 : SKYPE franz.lattanzio
a-tono.com : twitter.com/ATonoOfficial : facebook.com/ATonoOfficial

Information in this email is confidential and may be privileged. It is intended for the addresses only.
If you have received it in error, please notify the sender immediately and delete it from your system.
You should not otherwise copy it, retransmit it or use or disclose its content to anyone. Thank you for your co-operation.