[erlang-questions] LCNT: understanding proc_* and db_hash_slot collisions

Tue Jul 26 12:11:28 CEST 2016

Hello!

I am inspecting performance issues in an Erlang/OTP system.

I've run LCNT and I see results I cannot interpret.

After lcnt:swap_pid_keys(), I see these conflicts (only top of the table):
               lock   id  #tries  #collisions  collisions [%]  time [us]
 duration [%]
              -----  --- ------- ------------ --------------- ----------
-------------
       db_hash_slot  896  192708        28684         14.8847    8164589
    81.4562
        flu_pulsedb    6   21477         8561         39.8612    2118861
    21.1394
          run_queue   42 6427021       169979          2.6448     339177
     3.3839
           pix_lock 1024    4384          229          5.2235     209468
     2.0898
            pollset    1  303839        16477          5.4229     134808
     1.3449
             db_tab  146  261929          926          0.3535     107955
     1.0770

Here I see two problems: lots of ETS locking (db_hash_slot) and something
with a particular process (flu_pulsedb).

Inspecting the process gives me
        lock          id  #tries  #collisions  collisions [%]  time [us]
 duration [%] histogram [log2(us)]
       -----         --- ------- ------------ --------------- ----------
------------- ---------------------
 flu_pulsedb   proc_main    6989         4242         60.6954    1161759
    11.5906 |       ...........XX...       |
 flu_pulsedb   proc_msgq    7934         3754         47.3154     669383
     6.6783 |      .xXXxxXxx..xXX...       |
 flu_pulsedb proc_status    5489          521          9.4917     287153
     2.8649 |     ..xxxxxxxxxxxXX..        |
 flu_pulsedb   proc_link     864           44          5.0926        566
     0.0056 |      ...XxX.....             |
 flu_pulsedb    proc_btm       0            0          0.0000          0
     0.0000 |                              |
 flu_pulsedb  proc_trace     201            0          0.0000          0
     0.0000 |                              |

Context: this process is a data receiver. Each sender first checks its
message queue length and then sends a message
   if queue is not very long. This happens about 220 times a second. Then
this process accumulates some data and
   writes it to disk periodically.
What do proc_main, proc_msgq and proc_status locks mean? Why at all are
collisions possible here?
What should I see next to optimise this bottleneck?

Next, inspecting db_hash_slot gives me 20 rows all alike (only top few
shown):
         lock  id  #tries  #collisions  collisions [%]  time [us]  duration
[%] histogram [log2(us)]
        ----- --- ------- ------------ --------------- ----------
------------- ---------------------
 db_hash_slot   0     492          299         60.7724     107552
 1.0730 |              ...XX. .        |
 db_hash_slot   1     492          287         58.3333     101951
 1.0171 |            .  ..XX. .        |
 db_hash_slot  48     480          248         51.6667      99486
 0.9925 |              ...xXx.         |
 db_hash_slot  47     480          248         51.6667      96443
 0.9622 |              ...XXx          |
 db_hash_slot   2     574          304         52.9617      92952
 0.9274 |           . ....XX. .        |

How do I see what ETS tables are causing this high collision rate?
Is there any way to map lock id (here: 0, 1, 48, 47, 2) to a table id?
Or maybe there is a better tool for ETS profiling?

-- 
Danil Zagoskin | z@REDACTED
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20160726/d4655e0b/attachment.htm>