[erlang-questions] LCNT: understanding proc_* and db_hash_slot collisions

Sun Jul 31 11:45:42 CEST 2016

On Tue, Jul 26, 2016 at 12:11 PM, Danil Zagoskin <z@REDACTED> wrote:

> Hello!
>

Hello!

> Inspecting the process gives me
>         lock          id  #tries  #collisions  collisions [%]  time [us]  duration [%] histogram [log2(us)]
>        -----         --- ------- ------------ --------------- ---------- ------------- ---------------------
>  flu_pulsedb   proc_main    6989         4242         60.6954    1161759       11.5906 |       ...........XX...       |
>  flu_pulsedb   proc_msgq    7934         3754         47.3154     669383        6.6783 |      .xXXxxXxx..xXX...       |
>  flu_pulsedb proc_status    5489          521          9.4917     287153        2.8649 |     ..xxxxxxxxxxxXX..        |
>  flu_pulsedb   proc_link     864           44          5.0926        566        0.0056 |      ...XxX.....             |
>  flu_pulsedb    proc_btm       0            0          0.0000          0        0.0000 |                              |
>  flu_pulsedb  proc_trace     201            0          0.0000          0        0.0000 |                              |
>
> Context: this process is a data receiver. Each sender first checks its
> message queue length and then sends a message if queue is not very long.
> This happens about 220 times a second. Then this process accumulates
> some data and writes it to disk periodically.
> What do proc_main, proc_msgq and proc_status locks mean?
>

proc_main is the main execution lock, proc_msgq is the lock protecting the
external message queue, and proc_status is the lock protecting the process
status and various other things.

> Why at all are collisions possible here?
>

When doing various operations, different locks are needed in order to
guarantee the order of events. For instance, when sending a message the
proc_msgq lock is needed. However, when checking the size of the message
queue, both the proc_main and proc_msgq locks are needed. So if many
processes continually check the message queue size of a single other
process, you will get a lot of contention on both the main and msgq locks.
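
For illustration, the sender-side pattern described in the question probably
looks something like the sketch below. The module name, function name and
return atoms are hypothetical, not taken from the original code. The call to
erlang:process_info/2 on another process is the step that contends for that
process's locks.

```erlang
-module(guarded_send).
-export([maybe_send/3]).

%% Send Msg to Pid only if the receiver's message queue is shorter
%% than MaxLen. Every call inspects the *receiver's* queue, so many
%% senders doing this concurrently all contend for the same locks.
maybe_send(Pid, Msg, MaxLen) ->
    case erlang:process_info(Pid, message_queue_len) of
        {message_queue_len, Len} when Len < MaxLen ->
            Pid ! Msg,
            sent;
        {message_queue_len, _TooLong} ->
            dropped;
        undefined ->
            %% process_info/2 returns undefined for a dead process.
            noproc
    end.
```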

> What should I see next to optimise this bottleneck?
>

Don't check the message queue length of another process unless you really
have to, and if you do have to, do it as seldom as you can. Checking the
length of a message queue is a deceptively expensive operation.
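
One possible way to check "as seldom as you can" is to sample the queue
length only once every N sends and reuse the last verdict in between. This is
a minimal sketch under that assumption, not a recommendation from the runtime
documentation; the module, its functions and the map keys are all
hypothetical.

```erlang
-module(sampled_send).
-export([new/3, send/2]).

%% Sender-local state. Every is how many sends share one queue check.
new(Pid, MaxLen, Every) when is_integer(Every), Every > 0 ->
    #{pid => Pid, max_len => MaxLen, every => Every,
      left => 0,      %% sends remaining before the next check
      ok => true}.    %% verdict of the most recent check

%% Time to re-check: inspect the receiver's queue once, cache the result.
send(Msg, #{left := 0, pid := Pid, max_len := MaxLen, every := Every} = S) ->
    Ok = case erlang:process_info(Pid, message_queue_len) of
             {message_queue_len, Len} -> Len < MaxLen;
             undefined -> false   %% receiver is dead
         end,
    deliver(Msg, S#{ok := Ok, left := Every - 1});
%% Otherwise reuse the cached verdict without touching the receiver.
send(Msg, #{left := Left} = S) ->
    deliver(Msg, S#{left := Left - 1}).

deliver(Msg, #{ok := true, pid := Pid} = S) ->
    Pid ! Msg,
    {sent, S};
deliver(_Msg, #{ok := false} = S) ->
    {dropped, S}.
```

Each sender keeps the returned state and passes it to the next send/2 call.
The trade-off is that the verdict can be stale for up to Every - 1 sends, so
the queue limit becomes approximate.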

>
> Next, inspecting db_hash_slot gives me 20 rows all alike (only top few
> shown):
>          lock  id  #tries  #collisions  collisions [%]  time [us]
>  duration [%] histogram [log2(us)]
>         ----- --- ------- ------------ --------------- ----------
> ------------- ---------------------
>  db_hash_slot   0     492          299         60.7724     107552
>  1.0730 |              ...XX. .        |
>  db_hash_slot   1     492          287         58.3333     101951
>  1.0171 |            .  ..XX. .        |
>  db_hash_slot  48     480          248         51.6667      99486
>  0.9925 |              ...xXx.         |
>  db_hash_slot  47     480          248         51.6667      96443
>  0.9622 |              ...XXx          |
>  db_hash_slot   2     574          304         52.9617      92952
>  0.9274 |           . ....XX. .        |
>
> How do I see what ETS tables are causing this high collision rate?
> Is there any way to map lock id (here: 0, 1, 48, 47, 2) to a table id?
>

IIRC, the id used in the lock checker should be the same as the table id.
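
To compare against the ids in the lcnt output, the tables on the node can be
listed from the shell as sketched below. Whether the lock ids actually line
up with these table ids is an assumption to verify on your own release. The
variable name Tables is arbitrary.

```erlang
%% One tuple per ETS table: {TableId, Name, OwnerPid, NumberOfObjects}.
%% ets:info/2 returns undefined if a table disappears while listing.
Tables = [{T, ets:info(T, name), ets:info(T, owner), ets:info(T, size)}
          || T <- ets:all()].
```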

> Or maybe there is a better tool for ETS profiling?
>

For detection of ETS conflicts there is no better tool. For general ETS
performance, I would use the same tools as for all Erlang code, i.e. cprof,
eprof, fprof and friends.
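
For reference, a typical lcnt session for reproducing tables like the ones
quoted above might look like this. It is a sketch that assumes a runtime
system built or started with lock counting enabled; the 10 second
measurement window and the max_locks value are arbitrary choices.

```erlang
%% Run in the shell of a lock-counting emulator.
lcnt:start().          %% start the lcnt server
lcnt:clear().          %% reset the counters before measuring
timer:sleep(10000).    %% let the system run under load (arbitrary window)
lcnt:collect().        %% fetch the counters from the runtime system
lcnt:conflicts().      %% overview of the most contended lock classes
lcnt:inspect(db_hash_slot, [{max_locks, 10}]).  %% drill into one class
```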

Lukas