[erlang-questions] LCNT: understanding proc_* and db_hash_slot collisions
Lukas Larsson
garazdawi@REDACTED
Thu Aug 18 10:27:08 CEST 2016
On Tue, Aug 16, 2016 at 12:28 PM, Danil Zagoskin <z@REDACTED> wrote:
>>> Next, inspecting db_hash_slot gives me 20 rows all alike (only top few
>>> shown):
>>> lock         id  #tries #collisions collisions [%] time [us] duration [%] histogram [log2(us)]
>>> -----        --- ------ ----------- -------------- --------- ------------ ---------------------
>>> db_hash_slot 0   492    299         60.7724        107552    1.0730       | ...XX. .    |
>>> db_hash_slot 1   492    287         58.3333        101951    1.0171       | . ..XX. .   |
>>> db_hash_slot 48  480    248         51.6667        99486     0.9925       | ...xXx.     |
>>> db_hash_slot 47  480    248         51.6667        96443     0.9622       | ...XXx      |
>>> db_hash_slot 2   574    304         52.9617        92952     0.9274       | . ....XX. . |
>>>
>>> How do I see what ETS tables are causing this high collision rate?
>>> Is there any way to map lock id (here: 0, 1, 48, 47, 2) to a table id?
>>>
>>
>> iirc the id used in the lock checker should be the same as the table id.
>>
>
> Unfortunately, the lock id here is just the table's hash slot index, not the
> table id:
> https://github.com/erlang/otp/blob/maint/erts/emulator/beam/erl_db_hash.c#L687
> After changing make_small(i) to tb->common.the_name we were able to see
> the table name causing the locking:
>
> (flussonic@REDACTED)22> lcnt:inspect(db_hash_slot, [{max_locks, 10}]).
> lock id #tries #collisions collisions [%] time [us] duration [%] histogram [log2(us)]
> ----- --- ------- ------------ --------------- ---------- ------------- ---------------------
> db_hash_slot pulsedb_seconds_data 523 78 14.9140 26329 0.5265 | .. .XXX .. |
> db_hash_slot pulsedb_seconds_data 498 77 15.4618 24210 0.4841 | ...xXX. . |
> db_hash_slot pulsedb_seconds_data 524 62 11.8321 23082 0.4616 | . ..XX. .. |
> db_hash_slot pulsedb_seconds_data 489 74 15.1329 21425 0.4284 | ...XX. . |
> db_hash_slot pulsedb_seconds_data 493 79 16.0243 19918 0.3983 | ... .xXX. |
> db_hash_slot pulsedb_seconds_data 518 67 12.9344 19298 0.3859 | ....XX.. |
> db_hash_slot pulsedb_seconds_data 595 70 11.7647 18947 0.3789 | . ..XX. |
> db_hash_slot pulsedb_seconds_data 571 74 12.9597 18638 0.3727 | ....XX. |
> db_hash_slot pulsedb_seconds_data 470 61 12.9787 17818 0.3563 | .....XX... |
> db_hash_slot pulsedb_seconds_data 475 75 15.7895 17582 0.3516 | xXX. |
> ok
>
>
>
> Should I create a PR for that?
> The result is not perfect: it would be better to see {TableName, LockID}
> there, but I failed to create a new tuple in that function.
>
Yes please. Although, as you say, the PR should also contain the lock id so
that it's possible to know which hash slot is the culprit. You should be able
to just add some extra memory to the erts_alloc call just above the for loop
and then use the TUPLE2() macro to create the tuple, something like:
tb->locks = (DbTableHashFineLocks*)
    erts_db_alloc_fnf(ERTS_ALC_T_DB_SEG, /* Other type maybe? */
                      (DbTable *) tb,
                      sizeof(DbTableHashFineLocks)
                      /* 3 extra words per lock: a 2-tuple is header + 2 elements */
                      + 3 * sizeof(Eterm) * DB_HASH_LOCK_CNT);
Eterm *hp = (Eterm*)(tb->locks+1);  /* heap space for the {Name, Slot} lock ids */
for (i = 0; i < DB_HASH_LOCK_CNT; ++i) {
    erts_smp_rwmtx_init_opt_x(&tb->locks->lck_vec[i].lck, &rwmtx_opt,
                              "db_hash_slot",
                              TUPLE2(hp, tb->common.the_name, make_small(i)));
    hp += 3;  /* move past the tuple just written */
}
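
With something like that in place, the inspect output above would show both the
owning table and the slot. A rough sketch of reading it back (assuming an
emulator built with lock counting; the 10-second sampling window is arbitrary):

lcnt:start(),
lcnt:clear(),                                  % reset counters before the window of interest
timer:sleep(10000),                            % sample the system under normal load
lcnt:collect(),                                % pull the counter data into the lcnt server
lcnt:inspect(db_hash_slot, [{max_locks, 10}]).
%% the id column would then read e.g. {pulsedb_seconds_data,47} rather than just the name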
>
> Things still unclear:
> - Why does the ETS usage pattern affect processes which do not use ETS?
>
I don't know in your specific case, but in general eliminating contention
points like these is a constant game of whack-a-mole. When you eliminate one,
all the processes are free to bang on another contention point, so you end up
with contention somewhere else. I've even seen cases where eliminating a
contention point led to a slower overall system, because another contention
point became even more contended and slowed the system down significantly.
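
One way to check whether a fix removed contention or merely moved it is to
keep an lcnt snapshot from before the change and compare the summaries
afterwards. A sketch, with made-up file names:

%% Before the change: sample under load and keep the data on disk.
lcnt:clear(),
timer:sleep(10000),
lcnt:collect(),
lcnt:save("lcnt_before.dump"),

%% After the change (same kind of load): sample again and compare.
lcnt:clear(),
timer:sleep(10000),
lcnt:collect(),
lcnt:conflicts(),               % current hot locks
lcnt:load("lcnt_before.dump"),
lcnt:conflicts().               % hot locks from before the change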
> - Are there more hidden ETS tuning options?
>
Most likely. We constantly introduce different tuning options to see whether
they help in specific cases, and not all of them get documented, for various
reasons.
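
The documented per-table options are the place to start, though. Given that
db_hash_slot locks show up at all, write_concurrency is presumably already
enabled on these tables, so the following is only a sketch of where such
options go (the table name is made up):

%% Documented ets:new/2 concurrency options; example_metrics is a placeholder name.
Tab = ets:new(example_metrics, [set, public, named_table,
                                {write_concurrency, true},  % fine-grained hash-slot locks
                                {read_concurrency, true}]). % rwlocks tuned for read-heavy use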
> - What else can we do to make this system faster? Free system resources
>   are enough to do 4-5 times more work.
>
Continue doing what you are doing :) Maybe use Linux perf to see if you can
get any more information out of it?