Erlang DNS resolution and send operation speed when two nodes are split

Thu Nov 26 13:00:57 CET 2020

Hi,
I did indeed encounter such a thing (though not in the node-down scenario).
I forget the details, but one thing that helped was replacing the `native`
lookup method with `dns`
(https://erlang.org/doc/apps/erts/inet_cfg.html). IIRC, some micro-benchmarks
told me that `native` was a lot slower.
Hope this helps.
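
For reference, a minimal sketch of what that change could look like, based
on the inet configuration format described at the link above. The file name
`erl_inetrc` and the choice to try the hosts file before DNS are assumptions
for illustration, not something taken from the system discussed here:

```erlang
%% erl_inetrc -- hypothetical file name; any path works.
%% Resolve host names with the hosts file first, then with the
%% Erlang DNS client, instead of the default `native` method.
{lookup, [file, dns]}.
```

The node would then be started with something like
`erl -sname my_node -kernel inetrc '"./erl_inetrc"'` (the node name is a
placeholder), and calling `inet:get_rc()` in the shell should show whether
the lookup setting was picked up. Whether this helps in the node-down case
would still need to be measured.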

On Thu, Nov 26, 2020 at 11:36 AM Pierre Allix <pierre.allix.work@REDACTED>
wrote:

> We have a production system where one Erlang node connected to another node
> performs badly when the second node is down. In such a case the message
> queues on the first node grow and the node does not manage to process its
> messages quickly enough. The communication between the nodes is done with
> GenServer.cast. When sending a message to the second node from the first
> node we use a call similar to
> GenServer.cast({:registered_name_on_the_remote, "nodeid@REDACTED"}, msg).
>
> With tcpdump, traces, trial and error, and reading some of the Erlang code,
> I have identified that :inet_gethost_native.gethostbyname is invoked a huge
> number of times for as long as the second node is stopped. This increase in
> load causes the function to sometimes take more than 3 seconds before
> returning a value! With tcpdump I have measured how long the DNS queries
> take and can confirm that this increase in time is due to the Erlang code
> and not to the network or DNS server (the maximum time for a DNS query is
> 15 ms).
>
> I have created a proof of concept with a detailed README showing how to
> reproduce the problem:
> https://www.github.com/pallix/erlang_dns_under_stress
>
> It creates much more IO and less CPU load than our production problem, but
> the external behavior is similar: when the second node is restarted, DNS
> resolution is fast again. My knowledge of Erlang is limited, so the code is
> in Elixir, but it is simple enough that anybody who knows Erlang should be
> able to reproduce the problem in Erlang.
>
> My team and I are surprised that such a situation puts Erlang under stress
> and that just disconnecting/stopping (with Ctrl-C or systemctl stop
> <service>) the node causes this problem. After all, a node being down
> should not be a big deal and DNS queries should be cached.
>
> We used Erlang 21, Elixir 1.10 and Debian Stretch when we discovered the
> problem. We have now updated to Erlang 23, Elixir 1.11 and Debian Buster
> and the problem persists.
>
> Would you consider this behavior a bug?
>
> Do you have experience with such a situation?
>
> I posted a similar message on the Elixir forum a few days ago but did
> not get any answer, so I think it's appropriate to ask here, where some
> experts may have more knowledge of the BEAM internals.
>
> Any help will be highly appreciated.
>

-- 
PGP fingerprint: F708 E141 AE8D 2D38 E1BC  DF3D 1719 3EA0 647D 7260