Erlang DNS resolution and send operation speed when two nodes are split

Pierre Allix pierre.allix.work@REDACTED
Thu Nov 26 11:23:08 CET 2020


We have a production system where one Erlang node connected to another node
performs badly when the second node is down. In such a case the message
queues on the first node grow and the node does not manage to process its
messages quickly enough. The communication between the nodes is done with
GenServer.cast.
When sending a message to the second node from the first node we use a call
similar to
GenServer.cast({:registered_name_on_the_remote, :"nodeid@REDACTED"}, msg).
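
For illustration only, here is a minimal sketch of that calling pattern. The
Receiver module and its message shape are hypothetical, and node() stands in
for the remote node name so that the sketch runs on a single node; it is not
the code from the production system.

```elixir
defmodule Receiver do
  use GenServer

  # Hypothetical receiver registered under the name used in the cast above.
  def start_link(_opts \\ []) do
    GenServer.start_link(__MODULE__, [], name: :registered_name_on_the_remote)
  end

  @impl true
  def init(state), do: {:ok, state}

  # Store every cast message, newest first.
  @impl true
  def handle_cast(msg, state), do: {:noreply, [msg | state]}

  # Return the stored messages in arrival order.
  @impl true
  def handle_call(:messages, _from, state) do
    {:reply, Enum.reverse(state), state}
  end
end

{:ok, _pid} = Receiver.start_link()

# Assumption: node() replaces the remote node name (an atom such as
# :"nodeid@host") so that the example works without a second node.
target = {:registered_name_on_the_remote, node()}

# cast/2 always returns :ok, whether or not the destination exists.
:ok = GenServer.cast(target, {:event, 1})
```

Because cast/2 always returns :ok, the sender gets no direct signal that the
destination node is down.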

With tcpdump, traces, trial and error, and reading some of the Erlang code,
I have identified that :inet_gethost_native.gethostbyname is invoked a huge
number of times as long as the second node is stopped. This increase in load
causes the function to sometimes take more than 3 seconds before returning a
value! With tcpdump I have measured how long the DNS queries take and can
confirm that this increase in time is due to the Erlang code and not to the
network or DNS server (max time for a DNS query is 15ms).
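
As a rough sketch of how the duration of a single lookup could be measured
from inside the VM, the snippet below wraps the call in :timer.tc. The
LookupTimer module is a hypothetical helper, "localhost" is only a
placeholder host name, and :inet_gethost_native is an internal, undocumented
module, so its behavior may differ between OTP releases.

```elixir
defmodule LookupTimer do
  # Hypothetical helper: time one native lookup and return
  # {elapsed_milliseconds, lookup_result}.
  def time_lookup(hostname) when is_binary(hostname) do
    {micros, result} =
      :timer.tc(fn ->
        # The Erlang function expects a charlist, not an Elixir binary.
        :inet_gethost_native.gethostbyname(String.to_charlist(hostname))
      end)

    {micros / 1000, result}
  end
end

# "localhost" is a placeholder; use the host part of the remote node name.
{ms, result} = LookupTimer.time_lookup("localhost")
IO.puts("lookup took #{ms} ms: #{inspect(result)}")
```

Running this in a loop while the second node is stopped, and again after it
is restarted, should make the difference in lookup time visible without
tcpdump.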

I have created a proof of concept with a detailed README showing how to
reproduce the problem:
https://www.github.com/pallix/erlang_dns_under_stress

It creates much more IO and less CPU load than our production problem but
the external behavior is similar: when the second node is restarted, the DNS
resolution is fast again. My knowledge of Erlang is limited so the code is in
Elixir, but it is simple enough that anybody who knows Erlang should be able
to reproduce the problem in Erlang.

My team and I are surprised that such a situation puts Erlang under stress
and that just disconnecting/stopping (with Ctrl-C or systemctl stop
<service>) the node causes this problem. After all, a node being down
should not be a big deal and DNS queries should be cached.

We used Erlang 21, Elixir 1.10 and Debian Stretch when we discovered the
problem. We have now updated to Erlang 23, Elixir 1.11 and Debian Buster and
the problem persists.

Would you consider this behavior a bug?

Do you have experience with such a situation?

I posted a similar message on the Elixir Forum a few days ago but did not
get any answer, so I think it is appropriate to ask here; some experts may
have more knowledge of the BEAM internals.

Any help will be highly appreciated.