<div dir="ltr">Hi,<div>I did indeed encounter something like this (though not in the node-down scenario). I forget the details, but one thing that helped was replacing the `native` lookup method with `dns` (<a href="https://erlang.org/doc/apps/erts/inet_cfg.html">https://erlang.org/doc/apps/erts/inet_cfg.html</a>). IIRC, some micro-benchmarks showed that the native method was a lot slower.</div><div>Hope this helps.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Nov 26, 2020 at 11:36 AM Pierre Allix &lt;<a href="mailto:pierre.allix.work@gmail.com">pierre.allix.work@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">We have a production system where one Erlang node connected to another node<br>performs badly when the second node is down. In such a case the message queues on<br>the first node grow and the node does not manage to process its messages<br>quickly enough. The communication between the nodes is done with GenServer.cast.<br>When sending a message to the second node from the first node we use a call similar to<br>GenServer.cast({:registered_name_on_the_remote, "<a href="mailto:nodeid@server2.example.com" target="_blank">nodeid@server2.example.com</a>"}, msg).<br><br>With tcpdump, traces, trial and error, and reading some of the Erlang code, I<br>have identified that :inet_gethost_native.gethostbyname is invoked a huge<br>number of times as long as the second node is stopped. This increase in load<br>causes the function to sometimes take more than 3 seconds before returning a<br>value!
With tcpdump I have measured how long the DNS queries take and can confirm<br>that this increase in time is due to the Erlang code and not to the network or<br>DNS server (max time for a DNS query is 15ms).<br><br>I have created a proof of concept with a detailed README showing how to reproduce the problem:<br><a href="https://www.github.com/pallix/erlang_dns_under_stress" target="_blank">https://www.github.com/pallix/erlang_dns_under_stress</a><br><br>It creates much more IO and less CPU load than our production problem but the<br>external behavior is similar: when the second node is restarted, the DNS resolution is<br>fast again. My knowledge of Erlang is limited so the code is in Elixir, but it is simple enough that anyone who knows Erlang should be able to reproduce the problem in Erlang.<br><br>My team and I are surprised that such a situation puts Erlang under stress and that just disconnecting/stopping (with Ctrl-C or systemctl stop &lt;service&gt;) the node causes this problem. After all, a node being down should not be a big deal and DNS queries should be<br>cached.<br><br>We used Erlang 21, Elixir 1.10 and Debian Stretch when we discovered the<br>problem. We have now updated to Erlang 23, Elixir 1.11 and Debian Buster and<br>the problem persists.<br><br>Would you consider this behavior a bug?<br><br>Do you have experience with such a situation?<br><br><div>I posted a similar message on the Elixir forum a few days ago but did not get any answer, so I think it's appropriate to ask here; some experts may have more knowledge of the BEAM internals.</div><div><br></div><div>Any help will be highly appreciated. </div></div>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr">PGP fingerprint: F708 E141 AE8D 2D38 E1BC DF3D 1719 3EA0 647D 7260</div></div>
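<div><br></div><div>For anyone wanting to try the lookup change mentioned in the reply, here is a minimal sketch of an inet configuration file. It assumes the file is saved as ./erl_inetrc (that name is only an example, not something from the thread). It asks the resolver to consult the hosts file first and then Erlang's own DNS client, instead of the default native method:</div>
<pre>
%% erl_inetrc (hypothetical file name, chosen for illustration)
%% Look up host names in the hosts file first, then with the
%% built-in Erlang DNS client, instead of the native resolver.
{lookup, [file, dns]}.
</pre>
<div>The file can be loaded at startup with erl -kernel inetrc '"./erl_inetrc"'. Afterwards, inet:get_rc() (:inet.get_rc() in Elixir) should list the non-default lookup setting. Whether this also helps in the node-down case described in the quoted message has not been verified.</div>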