<div dir="ltr">Hi Ulf,<div><br></div><div>Good to know the effect of +zdbbl. I'm not sure a 3rd party will help</div><div>deconflict in all cases, though. A good example is the Ctrl-Z test: suspending<br></div><div>the Erlang VM (or any OS-level process) for a reasonable period, say 5 or 15 seconds,</div><div>then bringing it back to the foreground (e.g. fg on Linux) whilst under a reasonable production load</div><div>is a great way to find issues here.</div><div><br></div><div>Many failure detectors in commercial and open source projects fail this simple test,</div><div>and it sounds like distributed Erlang, by actively suspending the tick process, is</div><div>also less than ideal. With careful tuning to load, though, it sounds reasonably manageable</div><div>in practice.</div><div><br></div><div>Cheers,</div><div><br></div><div>Darach.</div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Oct 10, 2014 at 9:54 PM, Ulf Wiger <span dir="ltr">&lt;<a href="mailto:ulf@feuerlabs.com" target="_blank">ulf@feuerlabs.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><span class=""><br><div><div>On 10 Oct 2014, at 04:28, Akash Chowdhury &lt;<a href="mailto:achowdhury918@gmail.com" target="_blank">achowdhury918@gmail.com</a>&gt; wrote:</div><br><blockquote type="cite"><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><em style="font-family:Helvetica;font-size:12px;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px">=ERROR REPORT==== ...</em><span 
style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;float:none;display:inline!important">>>></span><i style="font-family:Helvetica;font-size:12px;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><span> </span>** Node &lt;secondary node&gt; not responding **</i><span style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;float:none;display:inline!important">>>></span><i style="font-family:Helvetica;font-size:12px;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><span> </span>** Removing (timedout) connection **</i><span style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;float:none;display:inline!important">>>></span></blockquote><br></div></span><div>This message doesn’t in itself positively indicate a netsplit, but since you observe the app running on both nodes, that is what has happened.</div><div><br></div><div>Specifically, no pings from the secondary node arrived (either it failed to send them, or they got stuck or lost along the way) for four consecutive ping intervals. The total length of those four intervals is set using:</div><div><br></div><div>-kernel net_ticktime Seconds</div><div><br></div><div>Default is 60 seconds (one ping roughly every 15 seconds). 
I would not set it to a value less than 10 seconds, except after rigorous testing under realistic conditions.</div><div><br></div><div>There could be many reasons for the connection timing out, but the Erlang VM is usually not that unresponsive if it has sufficient resources. You should investigate whether anything in your local environment blocks the VM schedulers or otherwise renders the VM unresponsive.</div><div><br></div><div>If the network is flaky, or you’re pushing a lot of data, you might want to experiment with the</div><div><br></div><div>+zdbbl size</div><div><br></div><div>flag, which controls the buffer size (in kilobytes) for outgoing messages on Distributed Erlang. If this buffer becomes full, the VM will suspend any process that tries to send a message, and if this process is the ‘tick’ process, you have a problem. See the ‘erl’ man page for details.</div><div><br></div><div>The problem of determining whether to start the application or not when you get a ‘nodedown’ indication is a hard one. Ideally, you should designate a third party: if you can talk to the third party, and it tells you you’re ok, you’re good to go. If the third party is able to deconflict between two nodes pinging it at the same time, great.</div><div><br></div><div>A particular Erlang flavor of this problem is that the nodes may auto-connect again, which may, or may not, actually make things worse. One way to deal with that is to set -kernel dist_auto_connect once, which means that a node will only auto-connect with another node once in its lifetime; it needs to restart before it can auto-connect again. If you combine this with e.g. a UDP ping between the nodes, you can detect whether the nodes have lost contact, but are both still alive. If this occurs, you can decide (e.g. using the UDP messages) to restart one of the nodes.</div><div><br></div><div>BR,</div><div>Ulf W</div><br><div>
<span style="border-collapse:separate;border-spacing:0px"><div><div>Ulf Wiger, Co-founder & Developer Advocate, Feuerlabs Inc.</div><div><a href="http://feuerlabs.com" target="_blank">http://feuerlabs.com</a></div></div><div><br></div></span><br>
</div>
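<div>A minimal sketch of the UDP ping idea above, written in Python rather than Erlang so that it runs standalone. Everything in it is an assumption for illustration: the names start_responder, peer_alive and classify, the ping/pong payloads, and the timeouts are hypothetical and not part of any Erlang/OTP API. It only shows how an out-of-band probe can tell "lost the distribution link, but the peer is still alive" apart from "peer unreachable".</div>
<pre>
```python
import socket
import threading


def start_responder(host="127.0.0.1", port=0):
    """Answer every UDP datagram with b"pong"; returns (socket, bound_port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))

    def serve():
        while True:
            try:
                _data, addr = sock.recvfrom(64)
                sock.sendto(b"pong", addr)
            except OSError:
                return  # socket was closed; stop serving

    threading.Thread(target=serve, daemon=True).start()
    return sock, sock.getsockname()[1]


def peer_alive(host, port, timeout=0.5, attempts=3):
    """Return True if the peer answers any of a few UDP pings."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for _ in range(attempts):
            try:
                sock.sendto(b"ping", (host, port))
                data, _addr = sock.recvfrom(64)
                if data == b"pong":
                    return True
            except OSError:
                continue  # timed out or refused; try again
    return False


def classify(dist_connected, udp_alive):
    """Combine the distribution link state with the UDP probe result."""
    if dist_connected:
        return "connected"
    if udp_alive:
        # Both nodes are up but no longer talk over distribution:
        # this is the case where one of them should be restarted.
        return "split_but_alive"
    return "unreachable"
```
</pre>
<div>A failed UDP probe does not prove the peer is dead (UDP packets can be lost too), which is why the last case is labelled "unreachable" rather than "down". Which node restarts in the "split_but_alive" case is a policy decision this sketch leaves open.</div>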
<br></div><br>
<br></blockquote></div><br></div>
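<div dir="ltr"><div>For reference, the Ctrl-Z test described at the top of this message can be scripted instead of done by hand. The following is a minimal sketch, assuming a POSIX system; the function names are hypothetical. Note one deliberate difference: Ctrl-Z in a terminal sends SIGTSTP, which a process may catch, whereas this sketch uses SIGSTOP, which cannot be caught, followed by SIGCONT (what fg sends) to resume.</div>
<pre>
```python
import os
import signal
import subprocess
import sys
import time


def suspend_and_resume(pid, pause_seconds):
    """Freeze a process with SIGSTOP, wait, then thaw it with SIGCONT."""
    os.kill(pid, signal.SIGSTOP)
    try:
        time.sleep(pause_seconds)
    finally:
        # Always resume, even if the sleep is interrupted.
        os.kill(pid, signal.SIGCONT)


def demo(pause_seconds=5.0):
    """Run the test against a throwaway child; True if it is still running."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    try:
        suspend_and_resume(child.pid, pause_seconds)
        return child.poll() is None
    finally:
        child.kill()
        child.wait()
```
</pre>
<div>To exercise a real node, call suspend_and_resume with the OS pid of the Erlang VM (under production-like load) and a pause of 5 to 15 seconds, then watch how the other nodes and the failure detector react.</div></div>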