[erlang-questions] large scale deployments and netsplits

Tue Sep 15 10:41:12 CEST 2009

Hi Bengt,

Bengt Tillman wrote:
> 
> We have had to set net_ticktime to 300 in order to keep the Erlang
> nodes from losing contact with each other. The response times between
> different Erlang nodes are not mission critical in our application ...

I will admit that I have meditated over the network
tick algorithm in Erlang several times, without being any
wiser for it. It's a very nice piece of code, but I can't
help thinking that there is some fatal flaw buried deep
within it.

At AXD 301, we tried reducing the detection times as much
as we could, but could never get below a net_ticktime of 10
seconds without getting lots of false positives (live nodes
being reported as down). In contrast, our own device processor
supervision had shorter detection times (5-6 seconds, if memory
serves) and practically never any false positives, using the
very same communication network.

The code wasn't nearly as elegant, though. :)

To be fair, this was on an internal ATM network, so we could
be fairly sure that the internal communication paths were
never starved by other traffic. This is of course not true
in general for TCP/IP networks.
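
For reference, here is a minimal sketch of how the setting under
discussion is usually applied. net_ticktime is a parameter of the
kernel application, given in seconds (default 60), and the kernel
documentation puts the detection time for an unresponsive node
somewhere between 0.75 and 1.25 times that value. The file name
sys.config is an assumption; the value 300 is the one from the
quoted mail.

```erlang
%% sys.config (file name is an assumption; load it with: erl -config sys)
%% Sets the kernel parameter net_ticktime to 300 seconds, the value
%% mentioned in the quoted mail.
%% Per the kernel documentation, an unresponsive node is then detected
%% after roughly 225 to 375 seconds (0.75 to 1.25 times net_ticktime).
[
 {kernel, [
   {net_ticktime, 300}
 ]}
].
```

The same value can also be given on the command line as
erl -kernel net_ticktime 300. By the same arithmetic, a net_ticktime
of 10 corresponds to a detection window of about 7.5 to 12.5 seconds.
As far as I know, all nodes that talk to each other are expected to
use the same value.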

BR,
Ulf W
-- 
Ulf Wiger
CTO, Erlang Training & Consulting Ltd
http://www.erlang-consulting.com