[erlang-questions] monitor and failure detectors

Johan Montelius <>
Thu Jun 3 10:46:46 CEST 2010



Ok, so it's a global parameter for all monitors. From an application-level  
point of view it could be an advantage to have this as a per-monitor value.
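As a sketch of what I mean, a per-monitor timeout could be approximated at  
the application level by racing a monitor against a timer (the function  
name is illustrative, not a proposed API):

```erlang
%% Hypothetical application-level approximation of a per-monitor timeout:
%% monitor the process and start a timer; whichever fires first decides.
monitor_with_timeout(Pid, Timeout) ->
    Ref = erlang:monitor(process, Pid),
    TRef = erlang:send_after(Timeout, self(), {timeout, Ref}),
    receive
        {'DOWN', Ref, process, Pid, Reason} ->
            erlang:cancel_timer(TRef),
            {down, Reason};
        {timeout, Ref} ->
            erlang:demonitor(Ref, [flush]),
            timeout
    end.
%% (A late {timeout, Ref} message can still arrive after cancel_timer;
%% handling that is left out for brevity.)
```

But of course this only bounds how long the caller waits, it cannot make  
the underlying noconnection detection itself any faster.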

When a node does terminate in a controlled way, does it not inform other  
nodes that it is terminating? Or does it only close the connections to  
other nodes and let them trigger on the closed connection? A smoother way  
would be to send a last "bye" message and let them report that the node,  
and all processes of that node, are down.
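A quick experiment along these lines (node and process names are  
hypothetical): from node a, monitor a registered process on node b, then  
stop b cleanly and see how fast, and with what reason, the 'DOWN' message  
arrives:

```erlang
%% On a@host, monitor a registered process on b@host, then shut b down
%% cleanly via rpc and wait for the 'DOWN' message.
Ref = erlang:monitor(process, {some_proc, 'b@host'}),
rpc:call('b@host', init, stop, []),
receive
    %% For a controlled shutdown the 'DOWN' should arrive almost
    %% immediately; the reason is still noconnection, though, so the
    %% observer cannot tell a clean stop from a broken link.
    {'DOWN', Ref, process, _, Reason} -> Reason
after 5000 ->
    no_message
end.
```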

-- Is the description in the manual ok?

"if nothing has been received from another node within the last four (4)  
tick times"

should it not read

"if nothing has been received from another node within the last four (4)  
ticks"

It's confusing that "tick times" is not "ticktime".

"TickTime is by default 60 (seconds). Thus, 45 < T < 75 seconds."

How can T be smaller than 60? If a node dies then we have to wait for at  
least 4 ticks (60 s) before we detect it. If we are unlucky it could take  
60+15 s, but how could it be 45?

   Johan


On Thu, 03 Jun 2010 10:24:58 +0200, Ulf Wiger  
<> wrote:

>
> Monitors have no timeout. They trigger immediately when either
> the process dies or the node of the process is disconnected.
>
> Is it the latter event that you are referring to?
> This can be configured using -kernel net_ticktime T, where
> T is 60 seconds by default.
>
> See http://www.erlang.org/doc/man/kernel_app.html
>
> "net_ticktime = TickTime
>
> Specifies the net_kernel tick time.
> TickTime is given in seconds. Once every TickTime/4 second,
> all connected nodes are ticked (if anything else has been
> written to a node) and if nothing has been received from
> another node within the last four (4) tick times that node
> is considered to be down. This ensures that nodes which are
> not responding, for reasons such as hardware errors, are
> considered to be down.
>
> The time T, in which a node that is not responding is
> detected, is calculated as: MinT < T < MaxT where:
>
> MinT = TickTime - TickTime / 4
> MaxT = TickTime + TickTime / 4
>
> TickTime is by default 60 (seconds). Thus, 45 < T < 75 seconds.
>
> Note: All communicating nodes should have the same TickTime value  
> specified.
>
> Note: Normally, a terminating node is detected immediately."
>
> BR,
> Ulf W
>
> Johan Montelius wrote:
>>
>> Some question on monitors:
>>
>> Is there a way to change the timeout of monitors to configure how eager
>> they will be to deliver a DOWN/noconnection message? Can this be changed
>> on a per monitor basis so one could monitor one process with a 0.1s
>> timeout and another with a 20s timeout?
>>
>>
>> I guess it is the epmd daemon that is responsible for tracking the state
>> of nodes on a host. If a node crashes a monitor will report
>> DOWN/noconnection. Could it be possible to have monitor generate a
>> DOWN/down or similar when/if it can be determined that the node (and
>> thus the process that we monitor) is actually down. The
>> DOWN/noconnection message leaves us in a state where we don't know
>> whether it is down or simply disconnected.
>>
>>   Johan
>
>


-- 
Associate Professor Johan Montelius
Royal Institute of Technology - KTH
School of Information and Communication Technology - ICT

