[erlang-questions] RE On node disconnections

Olivier BOUDEVILLE olivier.boudeville@REDACTED
Wed Apr 10 11:27:52 CEST 2013


Hi,

To answer my own question (why the loss of a remote node was making a 
monitoring node crash with an untrappable 'noconnection' error): it was due 
to the fact that this monitoring node was run with "-eval". Launching the 
same application from the Erlang shell instead produced no crash 
dump, and the node loss was correctly detected. 

However, an application must often be run from a script (not from an 
interactive shell; e.g. in order to run on a cluster), and apparently none 
of the ways of doing so with the 'init' module (namely "-eval", "-run" 
and "-s") allows the node to survive this kind of runtime failure.

So: how can a VM be run in batch mode (without the interactive shell) 
while still being able to recover from errors (such as 
'noconnection')?

Thanks in advance for any hint!

Best regards,

Olivier.
---------------------------
Olivier Boudeville

EDF R&D : 1, avenue du Général de Gaulle, 92140 Clamart, France
Département SINETICS, groupe ASICS (I2A), bureau B-226
Office : +33 1 47 65 59 58 / Mobile : +33 6 16 83 37 22 / Fax : +33 1 47 
65 27 13



Olivier BOUDEVILLE/IMA/DER/EDFGDF/FR 
Sent by: Olivier BOUDEVILLE/A/EDF/FR
02/04/2013 18:08

To: erlang-questions@REDACTED
Cc:
Subject: On node disconnections

Hi all,

I have a few questions about node disconnections. Currently I have a 
distributed application that must survive the crash of at least some of its 
hosts. I test this feature by using ten remote virtual machines, which I 
disconnect in software from the user host at random moments of the 
application's execution, thanks to a merciless 'ifconfig down' of the 
relevant network interface. 

It works great, insofar as the communications then freeze immediately (no 
surprise). My intent was to monitor, from a specific process on the user 
node, each of these ten worker nodes (using net_kernel:monitor_nodes/2), to 
receive the corresponding 'nodedown' messages and (if monitoring does not 
prevent 'noconnection' from being triggered) then to issue a 
disconnect_node/1 for each of them, so that my user node can survive these 
losses. 
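
For illustration, the monitoring process described above might look like 
the following minimal sketch. The module name node_watcher, the 
{get_alive, From} query message and the choice of the nodedown_reason 
option are assumptions made for the example, not the actual application 
code; only net_kernel:monitor_nodes/2 and erlang:disconnect_node/1 come 
from the description.

```erlang
-module(node_watcher).
-export([start/1]).

%% Spawns a watcher process that subscribes to node status messages and
%% keeps track of the worker nodes still believed to be reachable.
start(WorkerNodes) ->
    spawn(fun() ->
        %% With an option list, messages arrive as {nodeup, Node, Info}
        %% and {nodedown, Node, Info}.
        ok = net_kernel:monitor_nodes(true, [nodedown_reason]),
        loop(WorkerNodes)
    end).

loop(Alive) ->
    receive
        {nodedown, Node, _Info} ->
            %% Force the connection to be dropped, then forget the node.
            erlang:disconnect_node(Node),
            loop(lists:delete(Node, Alive));
        {nodeup, _Node, _Info} ->
            loop(Alive);
        {get_alive, From} ->
            %% Hypothetical query message, used here for inspection only.
            From ! {alive, Alive},
            loop(Alive)
    end.
```

Such a watcher only learns about a loss once the 'nodedown' message has 
been delivered, so by itself it does not protect processes that are linked 
to processes on the lost node.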

However, most of the time I cannot intercept the 'nodedown' information 
soon enough (or at all), and the whole program crashes and burns, with a 
message:

        ** Node 'N' not responding **
        ** Removing (timedout) connection **
        {"init terminating in do_boot",noconnection}

So, my question: how can I prevent this 'noconnection' from wreaking 
havoc, as it seems to ruin our ability to survive node losses?

If I understand correctly, as these reliability messages are managed "out 
of band", there is always a race condition between receiving them and 
telling all processes to stop interacting with the lost node(s). So if 
there were no way of at least temporarily withstanding 'noconnection' (as, 
whatever we do, there *will* be processes that attempt to send a 
message to a lost node), the whole purpose of the approach would be 
defeated. Unless I misunderstood something?
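
To make the race concrete: a process that monitors (rather than links to) 
a remote process is told about a lost connection through an ordinary 
'DOWN' message carrying the reason 'noconnection', instead of through an 
exit signal. The sketch below relies on that; the module name safe_call 
and the {request, ...} / {reply, ...} message protocol are hypothetical 
and only serve the example.

```erlang
-module(safe_call).
-export([call/3]).

%% Sends Request to Pid and waits for a reply, a 'DOWN' message or a
%% time-out. If the node hosting Pid is lost, Reason is 'noconnection'.
call(Pid, Request, Timeout) ->
    Ref = erlang:monitor(process, Pid),
    Pid ! {request, self(), Ref, Request},
    receive
        {reply, Ref, Reply} ->
            erlang:demonitor(Ref, [flush]),
            {ok, Reply};
        {'DOWN', Ref, process, Pid, Reason} ->
            {error, Reason}
    after Timeout ->
        erlang:demonitor(Ref, [flush]),
        {error, timeout}
    end.
```

With this pattern the caller decides what to do with {error, noconnection}, 
whether or not the 'nodedown' message has already been processed elsewhere.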

A related question: increasing the kernel net tick time (say, from 60 to 
300) does not seem to increase accordingly the 'noconnection' time-out 
that must exist somewhere. As a result, I think that, by design, node 
monitoring can only fail in that case (a 'noconnection' will always 
happen before the monitoring messages have a chance to kick in).
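
One thing that may be worth checking here is whether the new tick time is 
really in effect on every node, since the setting (for instance 
"-kernel net_ticktime 300" on the command line) is per node. The sketch 
below is only an assumption about how such a check could be written; the 
module name tick_check is hypothetical, while net_kernel:get_net_ticktime/0 
and rpc:call/4 are standard functions.

```erlang
-module(tick_check).
-export([ticktimes/0]).

%% Returns [{Node, TickTime}] for the local node and all connected nodes.
%% TickTime is in seconds; it is 'ignored' on a node that is not
%% distributed, and {badrpc, Reason} if the node could not be queried.
ticktimes() ->
    [{N, rpc:call(N, net_kernel, get_net_ticktime, [])}
     || N <- [node() | nodes()]].
```

If the values differ between the user node and the workers, the observed 
time-out may not match the value set on the user node.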

Thanks in advance for any hint!

Best regards,

Olivier.


