[erlang-questions] On node disconnections

Olivier BOUDEVILLE olivier.boudeville@REDACTED
Tue Apr 2 18:08:24 CEST 2013


Hi all,

I have a few questions about node disconnections. I currently have a 
distributed application that must survive the crash of at least some of 
its hosts. I first test this feature with ten remote virtual machines, 
which I disconnect in software from the user host at random moments 
during the execution of the application, thanks to a merciless 
'ifconfig down' of the relevant network interface. 

It works great, insofar as communications then freeze immediately (no 
surprise). My intent was to monitor, from a specific process on the user 
node, each of these ten worker nodes (using net_kernel:monitor_nodes/2), 
to receive the corresponding 'nodedown' messages and then (if monitoring 
does not prevent 'noconnection' from being triggered) to call 
disconnect_node/1 for each of them, so that my user node can survive 
these losses. 
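
For illustration, here is a minimal sketch of the monitoring process 
described above. The module name, the function names and the get_alive 
message are hypothetical and only serve the example; the sketch assumes 
that the watcher runs on the user node and that the worker node names 
are known beforehand. It is not the actual code of the application.

```erlang
-module(node_watcher).
-export([start/1, loop/1]).

%% Hypothetical helper: spawns a watcher process on the user node.
%% Workers is the list of worker node names (atoms) expected to be up.
start(Workers) ->
    spawn(fun() ->
        %% Subscribe to nodeup/nodedown messages. With the
        %% nodedown_reason option, the info list of each nodedown
        %% message tells why the connection was lost.
        ok = net_kernel:monitor_nodes(true, [nodedown_reason]),
        loop(Workers)
    end).

%% Alive is the list of worker nodes still considered connected.
loop(Alive) ->
    receive
        {nodedown, Node, Info} ->
            Reason = proplists:get_value(nodedown_reason, Info),
            io:format("Lost node ~p (reason: ~p)~n", [Node, Reason]),
            %% Force the connection closed; the result is ignored
            %% because the connection is usually already gone.
            _ = erlang:disconnect_node(Node),
            loop(lists:delete(Node, Alive));
        {nodeup, _Node, _Info} ->
            loop(Alive);
        %% Hypothetical query message, so that other processes can ask
        %% which workers are still considered alive.
        {get_alive, From} ->
            From ! {alive, Alive},
            loop(Alive)
    end.
```

A call such as node_watcher:start(Workers) on the user node would start 
the watcher. Note that the 'nodedown' message can only arrive once the 
local node has itself noticed the loss, so such a watcher does not, by 
itself, remove the race described further below.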

However, most of the time I cannot intercept the 'nodedown' information 
soon enough (or at all), and the whole program crashes and burns, with a 
message:

        ** Node 'N' not responding **
        ** Removing (timedout) connection **
        {"init terminating in do_boot",noconnection}

So, my question is: how can I prevent this noconnection from wreaking 
havoc, as it seems to ruin our ability to survive node losses?

If I understand correctly, as these reliability messages are managed 
"out of band", there is always a race condition between receiving them 
and telling all processes to stop interacting with the lost node(s). So, 
if there were no way of resisting 'noconnection' at least temporarily 
(as, whatever we do, there *will* be processes that attempt to send a 
message to a lost node), the whole purpose of the approach would be 
defeated. Unless I misunderstood something?
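
To make the race more concrete, here is a hedged sketch of one way a 
process can interact with a remote worker without being killed when the 
node of that worker is lost. It relies on a documented behaviour: when 
the connection to a node goes down, a monitor on a process of that node 
fires with reason noconnection, whereas a link delivers an exit signal 
with that same reason, which kills any linked process that does not trap 
exits. Whether links are what propagates noconnection up to init in the 
case above is only an assumption. The module name and the request/reply 
message format are hypothetical.

```erlang
-module(noconnection_guard).
-export([call_worker/3, classify/1]).

%% Sends Request to the (possibly remote) worker Pid and waits for its
%% reply. A monitor is used instead of a link: if the node hosting Pid
%% is lost, the caller receives a 'DOWN' message with reason
%% noconnection instead of an exit signal.
%% The {request, ...} / {reply, ...} protocol is hypothetical.
call_worker(Pid, Request, Timeout) ->
    Ref = erlang:monitor(process, Pid),
    Pid ! {request, self(), Ref, Request},
    receive
        {reply, Ref, Result} ->
            %% Drop the monitor and any 'DOWN' message already queued.
            erlang:demonitor(Ref, [flush]),
            {ok, Result};
        {'DOWN', Ref, process, Pid, Reason} ->
            {error, classify(Reason)}
    after Timeout ->
        erlang:demonitor(Ref, [flush]),
        {error, timeout}
    end.

%% Maps the reason carried by a 'DOWN' message to a coarse error tag.
classify(noconnection) -> node_lost;
classify(noproc) -> worker_dead;
classify(Other) -> {worker_crashed, Other}.
```

With such a wrapper, the caller gets {error, node_lost} as an ordinary 
return value and can decide by itself to skip the lost worker, 
independently of when the 'nodedown' message reaches the monitoring 
process.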

A related question: apparently, increasing the kernel net tick time 
(net_ticktime, say from 60 to 300 seconds) does not increase accordingly 
the noconnection time-out that must exist somewhere. As a result, I 
think that, by design, node monitoring can only fail in this case (a 
noconnection will always happen before the monitoring messages have a 
chance to kick in).
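
For reference, a sketch of how the tick time is usually set, as a 
configuration fragment (the file name sys.config is the conventional 
one; the value 300 is the one mentioned above). According to the kernel 
documentation, an unresponsive node is detected within net_ticktime 
plus or minus 25%, and all connected nodes are expected to use the same 
value, so one thing worth checking is that the new setting is really in 
effect on every node, for example with net_kernel:get_net_ticktime/0.

```erlang
%% sys.config (to be loaded on every node, e.g. with "erl -config sys")
[
 {kernel, [
   %% Net tick time, in seconds (the default is 60).
   {net_ticktime, 300}
 ]}
].
```

The same setting can be given on the command line with 
"erl -kernel net_ticktime 300".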

Thanks in advance for any hint!

Best regards,

Olivier.
---------------------------
Olivier Boudeville

EDF R&D : 1, avenue du Général de Gaulle, 92140 Clamart, France
Département SINETICS, groupe ASICS (I2A), bureau B-226
Office : +33 1 47 65 59 58 / Mobile : +33 6 16 83 37 22 / Fax : +33 1 47 
65 27 13


