[erlang-questions] rpc:call has a delay in response when calling a nodedown

Mon Aug 30 09:27:45 CEST 2010

  If the remote node is down, it will take a few seconds (sometimes much 
more) to find that out if you use rpc:call/4,5 straight away. Adding a 
timeout only helps once the call has gone through to the other side; it 
does not take the connection time into account. (For example, 
rpc:call(Node, timer, sleep, [3000], 1000) returns {badrpc,timeout} 
after about one second when Node is already connected.) One tip is to 
have a connection pool that monitors nodes, and to wrap the rpc call so 
that there is one place where you keep track of available nodes and 
always try to call nodes that are known to be up (there are corner 
cases, of course). If you are load balancing you probably already have 
this: you are getting the node name from somewhere, so that somewhere 
could keep track of which node to supply. erlang:monitor_node/2 and 
friends could be used for this.
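
As a rough illustration of the "one place that keeps track of available 
nodes" idea, here is a minimal sketch. The module name node_tracker and 
its functions is_up/1 and call/4 are made up for this example. It 
subscribes with net_kernel:monitor_nodes/1 rather than calling 
erlang:monitor_node/2 per node, which is a substitution on my part, and 
it assumes that something else establishes the connections in the first 
place.

```erlang
-module(node_tracker).
-behaviour(gen_server).

%% Hypothetical helper, not part of OTP: keeps a list of nodes that are
%% currently connected and refuses to rpc to anything else.
-export([start_link/0, is_up/1, call/4]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% True if Node is the local node or a node we have seen come up.
is_up(Node) ->
    gen_server:call(?MODULE, {is_up, Node}).

%% Fail fast with {badrpc, nodedown} instead of waiting for a
%% connection attempt to an unreachable node.
call(Node, M, F, A) ->
    case is_up(Node) of
        true  -> rpc:call(Node, M, F, A);
        false -> {badrpc, nodedown}
    end.

init([]) ->
    %% Ask for {nodeup, Node} / {nodedown, Node} messages.
    ok = net_kernel:monitor_nodes(true),
    {ok, [node() | nodes()]}.

handle_call({is_up, Node}, _From, Up) ->
    {reply, lists:member(Node, Up), Up}.

handle_cast(_Msg, Up) ->
    {noreply, Up}.

handle_info({nodeup, Node}, Up) ->
    {noreply, [Node | lists:delete(Node, Up)]};
handle_info({nodedown, Node}, Up) ->
    {noreply, lists:delete(Node, Up)};
handle_info(_Other, Up) ->
    {noreply, Up}.

terminate(_Reason, _Up) -> ok.

code_change(_OldVsn, Up, _Extra) -> {ok, Up}.
```

With this, node_tracker:call(Node, M, F, A) answers immediately for a 
node that is not connected. The corner cases remain: a node can go down 
between the check and the call, and a node that has never been connected 
is reported as down until something (for instance a background process 
doing net_adm:ping/1) connects to it.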

Another thing to consider is that no matter how careful you are, 
networks are always hard to predict and can fluctuate A LOT. If you 
build the system in such a way that 7 seconds (as opposed to 1-2 
seconds) "can lead to real problems", then I would argue that you are 
doing it wrong. The system has to be resilient enough to fall back on 
alternative nodes when one node is unresponsive, even if it takes 120 
seconds (TCP/IP timeout) to notice. I don't know how hard your 
requirements are on these things, but assumptions about network latency 
and availability are extremely hard to get right, and the best 
countermeasure is to make such failures part of the design.
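
To make "fall back on alternative nodes" concrete, here is a small 
sketch. The module rpc_fallback and the function call_any/5 are 
hypothetical names for this example, and the list of candidate nodes is 
assumed to come from your own configuration or load balancer.

```erlang
-module(rpc_fallback).
-export([call_any/5]).

%% Try each candidate node in order and return the first answer that is
%% not a {badrpc, _} tuple. Note that a crash in the remote function is
%% also reported by rpc as {badrpc, _}, so it triggers the fallback too.
call_any([], _M, _F, _A, _Timeout) ->
    {error, all_nodes_failed};
call_any([Node | Rest], M, F, A, Timeout) ->
    case rpc:call(Node, M, F, A, Timeout) of
        {badrpc, _Reason} -> call_any(Rest, M, F, A, Timeout);
        Result            -> {ok, Node, Result}
    end.
```

For example, rpc_fallback:call_any([a@host1, b@host2], erlang, whereis, 
[some_process], 1000) would try a@host1 first and b@host2 only if that 
fails (both node names are placeholders). The time spent on each 
unreachable node is still whatever the connection attempt takes, so the 
caller has to be able to live with that.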

/Mazen

On 30/08/2010 10:50, Magda Mansour wrote:
> Hello,
>
> We are encountering the following problem :
>
> The command below responds after a few seconds (~7 seconds)
>     rpc:call(dummy_node@REDACTED, erlang, whereis, [dummy_process]).
>     {badrpc,nodedown}
> where dummy_host is defined in /etc/hosts but is unreachable
>
> Adding a timeout as the fifth argument does not speed up the response 
> of this command :
>     rpc:call(dummy_node@REDACTED, erlang, whereis, [dummy_process], 1000).
>     {badrpc,nodedown}
>
> We are using OTP-R11B-5 on Red Hat Enterprise Linux ES release 4 (Nahant
> Update 8) or Ubuntu
>
> As a workaround, we were obliged to check if dummy_node@REDACTED is
> part of [node()|nodes()] before calling rpc:call.
>
> Is this delay considered the normal behaviour of rpc:call ? Is there
> another workaround or an alternative function to use ?
>
> This example may seem contrived, but on a live production cluster, where
> one node has a problem, this delay can lead to real problems.
>
> Thank you in advance,
> Magda Mansour