[erlang-questions] code_server:call/2 problem?

Olivier BOUDEVILLE olivier.boudeville@REDACTED
Mon Mar 21 13:22:37 CET 2011


Hi,

Thanks for your answer. Indeed, this could have been an explanation; however, 
the node is crashing after that error, not before nor "in parallel" to the 
error. 

Actually I believe there is a bug in the Erlang runtime. I strongly 
suspect there is a small time window during which a race condition can 
occur: apparently code:load_binary can be triggered (thanks to 
rpc:multicall) on a just-launched node before at least one of its system 
processes succeeds in registering its name. At least that's what I came to 
think after having peered at lib/kernel/src/code_server.erl: the badarg 
that occurred may come from the fact that call/2 is called while Name is 
not registered (yet), in:

"""
call(Name, Req) ->
    Name ! {code_call, self(), Req},
    receive
        {?MODULE, Reply} ->
            Reply
    end.
"""
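
To illustrate why an unregistered Name would produce exactly a badarg, here is a minimal sketch. The module and function names are hypothetical, and the message sent is a dummy one. It relies only on the documented behaviour of the send operator: sending to an atom that is not a locally registered name raises badarg.

```erlang
-module(unregistered_send_demo).
-export([send_to/1]).

%% Sends a dummy message to a locally registered name and reports the outcome.
%% Hypothetical helper: it only shows that "Name ! Msg" raises badarg when
%% no process is registered under Name.
send_to(Name) when is_atom(Name) ->
    try
        Name ! {code_call, self(), ping},
        sent
    catch
        error:badarg ->
            {not_registered, Name}
    end.
```

Called with an atom that no process has registered, send_to/1 returns {not_registered, Name}, which would be consistent with the badarg reported in code_server:call/2.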

As a test: on our short test case (run on Ubuntu 64-bit on a 4-core 
Core i7 laptop), the non-systematic crash took on average 30 seconds (a 
loop of ~15 attempts) to appear. Once I inserted a timer:sleep(1000) in my 
deployment manager between the launch of the remote VM and the call to 
rpc:multicall, it never happened again, with the same loop running for 
more than one hour (knowing that intermediate checks, like an Erlang ping 
of the remote node and a check of the remote Erlang version, always 
succeeded). 
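
As a possible alternative to the fixed sleep, one could poll the remote node until the name is registered. This is only a sketch under assumptions: the module name, the retry count and the delay are hypothetical, and it assumes the code server registers itself locally as code_server. It does not prove the race described above; it merely waits for the condition suspected to be missing.

```erlang
-module(wait_code_server).
-export([wait/3]).

%% Polls Node until a process is registered there as code_server,
%% or gives up after Retries attempts spaced by DelayMs milliseconds.
wait(_Node, 0, _DelayMs) ->
    {error, timeout};
wait(Node, Retries, DelayMs) when Retries > 0 ->
    case rpc:call(Node, erlang, whereis, [code_server]) of
        Pid when is_pid(Pid) ->
            ok;
        _NotReady ->
            %% Either undefined (not registered yet) or {badrpc, Reason}.
            timer:sleep(DelayMs),
            wait(Node, Retries - 1, DelayMs)
    end.
```

A deployment manager could call wait_code_server:wait(Node, 50, 20) for each launched node before issuing the rpc:multicall.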

I suppose the runtime has a kind of synchronous barrier where all 
system processes are checked to be up and ready (including appropriately 
registered) before serving user-space requests, but probably at least 
one system process was forgotten, which led to such a race condition. 
Unless I am mistaken?

Thanks in advance for any answer,
Best regards,

Olivier.
---------------------------
Olivier Boudeville

EDF R&D : 1, avenue du Général de Gaulle, 92140 Clamart, France
Département SINETICS, groupe ASICS (I2A), bureau B-226
Office : +33 1 47 65 59 58 / Mobile : +33 6 16 83 37 22 / Fax : +33 1 47 
65 27 13



mevans@REDACTED 
Sent by: erlang-questions@REDACTED
18/03/2011 19:22

To
olivier.boudeville@REDACTED, erlang-questions@REDACTED
cc

Subject
RE: [erlang-questions] code_server:call/2 problem?






Do you see any crashes on the remote nodes?

It does look like the remote code_server process has received a request, 
but for some reason it fails (corrupted data perhaps?). I'm wondering if 
you could attach a remote shell to one of those nodes and trace the 
code_server module?

-----Original Message-----
From: erlang-questions@REDACTED [mailto:erlang-questions@REDACTED] On 
Behalf Of Olivier BOUDEVILLE
Sent: Friday, March 18, 2011 1:32 PM
To: erlang-questions@REDACTED
Subject: [erlang-questions] code_server:call/2 problem?

Hi,

We are running a distributed Erlang program on a user node from which a 
number of computing nodes are spawned, via SSH for the remote hosts. To 
perform the automatic deployment, two deployment-related modules are sent 
to each of the spawned nodes, using the traditional approach (first a call 
to code:get_object_code/1 then a rpc:multicall of code:load_binary).
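
For readers unfamiliar with that approach, it can be sketched roughly as follows. The module and function names are hypothetical and this is not our actual deployment code; only code:get_object_code/1, code:load_binary/3 and rpc:multicall/5 are the real calls involved.

```erlang
-module(deploy_sketch).
-export([deploy/3]).

%% Fetches the object code of Module on the local node, then asks every
%% node in Nodes to load it. Returns the {ResList, BadNodes} pair from
%% rpc:multicall/5, or an error tuple if the object code cannot be found.
deploy(Module, Nodes, Timeout) ->
    case code:get_object_code(Module) of
        {Module, Binary, Filename} ->
            rpc:multicall(Nodes, code, load_binary,
                          [Module, Filename, Binary], Timeout);
        error ->
            {error, {no_object_code, Module}}
    end.
```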

However, sometimes (not frequently), with the exact same settings, the 
first module cannot be deployed successfully. We have indeed:

{ResList,BadNodes} = rpc:multicall( Nodes, code, load_binary, [ 
ModuleName, ModuleFilename, ModuleBinary ], Timeout ),

that returns:
ResList = 
[{badrpc,{'EXIT',{badarg,[{code_server,call,2},{rpc,'-handle_call_call/6-fun-0-',5}]}}}]
BadNodes = []

This happens with R14B02, but most probably with previous versions as 
well.
Apparently this happens often (always?) on a node created on the user 
host.
I am pretty sure the deployed node is "fresh" (blank, vanilla).
And ignoring the badrpc will result in an undef error as soon as the first 
function of the first helper module is called, even if the call is delayed 
(a race condition was suspected in case the actual loading was 
asynchronous).
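
Since an ignored badrpc only resurfaces later as an undef, the multicall result is better checked right away. A minimal sketch, with hypothetical module and function names, assuming that a successful code:load_binary/3 returns {module, Module}:

```erlang
-module(multicall_check).
-export([check/2]).

%% Returns ok only if every node answered with {module, _} and no node
%% was reported as bad; otherwise returns the offending results.
check(ResList, BadNodes) ->
    Failures = [R || R <- ResList, not is_loaded(R)],
    case {Failures, BadNodes} of
        {[], []} ->
            ok;
        _ ->
            {error, {Failures, BadNodes}}
    end.

%% Anything other than {module, _} (e.g. {error, _} or {badrpc, _})
%% counts as a failed load.
is_loaded({module, _Module}) -> true;
is_loaded(_Other) -> false.
```

With the ResList and BadNodes shown above, check/2 would return an error tuple carrying the badrpc term, so the deployment could be retried or aborted instead of continuing.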

Would anyone see a cause for such a non-systematic badarg error?

Thanks in advance for any hint,
Best regards,

Olivier Boudeville.



This message and any attachments (the 'Message') are intended solely for 
the addressees. The information contained in this Message is confidential. 
Any use of information contained in this Message not in accord with its 
purpose, any dissemination or disclosure, either whole or partial, is 
prohibited except formal approval.

If you are not the addressee, you may not copy, forward, disclose or use 
any part of it. If you have received this message in error, please delete 
it and all copies from your system and notify the sender immediately by 
return message.

E-mail communication cannot be guaranteed to be timely, secure, error-free 
or virus-free.







