[erlang-questions] code_server:call/2 problem?
Mon Mar 21 13:22:37 CET 2011
Thanks for your answer. Indeed, this could have been the explanation; however
the node is crashing after that error, not before nor "in parallel" to the
Actually I believe there is a bug in the Erlang runtime. I strongly
suspect there is a small time window during which a race condition can
occur: apparently code:load_binary can be triggered (thanks to
rpc:multicall) on a just-launched node before at least one of its system
processes succeeds in registering its name. At least that's what I came to
think after having looked at lib/kernel/src/code_server.erl: the badarg
that occurred may come from call/2 being called while Name is
not registered (yet), in:
call(Name, Req) ->
    Name ! {code_call, self(), Req},
    receive
        {?MODULE, Reply} ->
            Reply
    end.
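The suspected failure mode can be reproduced in isolation: sending with `!` to an atom that is not a registered name raises badarg on the sending side. A minimal sketch follows; the module name badarg_demo and the function send_to/1 are hypothetical and are not part of code_server.erl.

```erlang
-module(badarg_demo).
-export([send_to/1]).

%% Sends a message to a registered name and reports whether the send
%% itself failed. `Name ! Msg` raises badarg when Name is an atom that
%% is not (or not yet) registered on the local node.
send_to(Name) when is_atom(Name) ->
    try
        Name ! ping,
        ok
    catch
        error:badarg ->
            {error, not_registered}
    end.
```

If the race described above is real, the same badarg would be raised inside call/2 on the freshly started node, and rpc would then report it back to the caller.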
As a test, I inserted a timer:sleep(1000) in my deployment manager between
the launch of the remote VM and the call to rpc:multicall. Without it, the
non-systematic crash took on average 30 seconds (a loop of ~15 attempts) to
show up on our short test case (run on Ubuntu 64-bit on a 4-core Core i7
laptop); with it, the same loop ran for more than one hour without a single
crash. Note that the intermediate checks (Erlang ping of the remote node,
check of the remote Erlang version) always succeeded.
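Instead of a fixed sleep, the deployment manager could poll the remote node until the relevant name is registered. The following is only a sketch under assumptions: the module node_ready and the function wait_registered/4 are hypothetical, and it is assumed (not verified) that waiting for code_server to be registered is enough to close the window.

```erlang
-module(node_ready).
-export([wait_registered/4]).

%% Polls Node until Name is registered there. One attempt is always made,
%% followed by at most Retries further attempts, SleepMs milliseconds apart.
%% Returns ok, or {error, timeout} if the name never showed up.
wait_registered(Node, Name, Retries, SleepMs) ->
    case rpc:call(Node, erlang, whereis, [Name]) of
        Pid when is_pid(Pid) ->
            ok;
        _NotYet when Retries > 0 ->
            %% undefined, or {badrpc, Reason} while the node is still booting
            timer:sleep(SleepMs),
            wait_registered(Node, Name, Retries - 1, SleepMs);
        _NotYet ->
            {error, timeout}
    end.
```

It would be called as, for example, node_ready:wait_registered(Node, code_server, 20, 50) for each spawned node before the rpc:multicall.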
I suppose the runtime has some kind of synchronisation barrier ensuring that
all system processes are up and ready (including properly registered) before
user-space requests are served, but that at least one system process was
left out of it, which led to this race condition.
Unless I am mistaken?
Thanks in advance for any answer,
Best regards,
Olivier Boudeville
EDF R&D : 1, avenue du Général de Gaulle, 92140 Clamart, France
Département SINETICS, groupe ASICS (I2A), bureau B-226
Office : +33 1 47 65 59 58 / Mobile : +33 6 16 83 37 22 / Fax : +33 1 47
65 27 13
Do you see any crashes on the remote nodes?
It does look like the remote code_server application has got a request,
but it for some reason fails (corrupted data perhaps?). I'm wondering if
you could attach a remote shell to one of those nodes and trace the
code_server module?
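For reference, such a trace could be set up with the standard dbg module once a shell is attached to the node (for instance with erl -remsh). This is a rough sketch; the wrapper module trace_code_server is hypothetical, and only the dbg calls themselves come from OTP.

```erlang
-module(trace_code_server).
-export([start/0, stop/0]).

%% Starts a dbg tracer and traces all calls (exported and local) to
%% functions of the code_server module. The built-in match spec alias
%% `x` also reports return values and exceptions.
start() ->
    {ok, _Tracer} = dbg:tracer(),
    {ok, _} = dbg:p(all, c),
    {ok, _} = dbg:tpl(code_server, x),
    ok.

%% Stops the tracer started by start/0.
stop() ->
    dbg:stop().
```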
We are running a distributed Erlang program on a user node from which a
number of computing nodes are spawned, via SSH for the remote hosts. To
perform the automatic deployment, two deployment-related modules are sent
to each of the spawned nodes, using the traditional approach (first a call
to code:get_object_code/1, then an rpc:multicall of code:load_binary).
However, sometimes (not frequently), with the exact same settings, the
first module cannot be deployed successfully. We have indeed:
{ResList,BadNodes} = rpc:multicall( Nodes, code, load_binary, [
ModuleName, ModuleFilename, ModuleBinary ], Timeout ),
that returns:
ResList =
BadNodes = []
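Since BadNodes can be empty while individual results still report a failure, both parts of the returned pair need checking. A small sketch of such a check follows; the module deploy_check and the function check_results/2 are hypothetical names. It relies on code:load_binary/3 returning {module, Module} on success.

```erlang
-module(deploy_check).
-export([check_results/2]).

%% Classifies the {ResList, BadNodes} pair returned by rpc:multicall/5
%% for code:load_binary/3. A successful load yields {module, Module};
%% anything else ({error, What}, {badrpc, Reason}, ...) is a failure.
check_results(Module, {ResList, BadNodes}) ->
    Failures = [R || R <- ResList, R =/= {module, Module}],
    case {Failures, BadNodes} of
        {[], []} ->
            ok;
        _ ->
            {error, [{failed_results, Failures}, {bad_nodes, BadNodes}]}
    end.
```

With the variables used above, this would be deploy_check:check_results(ModuleName, {ResList, BadNodes}).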
This happens with R14B02, but most probably with previous versions as
Apparently this happens often (always?) on a node created on the user
I am pretty sure the deployed node is "fresh" (blank, vanilla).
And ignoring the badrpc will result in an undef error as soon as the first
function of the first helper module is called, even if delaying the call
(a race condition was suspected if ever the actual loading was
Would anyone see a cause for such a badarg non-systematic error?
Thanks in advance for any hint,
Best regards,
Olivier Boudeville.
