[erlang-questions] global:register_name/2 hang?

Bernard Duggan bduggan@REDACTED
Thu Aug 14 08:42:45 CEST 2014


Hi list - it's been a while! Is everyone well? Excellent.

We ran into a production issue the other day that I've been scratching
my head over for the past three days: one of our supervised, 'permanent'
processes crashed, which should have been fine (for some value of
"fine"), but it didn't restart for 37 minutes.

After much coffee, staring at our logs and poring over both our and the
OTP code, it looks like the only plausible culprit for this delay is a
hang when the supervisor is restarting the process and gen_server goes
to register the new process with the global nameserver (this particular
process is globally registered). Now obviously crying "there's a bug in
OTP" should always be an option of last resort, but since the very
first line of our gen_server's init function prints out a debug line,
and that line doesn't appear until 37 minutes later, there's just not
too much scope for looking elsewhere (believe me, I've tried :).
global:register_name/2 seems like an especially promising candidate
since it ultimately makes a gen_server call with an infinite timeout
(I'm using/looking at R16B01 here, but nothing in change logs from
earlier/later versions suggests that anything significant has changed in
this area).
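
One way to test this theory would be to put a watchdog around the
registration call. The following is only a minimal sketch: the module
and function names (global_reg_probe, register_with_timeout/3) are
hypothetical and not part of OTP, and it assumes the process is started
unnamed and registered explicitly rather than via {global, Name} in
gen_server:start_link. It runs global:register_name/2 in a short-lived
worker so that the caller can give up, and log the fact, after a chosen
timeout.

```erlang
-module(global_reg_probe).
-export([register_with_timeout/3]).

%% Hypothetical diagnostic helper, not part of OTP.
%% Runs global:register_name/2 in a worker process so that the caller
%% is not blocked indefinitely if the call hangs.
%% Returns yes | no | {error, timeout} | {error, Reason}.
register_with_timeout(Name, Pid, TimeoutMs) ->
    Caller = self(),
    Ref = make_ref(),
    {Worker, MRef} = spawn_monitor(fun() ->
        Caller ! {Ref, global:register_name(Name, Pid)}
    end),
    receive
        {Ref, Result} ->
            %% Normal case: drop the monitor and any pending 'DOWN'.
            erlang:demonitor(MRef, [flush]),
            Result;
        {'DOWN', MRef, process, Worker, Reason} ->
            %% The worker died before it could reply.
            {error, Reason}
    after TimeoutMs ->
        %% Assumption: abandoning the attempt is acceptable here. The
        %% effect of this on global's internal state is not verified.
        exit(Worker, kill),
        receive {'DOWN', MRef, process, Worker, _} -> ok end,
        %% A reply may have arrived just before the kill took effect.
        receive
            {Ref, Late} -> Late
        after 0 ->
            {error, timeout}
        end
    end.
```

A call such as global_reg_probe:register_with_timeout(my_name, Pid, 5000)
returning {error, timeout} would at least point to the registration step
as the place where the time goes, without explaining why.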

In looking for any kind of clues, I also came across these two posts:
http://erlang.org/pipermail/erlang-questions/2010-November/054700.html
http://erlang.org/pipermail/erlang-questions/2012-May/066369.html

both of which seem to describe something very similar to what I think
happened here. Unfortunately, neither garnered any response.

So here are my questions, I guess: Has anyone else encountered anything
similar? Can anyone with better knowledge than me of global's inner
workings suggest a mechanism whereby this could occur (and even better,
how to avoid it)? Alternatively, can anyone suggest any good candidates
for looking at really closely within global (or elsewhere) to try to pin
this problem down?

It's probably worth mentioning that our system doesn't have many nodes -
usually it's the main (Erlang) node, three JInterface nodes, one C node
(created by erld) and one other native Erlang node that comes and goes
on a regular basis (at something like 5-minute intervals).

Thanks all.

Cheers,

Bernard
