epmd death

Sat Sep 4 19:56:28 CEST 2004

Hi Kent,

Thanks for the reply. Luckily, I have never had epmd crash on me.
Just to add to the discussion: even if you kill epmd after all of your
distributed nodes are communicating, it doesn't matter, because the
nodes already know about each other. You are only in trouble if you try
to contact a third node that was never previously contacted. By then I
should be able to detect a downed epmd and restart it before any new
nodes come online.
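
The detect-and-restart step could be done from outside the nodes with
something like the sketch below. It rests on a few assumptions: epmd
listens on its default port 4369 on the local machine, `epmd -daemon`
is on the PATH, and the names `epmd_alive` and `ensure_epmd` are made up
for illustration. A freshly started epmd has an empty registry, so this
only helps nodes that come online afterwards.

```python
import socket
import subprocess

EPMD_PORT = 4369  # epmd's default listening port


def epmd_alive(host="127.0.0.1", port=EPMD_PORT, timeout=1.0):
    """Return True if something accepts TCP connections on the epmd port."""
    try:
        # A successful connect is taken as "epmd is up"; nothing is sent.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, unreachable host, etc.
        return False


def ensure_epmd(port=EPMD_PORT, start_cmd=("epmd", "-daemon")):
    """Start epmd if nothing is listening.

    Returns True if a restart was attempted, False if epmd was already up.
    start_cmd is a parameter only so the command can be swapped out
    (assumption: "epmd -daemon" is the right command on this machine).
    """
    if epmd_alive(port=port):
        return False
    subprocess.run(list(start_cmd), check=False)
    return True
```

Calling `ensure_epmd()` every few seconds from a cron job or a small
loop would be enough. Because a single external process does the
restarting, the nodes on the machine never race each other to do it.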

As for the other platforms, can they detect a bad write on the epmd
socket? Would they get this sort of indication? If they could, then it
would be fairly safe to add reconnection logic to epmd.
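
To make the question concrete: on a platform that does report a closed
peer, the check on the client side of the socket might look roughly
like this. It is a hypothetical illustration in Python, not anything
taken from erl_epmd, and `peer_closed` is an invented name. On an OS
that never signals the close, the function would simply keep returning
False, which is the case the question is about.

```python
import select
import socket


def peer_closed(sock, timeout=0.0):
    """Return True if the peer has closed its end of a connected socket."""
    # Wait up to `timeout` seconds for the socket to become readable.
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        return False  # nothing pending, so no evidence of a close
    try:
        # An orderly close shows up as a zero-length read. MSG_PEEK leaves
        # any real data in the buffer for the normal reader.
        return sock.recv(1, socket.MSG_PEEK) == b""
    except (ConnectionResetError, BrokenPipeError):
        # The connection was torn down abruptly.
        return True
```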

Thanks
Ernie
----- Original Message ----- 
From: "Kent Boortz" <kent@REDACTED>
To: "Ernie Makris" <ernie.makris@REDACTED>
Cc: <erlang-questions@REDACTED>
Sent: Saturday, September 04, 2004 1:36 PM
Subject: Re: epmd death

>
> "Ernie Makris" <ernie.makris@REDACTED> writes:
> > I have a concern that if epmd for some reason crashes, then my
> > distributed nodes can't contact each other even if a new epmd is
> > started. I have two distributed nodes set up on the same machine.
> > I then kill epmd and try to have one node rpc to another, which
> > gives me a {badrpc,nodedown}.
> >
> > I took a look at net_kernel and erl_epmd, and there doesn't appear
> > to be a reconnection feature. Has anyone ever had problems with this
> > happening? Is there any workaround?
> >
> > Of course I could set up a separate socket and communicate through
> > that, but it defeats the purpose of distributed Erlang :(
>
> Epmd is written to be small and simple to avoid problems with it
> crashing. There have been very few bug reports (only one serious one
> that I can remember) since the code was cleaned up and test cases
> were added many years back.
>
> But it can of course happen (*). The Erlang node keeps a socket
> connection to epmd, so it should not be that hard for an Erlang node
> to detect that epmd has died and try to restart it. For compatibility
> with VxWorks and other OSes that don't detect a close on a socket,
> there should probably be some sort of periodic ping between the node
> and epmd. The only complication with the restarting is that several
> nodes on the same machine may all try to restart epmd at the same
> time. But this is not that hard to handle.
>
> kent
>
> (*) It is known that there have been product setups that use "in
> place" updates of the epmd binary. If you upgrade the binary for a
> running program, the program will die on some (most/all?) Unixes. The
> program may not die immediately when the binary is updated; it may
> take some time until the OS runs into problems because the original
> binary is missing. Other than that, there are no known problems with
> epmd that I'm aware of, except the fact that a simple "epmd -kill" by
> any user on a machine will kill epmd ;-)
>
> -- 
> Kent Boortz, Senior Software Developer
> MySQL AB, www.mysql.com
> Office: +46 8 590 910 63
> Mobile: +46 70 279 11 71