[erlang-questions] two epmds running

Bob Ippolito <>
Tue Mar 16 18:13:54 CET 2010


On Tue, Mar 16, 2010 at 9:10 AM, Garrett Smith <> wrote:
> On Tue, Mar 16, 2010 at 9:36 AM, Bob Ippolito <> wrote:
>> On Tue, Mar 16, 2010 at 2:23 AM, Anthony Shipman <> wrote:
>>> Sometimes it happens that I discover two epmd processes running. One of
>>> them is in a tight loop consuming 100% of CPU time. My guess is that the
>>> second one is started automatically because the first one is no longer
>>> responding. Is this a known bug in epmd?
>>
>> I think we have seen this before, one of them is probably violently
>> logging "epmd: epmd: error in accept" as well. We have only seen this
>> on boot-up of a machine, probably due to several Erlang VMs trying to
>> start up at the same time. We don't currently have a solution for this
>> issue (mostly because we don't know the root cause yet).
>>
>> I am not sure we get two of them; it might be just one in our case.
>
> I haven't seen two running, but I've seen none running, which is a
> real bummer. I've written a monitor process (probably gen_fsm based)
> that keeps an eye on epmd and starts it and reinitializes it when it
> goes away. A properly functioning epmd is important enough that you
> might consider something similar to ensure that, in your case, that
> rogue process is dealt with (killed?).
>
> I suppose that's somewhat flippant -- to say "write your own monitor
> for this" -- but losing epmd is like losing your network and people go to
> great lengths to keep networks up.

Yeah, absolutely, it needs to be killed when it's in that state. It eats
up a lot of CPU, spews endless crap to syslog, and breaks Erlang
distribution on that node. We haven't seen it often enough to feel too
much pain yet, but it's on our roadmap to try to reproduce it and then
fix or work around it.

When we kill it, we also bring down all of the applications on that
node, which sucks because we can't shut them down cleanly: doing that
(at least by the means our tools know) depends on epmd being up.
Fortunately we have only seen this happen just after a reboot.

-bob
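Both Garrett's monitor idea and Bob's point that a wedged epmd must be killed depend on telling a healthy epmd from one that still holds its port but no longer answers. The following is a minimal Python sketch of such a liveness probe, not the gen_fsm monitor Garrett describes. It assumes epmd listens on the default port 4369 and that it answers the NAMES request (code 110) from the Erlang distribution protocol with a 4-byte big-endian port number followed by lines of the form `name <node> at port <port>`. The function names `names_request`, `parse_names_response`, and `epmd_responsive` are hypothetical and for illustration only.

```python
import socket
import struct

EPMD_PORT = 4369   # assumed default epmd port (can be overridden via ERL_EPMD_PORT)
NAMES_REQ = 110    # ord('n'): the epmd NAMES request code


def names_request() -> bytes:
    """Encode a NAMES request: 2-byte big-endian length, then the request byte."""
    return struct.pack(">HB", 1, NAMES_REQ)


def parse_names_response(data: bytes):
    """Split a NAMES reply into (epmd_port, [(node_name, node_port), ...]).

    Assumed reply layout: a 4-byte big-endian port number followed by text
    lines of the form "name <node> at port <port>".
    """
    if len(data) < 4:
        raise ValueError("truncated NAMES response")
    (epmd_port,) = struct.unpack(">I", data[:4])
    nodes = []
    for line in data[4:].decode("latin-1").splitlines():
        parts = line.split()
        if len(parts) == 5 and parts[0] == "name" and parts[2:4] == ["at", "port"]:
            nodes.append((parts[1], int(parts[4])))
    return epmd_port, nodes


def epmd_responsive(host="127.0.0.1", port=EPMD_PORT, timeout=2.0) -> bool:
    """Return True only if epmd accepts a connection and answers NAMES in time.

    A spinning epmd may still own the port without ever replying, so a bare
    connect() is not treated as proof of health; a well-formed reply is required.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(names_request())
            chunks = []
            while True:
                chunk = sock.recv(4096)  # each recv is bounded by `timeout`
                if not chunk:            # epmd closes the connection after replying
                    break
                chunks.append(chunk)
        parse_names_response(b"".join(chunks))
        return True
    except (OSError, ValueError):        # refused, timed out, or malformed reply
        return False
```

A watchdog could call `epmd_responsive()` periodically and, after several consecutive failures, kill the stuck process with an OS signal and start a fresh one with `epmd -daemon`. `epmd -kill` is not a substitute here, since it asks epmd to exit over the same TCP protocol and so depends on epmd still answering.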

