[erlang-questions] two epmds running

Wed Mar 17 02:07:39 CET 2010

We have seen the same behavior Bob describes.  It occurs only when
rebooting a server that has two or more Erlang virtual machines started
by init.  When it happens, epmd's error logging can easily consume a
significant amount of disk space in the /var/log directory.  We do not
know how to trigger the problem directly.
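
For anyone building the kind of monitor Garrett mentions below, here is a
rough sketch of a liveness probe, in Python rather than Erlang so it does
not depend on the thing it is checking. It sends epmd's NAMES_REQ (the
request behind "epmd -names") to the default port 4369 and treats a
refused connection, a timeout, or a short reply as unhealthy. The function
names and the 2 second timeout are my own assumptions, and I am only
guessing that an epmd stuck in the accept loop will fail to answer in
time. What to do with an unhealthy epmd (kill it, restart it) is left out.

```python
import socket
import struct

EPMD_PORT = 4369   # epmd's default listening port
NAMES_REQ = 110    # request tag for "list registered nodes" (ASCII 'n')


def parse_names_response(data):
    """Split a NAMES_REQ reply into (epmd_port, {node_name: port})."""
    if len(data) < 4:
        raise ValueError("short reply from epmd")
    # The reply starts with epmd's own port as a 4-byte big-endian integer.
    (epmd_port,) = struct.unpack(">I", data[:4])
    nodes = {}
    for line in data[4:].decode("latin-1").splitlines():
        # Each remaining line looks like: "name foo at port 51234"
        parts = line.split()
        if len(parts) == 5 and parts[0] == "name" and parts[2:4] == ["at", "port"]:
            nodes[parts[1]] = int(parts[4])
    return epmd_port, nodes


def epmd_names(host="127.0.0.1", port=EPMD_PORT, timeout=2.0):
    """Ask epmd for its registered nodes; raises OSError if it does not answer."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # Request framing: 2-byte big-endian length, then the request body.
        sock.sendall(struct.pack(">HB", 1, NAMES_REQ))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # epmd closes the connection after replying
                break
            chunks.append(chunk)
    return parse_names_response(b"".join(chunks))


def epmd_is_healthy(**kwargs):
    """Hypothetical helper: True if epmd answered a NAMES_REQ in time."""
    try:
        epmd_names(**kwargs)
        return True
    except (OSError, ValueError):
        return False
```

A watchdog could call epmd_is_healthy() every few seconds and only act
after several consecutive failures, so that one slow reply during boot
does not trigger a restart.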

On Wed, 17 Mar 2010 02:13:54 +0900, Bob Ippolito <bob@REDACTED> wrote:

> On Tue, Mar 16, 2010 at 9:10 AM, Garrett Smith <g@REDACTED> wrote:
>> On Tue, Mar 16, 2010 at 9:36 AM, Bob Ippolito <bob@REDACTED> wrote:
>>> On Tue, Mar 16, 2010 at 2:23 AM, Anthony Shipman <als@REDACTED> wrote:
>>>> Sometimes it happens that I discover two epmd processes running. One of
>>>> them is in a tight loop consuming 100% of CPU time. My guess is that the
>>>> second one is started automatically because the first one is no longer
>>>> responding. Is this a known bug in epmd?
>>>
>>> I think we have seen this before, one of them is probably violently
>>> logging "epmd: epmd: error in accept" as well. We have only seen this
>>> on boot-up of a machine, probably due to several Erlang VMs trying to
>>> start up at the same time. We don't currently have a solution for this
>>> issue (mostly because we don't know the root cause yet).
>>>
>>> I am not sure we get two of them, it might be just one in our case.
>>
>> I haven't seen two running, but I've seen none running, which is a
>> real bummer. I've written a monitor process (probably gen_fsm based)
>> that keeps an eye on epmd and starts it and reinitializes it when it
>> goes away. A properly functioning epmd is important enough that you
>> might consider something similar to ensure that, in your case, the
>> rogue process is dealt with (killed?).
>>
>> I suppose that's somewhat flippant -- to say write your own monitor
>> for this, but losing epmd is like losing your network and people go to
>> great lengths to keep networks up.
>
> Yeah absolutely it needs to be killed when it's in that state. It eats
> up a lot of CPU, spews endless crap to syslog, and breaks erlang
> distribution on that node. We haven't seen it often enough to feel too
> much pain yet but it's something on our roadmap to try and reproduce
> and fix or work around it.
>
> When we kill it we also bring down all of the applications on that
> node, which sucks because we can't shut them down cleanly since doing
> that (at least by the means that our tools know how) depends on epmd
> being up. Fortunately we have only seen this happen just after a
> reboot.
>
> -bob

-- 
norton@REDACTED