[erlang-questions] heart issue

Serge Aleynikov serge@REDACTED
Thu Dec 14 01:14:36 CET 2006


Greetings!

I'd like to ask a question that someone might have dealt with in the 
past.  We have a node running with the -heart option:

drp@REDACTED: ~$ pstree -p drp
run_erl(24196)───beam(24197)─┬─heart(24236)
                              ├─inet_gethost(24243)───inet_gethost(24244)
                              └─sh(24257)


drp@REDACTED: ~$ ps auxww | grep beam
drp      24197  0.2  4.6 562480 193724 pts/2 Ssl+ 17:58   0:44 
/home/drp/dripdb/erts-5.5/bin/beam -A32 -Bi -- -root /home/drp/dripdb 
-progname dripdb -- -home /home/drp -boot 
/home/drp/dripdb/releases/1.1/start -config 
/home/drp/dripdb/releases/1.1/sys -sname drpdb -mnesia dir 
"/home/drp/dripdb/var/data/mnesia" -heart -kernel dist_auto_connect once

Once in a blue moon (once a month or so) the heart port program 
detects that its read pipe from Erlang has closed and restarts the 
emulator.  The emulator doesn't show any issues with memory or any 
other problems.  The number of running processes is small.  The last 
messages in the console (logged by run_erl) were:

...
[drpdb/drpdb:1102] Reading file "switch.txt" (3522 bytes)
[drpdb/drpdb:1114] Read 105 lines (0.0 s).
[drpdb/drpdb:1102] Reading file "dial_code.txt" (6700430 bytes)
heart: Wed Dec 13 17:58:02 2006: Erlang has closed.

Note that the "Reading file Filename" message is printed before calling 
file:consult(Filename), which is CPU-intensive (parsing that file takes 
15 seconds, during which CPU is at 100% on this multi-CPU machine).  The 
last message above is printed by the "heart" program when it detects an 
EOF on the read file descriptor, after which it brutally kills and 
restarts the emulator.

What may cause the read on that file descriptor used by "heart" to 
return 0 (EOF)?
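
As background for the question above: on a POSIX pipe, read() returns 0 
only once every descriptor referring to the write end has been closed. 
The following is a minimal sketch of that general behaviour, not code 
taken from heart.c; the helper names are hypothetical and it makes no 
claim about what actually closed the descriptor in this case.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper, not part of heart.c: create a pipe, close its only
 * write end, and return what read() reports on the read end.
 * Returns -2 if the setup itself fails. */
static long read_after_writer_close(void)
{
    int fds[2];
    char byte;
    long n;

    if (pipe(fds) != 0)
        return -2;

    close(fds[1]);                      /* no writer is left */
    n = (long)read(fds[0], &byte, 1);   /* reports EOF instead of blocking */
    close(fds[0]);
    return n;
}

/* Same setup, but a duplicate of the write end stays open, as it would if
 * another process had inherited it. One byte is written through the
 * duplicate so that read() has data to return instead of blocking. */
static long read_with_surviving_writer(void)
{
    int fds[2], dup_w;
    char byte;
    long n;

    if (pipe(fds) != 0)
        return -2;
    dup_w = dup(fds[1]);
    if (dup_w < 0) {
        close(fds[0]);
        close(fds[1]);
        return -2;
    }

    close(fds[1]);                      /* original write end closed ... */
    if (write(dup_w, "x", 1) != 1) {    /* ... but the duplicate still works */
        close(dup_w);
        close(fds[0]);
        return -2;
    }
    n = (long)read(fds[0], &byte, 1);   /* data, not EOF */
    close(dup_w);
    close(fds[0]);
    return n;
}

/* Sanity check of the two cases above. */
static void self_check(void)
{
    assert(read_after_writer_close() == 0);     /* EOF */
    assert(read_with_surviving_writer() == 1);  /* one byte of data */
}
```

In other words, an EOF seen by the reader means that the last holder of 
the write end closed it or exited; a busy but otherwise healthy writer 
that keeps the descriptor open does not by itself produce EOF.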

We examined the code of heart.erl and heart.c, and think that the 
recovery protocol needs to be tuned a bit to reduce false-positive 
restarts (though I am not so sure that this has anything to do with the 
nature of the problem).  Upon detecting an EOF on the read fd, heart 
should try to send a command to the emulator port process asking it to 
restart the heart port program.  If that write fails, only then 
restart the emulator.  Otherwise wait for some time to allow the 
emulator to restart the port program (via erlang:port_close(Port)), and 
restart the emulator only if that timer expires.  I am attaching a 
patch that implements this logic.
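
The decision described above can be sketched roughly as follows.  This 
is an illustration only and does not reproduce the attached patch or 
the real heart.c message format: the command byte, the enum, and the 
function name are all hypothetical, and the caller is assumed to arm 
the grace-period timer itself.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical outcome of the first recovery step. */
enum recovery {
    RESTART_EMULATOR_NOW,   /* write failed: the emulator side is gone */
    WAIT_FOR_PORT_RESTART   /* request sent: give the emulator a grace period */
};

/* Made-up command byte; the real heart protocol uses its own framing. */
#define HYPOTHETICAL_RESTART_REQ 'R'

/* Called after read() on the read fd returned 0 (EOF).
 * Tries to ask the emulator, over write_fd, to restart the heart port
 * program, and reports which branch of the protocol to take next. */
static enum recovery on_read_eof(int write_fd)
{
    char cmd = HYPOTHETICAL_RESTART_REQ;
    ssize_t n;

    assert(write_fd >= 0);

    /* Without this, writing to a pipe with no reader would kill this
     * process with SIGPIPE instead of failing with EPIPE. */
    signal(SIGPIPE, SIG_IGN);

    do {
        n = write(write_fd, &cmd, 1);
    } while (n < 0 && errno == EINTR);  /* retry if interrupted by a signal */

    if (n != 1)
        return RESTART_EMULATOR_NOW;

    /* The caller should now start a timer and restart the emulator
     * only if the port program has not been restarted when it expires. */
    return WAIT_FOR_PORT_RESTART;
}
```

A caller would restart the emulator immediately on 
RESTART_EMULATOR_NOW, and on WAIT_FOR_PORT_RESTART would wait for the 
grace period before falling back to the same restart.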

However, the main question of why "heart" detects a disconnect from 
Erlang is still open.

Regards,

Serge


-- 
Serge Aleynikov
Routing R&D, IDT Telecom
Tel: +1 (973) 438-3436
Fax: +1 (973) 438-1464
-------------- next part --------------
A non-text attachment was scrubbed...
Name: heart.R11B-2.patch
Type: text/x-patch
Size: 3025 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20061213/3ea2f369/attachment.bin>

