[erlang-questions] heart issue
Serge Aleynikov
serge@REDACTED
Thu Dec 14 01:14:36 CET 2006
Greetings!
I'd like to ask a question that someone might have dealt with in the
past. We have a node running with a -heart option:
drp@REDACTED: ~$ pstree -p drp
run_erl(24196)───beam(24197)─┬─heart(24236)
                             ├─inet_gethost(24243)───inet_gethost(24244)
                             └─sh(24257)
drp@REDACTED: ~$ ps auxww | grep beam
drp 24197 0.2 4.6 562480 193724 pts/2 Ssl+ 17:58 0:44
/home/drp/dripdb/erts-5.5/bin/beam -A32 -Bi -- -root /home/drp/dripdb
-progname dripdb -- -home /home/drp -boot
/home/drp/dripdb/releases/1.1/start -config
/home/drp/dripdb/releases/1.1/sys -sname drpdb -mnesia dir
"/home/drp/dripdb/var/data/mnesia" -heart -kernel dist_auto_connect once
Once in a blue moon (about once a month) the heart port program
detects that the read pipe from Erlang has closed and restarts the
emulator. The emulator shows no sign of memory pressure or any other
problem, and the number of running processes is small. The last
messages in the console (logged by run_erl) were:
...
[drpdb/drpdb:1102] Reading file "switch.txt" (3522 bytes)
[drpdb/drpdb:1114] Read 105 lines (0.0 s).
[drpdb/drpdb:1102] Reading file "dial_code.txt" (6700430 bytes)
heart: Wed Dec 13 17:58:02 2006: Erlang has closed.
Note that the "Reading file Filename" message is printed before calling
file:consult(Filename), which is CPU-intensive (parsing that file takes
15 seconds, during which CPU is at 100% on this multi-CPU machine). The
last message above is printed by the "heart" program when it detects an
EOF on the read file descriptor, after which it brutally kills and
restarts the emulator.
What may cause the read on that file descriptor used by "heart" to
return 0 (EOF)?
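For reference, here is a minimal sketch of the general POSIX behaviour
behind that question. It is not taken from heart.c, and the function
name read_after_writer_closes is made up for illustration. On a pipe,
read() returns 0 only once every write end has been closed:

```c
#include <sys/types.h>
#include <unistd.h>

/* Illustrative helper (not from heart.c): create a pipe, close its only
 * write end, and report what read() returns on the read end. */
ssize_t read_after_writer_closes(void)
{
    int fds[2];
    char buf[1];
    ssize_t n;

    if (pipe(fds) != 0)
        return -1;              /* could not create the pipe */

    close(fds[1]);              /* no writer is left on this pipe */
    n = read(fds[0], buf, sizeof buf);
    close(fds[0]);
    return n;                   /* 0 here means end-of-file */
}
```

So an EOF seen by heart would mean that the emulator side of the pipe
was closed (or the fd was otherwise invalidated); the sketch does not
say why that would happen in our case.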
We examined the code of heart.erl and heart.c, and think that the
recovery protocol needs to be tuned a bit to reduce false-positive
restarts (though I am not sure that this has anything to do with the
nature of the problem). Upon detecting an EOF on the read fd, heart
should try to send a command to the emulator port process asking it
to restart the heart port program. If that write fails, only then
restart the emulator. Otherwise wait for some time to allow the
emulator to restart the port program (via erlang:port_close(Port)), and
restart the emulator only if that timer expires. I am attaching a
patch that implements this logic.
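The decision logic described above can be sketched roughly as follows.
This is an illustration only, not the contents of the attached patch:
the names heart_on_read_eof, HYPOTHETICAL_RESTART_REQUEST and the
port_restarted callback are assumptions, as is the one-byte request
format and the polling wait.

```c
#include <assert.h>
#include <signal.h>
#include <unistd.h>

enum heart_action {
    HEART_RESTART_EMULATOR,   /* kill and restart the emulator */
    HEART_STAND_DOWN          /* emulator restarted the port program itself */
};

/* Hypothetical one-byte request; the real heart protocol may differ. */
#define HYPOTHETICAL_RESTART_REQUEST 'R'

/* Hypothetical hook: returns non-zero once the port program was restarted. */
typedef int (*port_restarted_fn)(void);

/* Stand-in callbacks, used only to exercise the sketch. */
int never_restarted(void)  { return 0; }
int always_restarted(void) { return 1; }

/* Called after read() on the heart input fd has returned 0 (EOF). */
enum heart_action heart_on_read_eof(int write_fd,
                                    port_restarted_fn port_restarted,
                                    unsigned max_polls,
                                    unsigned poll_interval_s)
{
    const char request = HYPOTHETICAL_RESTART_REQUEST;
    unsigned i;

    assert(port_restarted != NULL);

    /* A dead peer must make write() fail with EPIPE, not kill us. */
    signal(SIGPIPE, SIG_IGN);

    /* Step 1: ask the emulator to restart the heart port program. */
    if (write(write_fd, &request, 1) != 1)
        return HEART_RESTART_EMULATOR;    /* emulator is unreachable */

    /* Step 2: give the emulator a bounded amount of time to react. */
    for (i = 0; i < max_polls; i++) {
        if (port_restarted())
            return HEART_STAND_DOWN;
        sleep(poll_interval_s);
    }

    /* Step 3: the timer expired, so fall back to a restart. */
    return HEART_RESTART_EMULATOR;
}
```

With a live pipe and never_restarted the function falls through to
HEART_RESTART_EMULATOR after max_polls checks; once the peer's read end
is closed, the write fails and the restart is immediate.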
However, the main question of why "heart" detects a disconnect from
Erlang is still open.
Regards,
Serge
--
Serge Aleynikov
Routing R&D, IDT Telecom
Tel: +1 (973) 438-3436
Fax: +1 (973) 438-1464
-------------- next part --------------
A non-text attachment was scrubbed...
Name: heart.R11B-2.patch
Type: text/x-patch
Size: 3025 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20061213/3ea2f369/attachment.bin>