[erlang-questions] heart issue
Serge Aleynikov
serge@REDACTED
Thu Dec 14 02:26:16 CET 2006
I was able to reproduce this problem using the steps below (this system
doesn't have the heart patch applied that I attached in the last email).
First you need to create a 100M file "dial_code.txt" with some Erlang
terms readable by file:consult/1, and then do:
$ echo "Process started: `date`" > test.txt
$ export HEART_COMMAND="echo \`date\` >> test.txt"
$ erl -heart +A10
heart_beat_kill_pid = 16734
Erlang (BEAM) emulator version 5.5 [source] [async-threads:10] [hipe]
Eshell V5.5 (abort with ^G)
1> _ = [spawn(fun() -> file:consult("dial_code.txt") end) || I <-
lists:seq(1,50)], ok.
ok
%-- Serge: note the next line happens after about 15 minutes elapse. --
2> heart: Wed Dec 13 18:59:54 2006: Erlang has closed.
$ cat test.txt
Process started: Wed Dec 13 18:44:15 EST 2006
Wed Dec 13 18:59:55 EST 2006
$ uname -a
Linux drp01dev.corp.idt.net 2.6.9-5.ELsmp #1 SMP Wed Jan 5 19:30:39 EST
2005 i686 i686 i386 GNU/Linux
Serge
Serge Aleynikov wrote:
> Greeting!
>
> I'd like to ask a question that someone might have dealt with in the
> past. We have a node running with a -heart option:
>
> drp@REDACTED: ~$ pstree -p drp
> run_erl(24196)───beam(24197)─┬─heart(24236)
>
> в”њв”Ђinet_gethost(24243)в”Ђв”Ђв”Ђinet_gethost(24244)
> в””в”Ђsh(24257)
>
>
> drp@REDACTED: ~$ ps auxww | grep beam
> drp 24197 0.2 4.6 562480 193724 pts/2 Ssl+ 17:58 0:44
> /home/drp/dripdb/erts-5.5/bin/beam -A32 -Bi -- -root /home/drp/dripdb
> -progname dripdb -- -home /home/drp -boot
> /home/drp/dripdb/releases/1.1/start -config
> /home/drp/dripdb/releases/1.1/sys -sname drpdb -mnesia dir
> "/home/drp/dripdb/var/data/mnesia" -heart -kernel dist_auto_connect once
>
> Once out of a blue moon (once a month or so) the heart port program
> detects closing of the read pipe from Erlang and restarts the emulator.
> The emulator doesn't show any issues with memory or any other
> problems. The number of running processes is small. The last messages
> in the console (logged by run_erl) were:
>
> ...
> [drpdb/drpdb:1102] Reading file "switch.txt" (3522 bytes)
> [drpdb/drpdb:1114] Read 105 lines (0.0 s).
> [drpdb/drpdb:1102] Reading file "dial_code.txt" (6700430 bytes)
> heart: Wed Dec 13 17:58:02 2006: Erlang has closed.
>
> Note that the "Reading file Filename" message is printed before doing
> file:consult(Filename), which is CPU consuming (parsing that file takes
> 15 seconds, during which CPU is at 100% on this multi-CPU machine). The
> last message above is printed by the "heart" program detecting a EOF on
> the read file descriptor, and therefore it brutally kills and restarts
> the emulator.
>
> What may cause the read on that file descriptor used by "heart" to
> return 0 (EOF)?
>
> We examined the code of heart.erl and heart.c, and think that the
> recovery protocol needs to be tuned a bit (though I am not so sure that
> this has anything to do with the nature of the problem) to enhance the
> false positive restarts. Upon detecting an EOF on the read fd, the
> heart should try to send a command to the emulator port process asking
> to restart the heart port program. If that write fails, only then
> restart the emulator. Otherwise wait for some time to allow the
> emulator to restart the port program (via erlang:port_close(Port)), and
> upon expiration of that timer restart the emulator. I am attaching the
> patch that does this logic.
>
> However, the main question on why "heart" detects disconnect from Erlang
> is still open.
>
> Regards,
>
> Serge
>
>
>
> ------------------------------------------------------------------------
>
> Index: lib/kernel/src/heart.erl
> ===================================================================
> --- lib/kernel/src/heart.erl (revision 21)
> +++ lib/kernel/src/heart.erl (revision 22)
> @@ -37,9 +37,11 @@
> -define(CLEAR_CMD, 5).
> -define(GET_CMD, 6).
> -define(HEART_CMD, 7).
> +-define(HEART_CYCLE, 8).
>
> -define(TIMEOUT, 5000).
> -define(CYCLE_TIMEOUT, 10000).
> +-define(PORT_CLOSE_DELAY, 2500).
>
> start() ->
> case whereis(heart) of
> @@ -178,6 +180,8 @@
> {From, cycle} ->
> %% Calls back to loop
> do_cycle_port_program(From, Parent, Port, Cmd);
> + {Port, {data, [?HEART_CYCLE | _]}} ->
> + do_cycle_port_program(self(), Parent, Port, Cmd);
> {'EXIT', Parent, shutdown} ->
> no_reboot_shutdown(Port);
> {'EXIT', Parent, Reason} ->
> @@ -215,6 +219,7 @@
> %% Huh! Two heart port programs running...
> %% well, the old one has to be sick not to respond
> %% so we'll settle for the new one...
> +
> send_heart_cmd(NewPort, Cmd),
> Caller ! {heart,{error, stop_error}},
> loop(Parent,NewPort,Cmd)
> @@ -245,7 +250,13 @@
> end.
>
> %% Sends shutdown command to the port.
> -send_shutdown(Port) -> Port ! {self(), {command, [?SHUT_DOWN]}}.
> +send_shutdown(Port) ->
> + Port ! {self(), {command, [?SHUT_DOWN]}},
> + spawn(fun() ->
> + receive
> + after ?PORT_CLOSE_DELAY -> catch port_close(Port)
> + end
> + end).
>
> %% We must report using erlang:display/1 since we don't know whether
> %% there is an error_logger available or not.
> Index: erts/etc/common/heart.c
> ===================================================================
> --- erts/etc/common/heart.c (revision 21)
> +++ erts/etc/common/heart.c (revision 22)
> @@ -169,8 +169,8 @@
> #define CLEAR_CMD 5
> #define GET_CMD 6
> #define HEART_CMD 7
> +#define HEART_CYCLE 8
>
> -
> /* Maybe interesting to change */
>
> /* Times in seconds */
> @@ -216,6 +216,7 @@
> static void do_terminate(int);
> static int notify_ack(int);
> static int heart_cmd_reply(int, char *);
> +static int heart_cmd_cycle(int fd);
> static int write_message(int, struct msg *);
> static int read_message(int, struct msg *);
> static int read_skip(int, char *, int, int);
> @@ -579,10 +580,15 @@
> break;
> }
> } else if (tlen == 0) {
> - /* Erlang has closed its end */
> - print_error("Erlang has closed.");
> - return R_CLOSED;
> - }
> + /* Erlang has closed its end */
> + print_error("Erlang has closed.");
> +
> + if (heart_cmd_cycle(erlout_fd) > 0) {
> + print_error("Notifying Erlang to restart heart port.");
> + sleep(5);
> + }
> + return R_CLOSED;
> + }
> /* Junk erroneous messages */
> }
> }
> @@ -785,7 +791,23 @@
> return write_message(fd, &m);
> }
>
> +/*
> + * send a request to recycle current heart port
> + *
> + * Sends a HEART_CYCLE
> + */
> +static int
> +heart_cmd_cycle(int fd)
> +{
> + struct msg m;
>
> + /* FIXME if s >= MSG_BODY_SIZE error */
> +
> + m.op = HEART_CYCLE;
> + m.len = htons(1);
> + return write_message(fd, &m);
> +}
> +
> /*
> * send back current command
> *
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://www.erlang.org/mailman/listinfo/erlang-questions
--
Serge Aleynikov
Routing R&D, IDT Telecom
Tel: +1 (973) 438-3436
Fax: +1 (973) 438-1464
More information about the erlang-questions
mailing list