[erlang-questions] heart issue

Serge Aleynikov serge@REDACTED
Thu Dec 14 02:26:16 CET 2006


I was able to reproduce this problem using the steps below (this system 
doesn't have the heart patch applied that I attached in the last email). 
  First you need to create a 100M file "dial_code.txt" with some Erlang 
terms readable by file:consult/1, and then do:

$ echo "Process started: `date`" > test.txt
$ export HEART_COMMAND="echo \`date\` >> test.txt"

$ erl -heart +A10
heart_beat_kill_pid = 16734
Erlang (BEAM) emulator version 5.5 [source] [async-threads:10] [hipe]

Eshell V5.5  (abort with ^G)
1> _ = [spawn(fun() -> file:consult("dial_code.txt") end) || I <- 
lists:seq(1,50)], ok.
ok
%-- Serge: note the next line happens after about 15 minutes elapse. --
2> heart: Wed Dec 13 18:59:54 2006: Erlang has closed.

$ cat test.txt
Process started: Wed Dec 13 18:44:15 EST 2006
Wed Dec 13 18:59:55 EST 2006
$ uname -a
Linux drp01dev.corp.idt.net 2.6.9-5.ELsmp #1 SMP Wed Jan 5 19:30:39 EST 
2005 i686 i686 i386 GNU/Linux

Serge

Serge Aleynikov wrote:
> Greeting!
> 
> I'd like to ask a question that someone might have dealt with in the 
> past.  We have a node running with a -heart option:
> 
> drp@REDACTED: ~$ pstree -p drp
> run_erl(24196)───beam(24197)─┬─heart(24236)
>                              
> в”њв”Ђinet_gethost(24243)в”Ђв”Ђв”Ђinet_gethost(24244)
>                              в””в”Ђsh(24257)
> 
> 
> drp@REDACTED: ~$ ps auxww | grep beam
> drp      24197  0.2  4.6 562480 193724 pts/2 Ssl+ 17:58   0:44 
> /home/drp/dripdb/erts-5.5/bin/beam -A32 -Bi -- -root /home/drp/dripdb 
> -progname dripdb -- -home /home/drp -boot 
> /home/drp/dripdb/releases/1.1/start -config 
> /home/drp/dripdb/releases/1.1/sys -sname drpdb -mnesia dir 
> "/home/drp/dripdb/var/data/mnesia" -heart -kernel dist_auto_connect once
> 
> Once out of a blue moon (once a month or so) the heart port program 
> detects closing of the read pipe from Erlang and restarts the emulator. 
>  The emulator doesn't show any issues with memory or any other 
> problems.   The number of running processes is small. The last messages 
> in the console (logged by run_erl) were:
> 
> ...
> [drpdb/drpdb:1102] Reading file "switch.txt" (3522 bytes)
> [drpdb/drpdb:1114] Read 105 lines (0.0 s).
> [drpdb/drpdb:1102] Reading file "dial_code.txt" (6700430 bytes)
> heart: Wed Dec 13 17:58:02 2006: Erlang has closed.
> 
> Note that the "Reading file Filename" message is printed before doing 
> file:consult(Filename), which is CPU consuming (parsing that file takes 
> 15 seconds, during which CPU is at 100% on this multi-CPU machine).  The 
> last message above is printed by the "heart" program detecting a EOF on 
> the read file descriptor, and therefore it brutally kills and restarts 
> the emulator.
> 
> What may cause the read on that file descriptor used by "heart" to 
> return 0 (EOF)?
> 
> We examined the code of heart.erl and heart.c, and think that the 
> recovery protocol needs to be tuned a bit (though I am not so sure that 
> this has anything to do with the nature of the problem) to enhance the 
> false positive restarts.  Upon detecting an EOF on the read fd, the 
> heart should try to send a command to the emulator port process asking 
> to restart the heart port program.  If that write fails, only then 
> restart the emulator.  Otherwise wait for some time to allow the 
> emulator to restart the port program (via erlang:port_close(Port)), and 
> upon expiration of that timer restart the emulator.  I am attaching the 
> patch that does this logic.
> 
> However, the main question on why "heart" detects disconnect from Erlang 
> is still open.
> 
> Regards,
> 
> Serge
> 
> 
> 
> ------------------------------------------------------------------------
> 
> Index: lib/kernel/src/heart.erl
> ===================================================================
> --- lib/kernel/src/heart.erl	(revision 21)
> +++ lib/kernel/src/heart.erl	(revision 22)
> @@ -37,9 +37,11 @@
>  -define(CLEAR_CMD, 5).
>  -define(GET_CMD, 6).
>  -define(HEART_CMD, 7).
> +-define(HEART_CYCLE, 8).
>  
>  -define(TIMEOUT, 5000).
>  -define(CYCLE_TIMEOUT, 10000).
> +-define(PORT_CLOSE_DELAY, 2500).
>  
>  start() ->
>      case whereis(heart) of
> @@ -178,6 +180,8 @@
>  	{From, cycle} ->
>  	    %% Calls back to loop
>  	    do_cycle_port_program(From, Parent, Port, Cmd);  
> +    {Port, {data, [?HEART_CYCLE | _]}} ->
> +	    do_cycle_port_program(self(), Parent, Port, Cmd);  
>  	{'EXIT', Parent, shutdown} ->
>  	    no_reboot_shutdown(Port);
>  	{'EXIT', Parent, Reason} ->
> @@ -215,6 +219,7 @@
>  		    %% Huh! Two heart port programs running...
>  		    %% well, the old one has to be sick not to respond
>  		    %% so we'll settle for the new one...
> +            
>  		    send_heart_cmd(NewPort, Cmd),
>  		    Caller ! {heart,{error, stop_error}},
>  		    loop(Parent,NewPort,Cmd)
> @@ -245,7 +250,13 @@
>      end.
>  
>  %% Sends shutdown command to the port.
> -send_shutdown(Port) -> Port ! {self(), {command, [?SHUT_DOWN]}}.
> +send_shutdown(Port) -> 
> +    Port ! {self(), {command, [?SHUT_DOWN]}},
> +    spawn(fun() -> 
> +              receive 
> +              after ?PORT_CLOSE_DELAY -> catch port_close(Port) 
> +              end 
> +          end).
>  
>  %% We must report using erlang:display/1 since we don't know whether
>  %% there is an error_logger available or not.
> Index: erts/etc/common/heart.c
> ===================================================================
> --- erts/etc/common/heart.c	(revision 21)
> +++ erts/etc/common/heart.c	(revision 22)
> @@ -169,8 +169,8 @@
>  #define  CLEAR_CMD    5
>  #define  GET_CMD      6
>  #define  HEART_CMD    7
> +#define  HEART_CYCLE  8
>  
> -
>  /*  Maybe interesting to change */
>  
>  /* Times in seconds */
> @@ -216,6 +216,7 @@
>  static void do_terminate(int);
>  static int notify_ack(int);
>  static int heart_cmd_reply(int, char *);
> +static int heart_cmd_cycle(int fd);
>  static int write_message(int, struct msg *);
>  static int read_message(int, struct msg *);
>  static int read_skip(int, char *, int, int);
> @@ -579,10 +580,15 @@
>  				break;
>  			}
>  		} else if (tlen == 0) {
> -		/* Erlang has closed its end */
> -		print_error("Erlang has closed.");
> -		return R_CLOSED;
> -    }
> +		    /* Erlang has closed its end */
> +		    print_error("Erlang has closed.");
> +            
> +			if (heart_cmd_cycle(erlout_fd) > 0) {
> +				print_error("Notifying Erlang to restart heart port.");
> +				sleep(5);
> +			}
> +			return R_CLOSED;
> +		}
>  		/* Junk erroneous messages */
>      }
>    }
> @@ -785,7 +791,23 @@
>    return write_message(fd, &m);
>  }
>  
> +/*
> + * send a request to recycle current heart port
> + *
> + * Sends a HEART_CYCLE
> + */
> +static int
> +heart_cmd_cycle(int fd)
> +{
> +  struct msg m;
>  
> +  /* FIXME if s >= MSG_BODY_SIZE error */
> +
> +  m.op = HEART_CYCLE;
> +  m.len = htons(1);
> +  return write_message(fd, &m);
> +}
> +
>  /*
>   * send back current command
>   *
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://www.erlang.org/mailman/listinfo/erlang-questions

-- 
Serge Aleynikov
Routing R&D, IDT Telecom
Tel: +1 (973) 438-3436
Fax: +1 (973) 438-1464



More information about the erlang-questions mailing list