heart anyone?

Martin Bjorklund mbj@REDACTED
Thu Oct 23 12:14:13 CEST 2003


Francesco Cesarini <francesco@REDACTED> wrote:
> 
> 
> Dear Erlangers,
> 
> before I head off to reinvent the wheel, I was wondering if anyone has
> implemented their own version of heart. What we are looking for is a 
> behaviour similar to the supervisor one, where you can allow a maximum 
> of X local restarts of the beam emulator in Y seconds. Possibly after X 
> restarts, the OS is rebooted, Erlang started up again, and if it
> crashes, the whole system just stops. Needless to say, cyclic restarts
> have been a problem.
> 
> Any other thoughts on heart are welcome. Past problems, praises, horror
> stories, et al.

Yes, we do this, but there's no need to hack heart.  Instead we do
what we have to do in the HEART_COMMAND, i.e. the command that heart
calls when the node goes down.  This command is a shell script which
first checks if we've rebooted too many times, and if so gives up.
Otherwise Erlang is started.
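
For context, a minimal sketch of how such a script could be wired in,
assuming the node is started with "erl -heart". The function name
start_node, the script path and the node name are made up for
illustration; only the -heart flag and the HEART_COMMAND environment
variable come from Erlang itself.

```shell
# Hypothetical helper: start a node under heart.
#   $1  path of the script heart should run when the node goes down
#   $2  short node name
start_node() {
  # heart reads HEART_COMMAND from the environment, so it has to be
  # exported before the emulator is started.
  HEART_COMMAND=$1
  export HEART_COMMAND
  erl -heart -detached -sname "$2"
}
```

It would be called as, say, "start_node /opt/myapp/bin/restart.sh myapp",
where restart.sh is the script that does the reboot check and then
starts Erlang again.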

It shouldn't be difficult to also do an OS reboot here.
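
One way the OS reboot could be slotted in, as a rough sketch and not
something we actually run: once the local restarts are used up, reboot
the OS once, and stop for good if we end up in the same place again.
The function name next_action and the marker file are hypothetical.

```shell
# Decide what to do once the local restarts are exhausted.
#   $1  marker file recording that an OS reboot has already been tried
# Prints "reboot-os" the first time and "stop" on later calls.
next_action() {
  marker=$1
  if [ -f "$marker" ]; then
    # The OS reboot did not help, so give up.
    echo stop
  else
    : > "$marker"
    echo reboot-os
  fi
}
```

The caller would run the platform's reboot command on "reboot-os" and
simply exit on "stop". The marker has to live somewhere that survives a
reboot, and it must be removed again once the node has stayed up long
enough, otherwise the next crash will stop the system straight away.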


Here's the interesting part of that script:

#
# Execute this function to make
# sure that we don't get into a reboot cycle.  If we
# reboot more than 6 times, each time less than 20
# minutes since the last reboot, we give up, and
# don't try to reboot again.
# Use this function only if heart is used and we're
# rebooting on the permanent release.
# In case of a safe restart (i.e. not in the reboot
# interval) remove the reboot file used by isdstart.
#
check_reboot() {
  Restarts=0;
  Timestamp=`timestamp`;
  Month=`month`;
  ExternalRebootFile=$dir/reboot.isd

  if [ -w $RebootFile ]; then
    LastTimestamp=`awk '{print $1}' $RebootFile`;
    LastMonth=`awk '{print $2}' $RebootFile`;
    Restarts=`awk '{print $3}' $RebootFile`;
    Diff=`expr $Timestamp - $LastTimestamp`;
    if [ "$Month" = "$LastMonth" ]; then
      if [ $Diff -lt $Timespan ]; then
        # We rebooted too early
        if [ $Restarts -ge $MaxRestarts ]; then
          # We rebooted too early too many times - give up
          echo "`date`: Too many reboots - giving up" >> $LogFile;
          # Keep RebootFile, as we would otherwise remove
          # ExternalRebootFile the next time we are started.
          echo "$Timestamp $Month 0" > $RebootFile;
          exit 0;
        fi
        Restarts=`expr $Restarts + 1`;
      else
        Restarts=0;
      fi
    else
      Restarts=0;
    fi
  fi
  if [ $Restarts = 0 -a -w $ExternalRebootFile ]; then
    # We rebooted outside the reboot interval
    rm $ExternalRebootFile;
  fi
  echo "$Timestamp $Month $Restarts" > $RebootFile;
  exit 1;
}

timestamp() {
  D=`date '+%d'`;
  H=`date '+%H'`;
  M=`date '+%M'`;
  expr $D \* 1440 + $H \* 60 + $M
}

month() {
  date '+%y%m'
}
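
The excerpt relies on RebootFile, Timespan, MaxRestarts, LogFile and dir
being set elsewhere in the script. For anyone who wants to try the idea
in isolation, here is a cut-down variant, not the script above: it takes
everything as arguments, counts in seconds so the month handling goes
away, and returns a status instead of calling exit. The name
restart_allowed and the state file layout are assumptions made for this
sketch.

```shell
# Print "start" if another restart is allowed, "give-up" otherwise.
#   $1  state file, holding "<time of last restart> <quick restarts so far>"
#   $2  current time in seconds
#   $3  maximum number of quick restarts
#   $4  window in seconds; a restart sooner than this counts as "quick"
restart_allowed() {
  state=$1
  now=$2
  max=$3
  span=$4
  restarts=0
  if [ -r "$state" ]; then
    read -r last count < "$state"
    if [ $((now - last)) -lt "$span" ]; then
      # We restarted too early
      if [ "$count" -ge "$max" ]; then
        # Too early, too many times: reset the counter and give up
        echo "$now 0" > "$state"
        echo give-up
        return 1
      fi
      restarts=$((count + 1))
    fi
  fi
  echo "$now $restarts" > "$state"
  echo start
}
```

With the limits from the comment above, a call could look like
"restart_allowed /var/tmp/restart.state `date '+%s'` 6 1200". Note that
'+%s' is not in POSIX date, although most current systems have it.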





/martin


