recording process crash (in supervisor?)

Fri Sep 30 06:41:16 CEST 2005

On Thu, Sep 29, 2005 at 07:59:23AM -0400, Serge Aleynikov wrote:
> Rick,
> 
> Even though you don't seem to favor the addition of another event 
> handler, that is pretty much the only approach of getting custom 
> handling of crash reports.
> 
> As you correctly pointed out when there is a process crash, a supervisor 
> calls error_logger:error_report/2, which indeed is the candidate for a 
> custom callback.  Such a handler is very simple to implement (see 
> stdlib's error_logger_tty_h.erl).
> 
> What you can do is add another child process to the supervisor of 
> interest that calls 
> gen_event:add_sup_handler(error_logger, YourHandler, Args).  The 
> presence of the child process (with appropriate {gen_event_EXIT, 
> YourHandler, _} message monitoring) will reinstall this handler in case 
> of crashes.

I am playing with a toy gen_event module which will serve as my handler. Here
is the handle_event/2 callback:

  %%--------------------------------------------------------------------
  %% Func: handle_event/2
  %% Returns: {ok, State}                                |
  %%          {swap_handler, Args1, State1, Mod2, Args2} |
  %%          remove_handler                              
  %%--------------------------------------------------------------------
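  %% The second list element holds reports for linked neighbour processes;
  %% it is ignored here so that crashes with neighbours also match.
  handle_event({error_report,_,{_,crash_report,[DataList,_Neighbours]}},State) ->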
    {value,{registered_name,RegisteredName}} =
      lists:keysearch(registered_name,1,DataList),
    {value,{error_info,ErrorInfo}} =
      lists:keysearch(error_info,1,DataList),
    io:format("event_h: process ~p crashed w/reason ~p~n",
              [RegisteredName, ErrorInfo]),
    {ok,State};
  handle_event(Event,State) ->
    io:format("event_h: recv'd event: ~p~n", [Event]),
    {ok,State}.

Now, when a process crashes this toy gen_event handler writes something like:

  "event_h: process foo crashed w/reason dunno\n"

Having to dig up the (undocumented?) format of the crash_report message (and
count on it not changing across releases) troubles me (duh). Surely what I
did must be "wrong", no?
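
A minimal sketch of one way to soften the dependence on the report format:
pull the fields out defensively so that a missing key yields 'undefined'
instead of crashing the handler. The module name crash_fields and the
function extract/1 are hypothetical, and the tuple shape is only what was
observed above, not a documented contract.

```erlang
-module(crash_fields).
-export([extract/1]).

%% extract/1 takes an error_logger event and returns
%% {crash, Name, ErrorInfo} for crash reports, or 'ignore' otherwise.
%% The event shape matched here is the one observed in this thread
%% (an assumption, not a documented format).
extract({error_report, _GL, {_Pid, crash_report, [OwnReport | _]}})
  when is_list(OwnReport) ->
    %% proplists:get_value/2 returns 'undefined' for a missing key.
    Name = proplists:get_value(registered_name, OwnReport),
    Info = proplists:get_value(error_info, OwnReport),
    {crash, Name, Info};
extract(_Other) ->
    ignore.
```

handle_event/2 could then call crash_fields:extract(Event) and act only on
the {crash, Name, Info} result, keeping the format knowledge in one place.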

Also, it now seems clear that I need _two_ processes in addition to the 
supervisor to do a job that a simple supervisor callback could do just as 
well--one child/worker process (to invoke gen_event:add_sup_handler/3) and
the actual gen_event handler process to receive and process error_logger
messages.
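
For reference, a rough sketch of the child/worker described above, written
as a gen_server. The module name event_h_guard is hypothetical, event_h is
assumed to be the toy handler module shown earlier, and the error_logger
event manager is assumed to be running and registered.

```erlang
-module(event_h_guard).
-behaviour(gen_server).

-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
    %% Ties the handler to this process: if the handler goes away,
    %% this process receives a gen_event_EXIT message.
    %% Assumes error_logger is running and event_h is the handler module.
    ok = gen_event:add_sup_handler(error_logger, event_h, []),
    {ok, nostate}.

handle_call(_Req, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info({gen_event_EXIT, event_h, Reason}, State) ->
    %% The handler was removed or crashed. Stopping here lets the
    %% supervisor restart this process, and init/1 reinstalls the handler.
    %% Note that this sketch also stops on a deliberate removal.
    {stop, {handler_exit, Reason}, State};
handle_info(_Other, State) ->
    {noreply, State}.

terminate(_Reason, _State) ->
    ok.

code_change(_OldVsn, State, _Extra) ->
    {ok, State}.
```

This would be listed as a permanent worker in the supervisor's child
specification, so a handler crash turns into an ordinary child restart.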

Am I doing something unconventional here (i.e. processing/recording process
crash info)? It seems like there should be an easier way. It also seems as
though my error_logger handler, which only really cares about crash_report
information, is going to have to "ignore" a whole lot of other messages which
a supervisor (callback/handler) wouldn't even see--this seems needlessly
inefficient.

The last point about efficiency almost makes the following (from my original
post) look good (or at least better than it did):

    One approach which I have seen work but which seems cumbersome and
    unnecessary involved adding an additional process, under the top-level
    supervisor, with which all other application processes registered
    by name (at which time monitor/2 and/or link/1 were called). This
    additional process then listened for EXIT signals from registered
    processes and recorded their crash info. Since the supervisor is
    already set up to receive all the crash info, adding another process to
    duplicate the functionality seemed silly to me.

It now seems less silly to have this additional process
link-to-and-trap-exits-for its siblings than it does to have it install a
third process to process *all* error_logger messages and pick out the crash
reports.
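
A minimal sketch of that additional process, using monitors rather than
links. The module name crash_recorder, the watch/1 call, and the table name
crash_log are all hypothetical; the sketch assumes a public, named ETS table
crash_log of type bag has already been created by a longer-lived process
such as the top-level supervisor.

```erlang
-module(crash_recorder).
-behaviour(gen_server).

-export([start_link/0, watch/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Each sibling calls this once, e.g. from its own init/1.
watch(Name) ->
    gen_server:call(?MODULE, {watch, Name, self()}).

init([]) ->
    %% State is a list of {MonitorRef, Name} pairs.
    {ok, []}.

handle_call({watch, Name, Pid}, _From, Watched) ->
    Ref = erlang:monitor(process, Pid),
    {reply, ok, [{Ref, Name} | Watched]};
handle_call(_Req, _From, Watched) ->
    {reply, {error, unknown_request}, Watched}.

handle_cast(_Msg, Watched) ->
    {noreply, Watched}.

handle_info({'DOWN', Ref, process, _Pid, Reason}, Watched) ->
    case lists:keysearch(Ref, 1, Watched) of
        {value, {Ref, Name}} ->
            record(Name, Reason),
            {noreply, lists:keydelete(Ref, 1, Watched)};
        false ->
            {noreply, Watched}
    end;
handle_info(_Other, Watched) ->
    {noreply, Watched}.

terminate(_Reason, _Watched) ->
    ok.

code_change(_OldVsn, Watched, _Extra) ->
    {ok, Watched}.

%% Clean exits are not recorded; everything else goes into the
%% (assumed, pre-existing) crash_log table with a local timestamp.
record(_Name, normal) ->
    ok;
record(_Name, shutdown) ->
    ok;
record(Name, Reason) ->
    ets:insert(crash_log, {Name, calendar:local_time(), Reason}),
    ok.
```

With this in place the recorder sees only exits of processes that asked to
be watched, so nothing has to filter unrelated error_logger traffic.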

Someone feel free to smack me if I still "don't get it".

-Rick

> What puzzles me about this last approach is that neither error_logger nor 
> SASL use supervised handlers for event reporting to screen.  This raises 
> a rhetorical question: if the implementation code is 100% correct, does 
> it mean that the process running this code doesn't require a supervisor? 
> Perhaps someone on the list can share his/her perception on this...
> 
> Serge
> 
> P.S. In a couple of weeks I am planning to make a contribution (LAMA - 
> Log and Alarm MAnager) that will demonstrate the use of this principle 
> for sending all error reports and alarms to syslog / snmp manager.
> 
> 
> Rick Pettit wrote:
> >I want to record application process crash info
> >(proc_name/date/time/reason) in an ETS table which persists as long as the
> >top-level supervisor remains alive. I realize I need to create the ETS
> >table from the supervisor in order to ensure it persists past all other
> >application process crashes.
> >
> >What I don't know is if/where there is a hook for recording such
> >information from the supervisor. I don't see any supervisor callback which
> >would allow for recording of process crash info.
> >
> >I see supervisor.erl in stdlib appears to log this information to the
> >error_logger (when reason is not normal|shutdown):
> >
> >  do_restart(permanent, Reason, Child, State) ->
> >      report_error(child_terminated, Reason, Child, State#state.name),
> >      restart(Child, State);
> >  do_restart(_, normal, Child, State) ->
> >      NState = state_del_child(Child, State),
> >      {ok, NState};
> >  do_restart(_, shutdown, Child, State) ->
> >      NState = state_del_child(Child, State),
> >      {ok, NState};
> >  do_restart(transient, Reason, Child, State) ->
> >      report_error(child_terminated, Reason, Child, State#state.name),
> >      restart(Child, State);
> >  do_restart(temporary, Reason, Child, State) ->
> >      report_error(child_terminated, Reason, Child, State#state.name),
> >      NState = state_del_child(Child, State),
> >      {ok, NState}.
> >  ...
> >  ...
> >  ...
> >
> >  report_error(Error, Reason, Child, SupName) ->
> >      ErrorMsg = [{supervisor, SupName},
> >                  {errorContext, Error},
> >                  {reason, Reason},
> >                  {offender, extract_child(Child)}],
> >      error_logger:error_report(supervisor_report, ErrorMsg).
> >
> >If I want to process crash information (name/date/time/reason) when
> >application processes crash is the convention to install a custom handler
> >via error_logger:add_report_handler/[12]?
> >
> >My knee jerk reaction is that it would be awfully nice if the supervisor
> >behaviour simply provided a callback for processing process crash info. The
> >callback could even be spawn'd if risk of crashing the supervisor in the
> >handler was a concern.
> >
> >Thanks for wading through the rambling--any comments/suggestions are much
> >appreciated.
> >
> >-Rick