[erlang-questions] question about supervisor (fail to restart one of the worker)

Sun Mar 30 05:33:23 CEST 2008

Hi, all,

I hope I can explain my question clearly enough.

I am working on a solution in which a supervisor will monitor a number of
generic workers.

The worker answers most messages, but it may exit under certain conditions.
I want the supervisor to kill all existing workers and restart them whenever
any worker fails, hence the use of "all_for_one".

I have tried to illustrate the problem with a simplified version of the
implementation. See the attached files.

sup.erl - the supervisor
worker.erl - the worker process, implemented using gen_server. Whenever it
receives the message "stupid_question", it exits.

Because I want to have several instances of the worker:

1) I used an auxiliary function to create an ad-hoc server name for each
instance (process):

start(Name) ->
 ServerName = get_worker_id(Name),
 gen_server:start({local, ServerName}, ?MODULE, [ServerName], []).

get_worker_id(Id) when is_integer(Id) ->
  list_to_atom(?SERVER_FAMILY ++ integer_to_list(Id));

2) I construct the following child process specs in my supervisor:

{ok, {{all_for_one, 3, 10},
[{w1,{worker,start_link,[1]},
     permanent,brutal_kill,worker,
     [worker]},
 {w2,{worker,start_link,[2]},
     permanent,brutal_kill,worker,
     [worker]},
 {w3,{worker,start_link,[3]},
     permanent,brutal_kill,worker,
     [worker]},
 {w4,{worker,start_link,[4]},
     permanent,brutal_kill,worker,
     [worker]}]
}}.
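
For reference, the four near-identical child specs above could be generated
from a list. This is only a minimal sketch, not the attached sup.erl: the
module name sup_sketch and the helper child/1 are hypothetical. Note one
deliberate difference: the restart strategy documented for OTP supervisors
that restarts every child when one of them fails is spelled one_for_all, so
the sketch uses that atom rather than the all_for_one written above.

```erlang
-module(sup_sketch).
-behaviour(supervisor).
-export([start_link/0, init/1]).

%% Hypothetical entry point; registers the supervisor under the module name.
start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Builds the same four child specs as in the post (w1..w4).
%% Assumption: one_for_all is used here because it is the strategy name
%% documented for the supervisor behaviour.
init([]) ->
    Children = [child(N) || N <- lists:seq(1, 4)],
    {ok, {{one_for_all, 3, 10}, Children}}.

%% Hypothetical helper: one child spec tuple per worker number.
child(N) ->
    Id = list_to_atom("w" ++ integer_to_list(N)),
    {Id, {worker, start_link, [N]}, permanent, brutal_kill, worker, [worker]}.
```

According to the supervisor documentation, the start function named in a
child spec (here worker:start_link/1) is expected to create and link to the
child process and to return {ok, Pid}.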

I have defined two test functions in sup.erl to facilitate testing (test1
and test2). *test1* is for testing from a bash shell. *test2* is for testing
from within erl.

If I run test1 like this,

erl -pa . -boot start_sasl -s sup test1  -run init stop -config log -noshell

I get the following errors:

in start/0
superviosr PID: <0.37.0>
Asking worker_1 a {good_question}
in worker:start_link/1. Param worker_1
reply: {answer}
Asking worker_1 a {stupid_question}
signal {noproc,{gen_server,call,
                           [worker_1,{ask_something,{stupid_question}}]}}
Asking worker_1 a {good_question}
{"init terminating in
do_boot",{noproc,{gen_server,call,[worker_1,{ask_something,{good_question}}]}}}

Crash dump was written to: erl_crash.dump
init terminating in do_boot ()

So, from what I can work out, once the worker process has died after getting
a stupid_question, the supervisor does not restart it at all. That is why I
get a noproc exception when worker_1 is asked a good_question again.
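
One way to check this from the erl shell is to look at what the supervisor
believes it is running and at whether the worker name is still registered.
The following is a minimal sketch; the module name sup_check_sketch and the
function status/2 are hypothetical, and it only uses
supervisor:which_children/1 and whereis/1.

```erlang
-module(sup_check_sketch).
-export([status/2]).

%% Sup is the supervisor's pid or registered name; WorkerName is a
%% registered worker name such as worker_1.
status(Sup, WorkerName) ->
    %% List of {Id, Child, Type, Modules} tuples known to the supervisor.
    Children = supervisor:which_children(Sup),
    %% whereis/1 returns undefined when no process holds the name.
    Registered = whereis(WorkerName),
    {Children, Registered}.
```

Calling sup_check_sketch:status(SupPid, worker_1) before and after sending
the stupid_question should show whether the child list or the registration
changes.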

If I change the test case to use worker 2 instead of worker 1, it becomes
obvious that workers 2 to 4 are not started at all.

So, the questions I want to ask are:
1) What mistakes have I made in this code?
2) Why is worker 1 not restarted?
3) Why are workers 2 to 4 not started at all?
4) The worker instances and gen_server: I used list_to_atom() to create a
unique registered name for each worker process. Is this a valid approach?
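
To make question 4 concrete, here is a self-contained sketch of the naming
helper. The module name worker_name_sketch is hypothetical, and the value of
?SERVER_FAMILY is an assumption: "worker_" is inferred from the name worker_1
that appears in the output above.

```erlang
-module(worker_name_sketch).
-export([get_worker_id/1]).

%% Assumption: the prefix is "worker_", inferred from the log output.
-define(SERVER_FAMILY, "worker_").

%% Builds the registered name for worker number Id, e.g. 1 -> worker_1.
get_worker_id(Id) when is_integer(Id) ->
    list_to_atom(?SERVER_FAMILY ++ integer_to_list(Id)).
```

With this, worker_name_sketch:get_worker_id(1) returns the atom worker_1,
which is the name passed to gen_server:call/2 in the test.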

Cheers,

Anthony
-- 
/*--*/
Don't EVER make the mistake that you can design something better than what
you get from ruthless massively parallel trial-and-error with a feedback
cycle. That's giving your intelligence _much_ too much credit.

- Linus Torvalds
-------------- next part --------------
A non-text attachment was scrubbed...
Name: worker.erl
Type: application/octet-stream
Size: 1604 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20080330/ea50af15/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: log.config
Type: application/xml
Size: 349 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20080330/ea50af15/attachment.wsdl>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sup.erl
Type: application/octet-stream
Size: 1688 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20080330/ea50af15/attachment-0001.obj>