[erlang-questions] How to identify the process which died and restart it?
zxq9
zxq9@REDACTED
Thu Jan 15 09:51:58 CET 2015
On Wednesday, 14 January 2015 05:52:41 Harit Himanshu wrote:
> Yes, because I am not on the chapter which teaches about it yet. My
> understanding is that the same semantics could be reached without using it
> as well.
>
> Having said that, I do believe supervisor would be the best solution, but
> this problem has to be solved without Supervisor. I guess the exercise is
> given to understand how Supervisor works
I'm working on a (very rough, still very immature) intermediate text that
starts with a raw, relatively featureless, non-OTP codebase -- which might
illustrate what you want to know.
First, understand the "genesis" function here:
https://github.com/zxq9/erlmud/blob/master/erlmud-0.1/locman.erl#L33
It uses a list comprehension to pack a list of tuples like {Id, Pid},
where Id is the location's id within the game (usually a coordinate, but it
could be any term) and the Pid belongs to the location process, which is
linked to the manager. This managing process is trapping exits, so it will
receive {'EXIT', Pid, Reason} messages when any one of the locations dies.
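The idea can be sketched in a few lines. This is a minimal illustration, not
the erlmud code: the module name genesis_sketch, the location/1 stub, and the
{Id, Data} shape of the conf entries are all assumptions made for the example.

```erlang
-module(genesis_sketch).
-export([genesis/1, location/1]).

%% Call this from the process that is to act as manager.
%% Confs is assumed to be a list of {Id, Data} tuples (hypothetical shape).
genesis(Confs) ->
    %% Trap exits so a dying location arrives as a message
    %% instead of killing the manager.
    process_flag(trap_exit, true),
    %% One linked process per conf entry, packed as {Id, Pid}.
    [{Id, spawn_link(?MODULE, location, [Conf])} || {Id, _} = Conf <- Confs].

%% Stand-in for a real location process (hypothetical).
location(_Conf) ->
    receive
        stop  -> ok;
        crash -> exit(crashed)
    end.
```

After calling genesis_sketch:genesis([{a, x}, {b, y}]), sending crash to one
of the returned pids should put {'EXIT', Pid, crashed} in the caller's
mailbox.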
Second, check the receive clause here:
https://github.com/zxq9/erlmud/blob/master/erlmud-0.1/locman.erl#L44-L46
Note the receive clause above it, which matches on this process's manager
(these are really *much* smaller supervisor/worker pairs in the OTP version of
this, so "manager" equates to a supervisor). If an exit message does not come
from that manager (the parent), then we must check if it is from one of the
processes in the Live list that genesis/1 packed during initialization. That
check happens in the handle_exit/3 function here:
https://github.com/zxq9/erlmud/blob/master/erlmud-0.1/locman.erl#L68-L80
First we check if the Pid is present in the list of processes we think are
live. If it is, we know a game location just died and needs to be restarted.
We scrub the now-dead entry from the live list, get the configuration data
for that location ID from the list of location conf entries (which may have
changed since the room was last started), restart the room, add its
{ID, Pid} tuple back to the list, and return to our loop.
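Put together, the loop and the exit handling might look roughly like this.
Again, this is only a sketch under assumptions, not the code at the links
above: the module name locman_sketch, the location/1 stub, the {Id, Data}
conf shape, and the lookup message (added only so the result can be
inspected) are all made up for the example.

```erlang
-module(locman_sketch).
-export([start/1, init/2, location/1]).

%% The caller becomes the parent ("manager") of this process.
start(Confs) ->
    spawn_link(?MODULE, init, [self(), Confs]).

init(Parent, Confs) ->
    process_flag(trap_exit, true),
    Live = [{Id, spawn_link(?MODULE, location, [Conf])}
            || {Id, _} = Conf <- Confs],
    loop(Parent, Confs, Live).

loop(Parent, Confs, Live) ->
    receive
        %% Parent is already bound, so this clause only matches its exit.
        {'EXIT', Parent, Reason} ->
            exit(Reason);
        %% Any other exit might be one of our locations.
        {'EXIT', Pid, _Reason} ->
            loop(Parent, Confs, handle_exit(Pid, Confs, Live));
        %% Hypothetical query, here only to make the sketch testable.
        {lookup, From, Id} ->
            From ! {location, Id, lists:keyfind(Id, 1, Live)},
            loop(Parent, Confs, Live)
    end.

handle_exit(Pid, Confs, Live) ->
    case lists:keyfind(Pid, 2, Live) of
        {Id, Pid} ->
            %% Scrub the dead entry, fetch the current conf, restart.
            Scrubbed = lists:keydelete(Pid, 2, Live),
            Conf = lists:keyfind(Id, 1, Confs),
            NewPid = spawn_link(?MODULE, location, [Conf]),
            [{Id, NewPid} | Scrubbed];
        false ->
            %% Not one of ours; leave the list alone.
            Live
    end.

%% Stand-in for a real location process (hypothetical).
location(_Conf) ->
    receive stop -> ok end.
```

Killing a location with exit(Pid, kill) and then sending a lookup for its Id
should return a different, live pid under the same Id.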
So the trick is all in how you interpret the exit message, and how you keep
track of your linked or monitored processes. A very similar process happens in
wayman, here:
https://github.com/zxq9/erlmud/blob/master/erlmud-0.1/wayman.erl#L75-L85
But note that wayman is monitoring, not linking. This is because the locations
are linked to their own ways, so the wayman is more of an ad hoc registry
(which is completely unnecessary in the OTP version of this project) than a
real supervisor. Because it maintains monitors, not links, it receives 'DOWN'
messages instead of 'EXIT' messages, and checks its registry of live ways
from there.
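A monitor-based registry can be sketched the same way. This is not the wayman
code: the module name wayman_sketch and the register and list messages are
assumptions for the example, and the sketch simply forgets a way that goes
down. Whether to restart, drop, or notify at that point is a design decision
it does not try to settle.

```erlang
-module(wayman_sketch).
-export([start/0, init/0]).

start() ->
    spawn(?MODULE, init, []).

init() ->
    loop([]).

%% Ways is a list of {Id, Pid, MonitorRef} tuples.
loop(Ways) ->
    receive
        %% Hypothetical registration message.
        {register, Id, Pid} ->
            Ref = erlang:monitor(process, Pid),
            loop([{Id, Pid, Ref} | Ways]);
        %% A monitored process died. No trap_exit is needed for this,
        %% and the registry itself is never affected by the death.
        {'DOWN', Ref, process, _Pid, _Reason} ->
            loop(lists:keydelete(Ref, 3, Ways));
        %% Hypothetical query, here only to make the sketch testable.
        {list, From} ->
            From ! {ways, [{Id, Pid} || {Id, Pid, _Ref} <- Ways]},
            loop(Ways)
    end.
```

The practical difference from the linked version is direction: a monitor is
one-way, so the registry learns about the death without being tied to it,
and the monitored process is never affected if the registry dies.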
Hopefully this sheds a bit of light on how this sort of thing can be handled
manually.
In 99% of cases you really will want to use supervisors, and think through how
you can best restore a stable, known state to things that die based on the
facilities OTP provides, rather than go to the effort of writing supervisory
processes by hand all the time. The basic problem is that even if your hand-written
supervisors are perfect all the time, they essentially wind up representing
boilerplate that is often larger than the part of your code that actually
solves user problems! Hand-written supervisors are also a tremendously
tempting place to put "just one more feature" that supervisors shouldn't have
to begin with -- thus increasing the chance your supervisors will themselves
crash, potentially threatening your "crash kernel" (or at least bringing real
disaster one step closer to it).
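For comparison, the same restart behavior under an OTP supervisor might be
sketched as below. This is an assumption-laden illustration, not the OTP
version of erlmud: the module name loc_sup_sketch, the location/1 stub, the
{Id, Data} conf shape, and the restart intensity numbers are all made up.
The map-style flags and child specs need a reasonably recent OTP release.

```erlang
-module(loc_sup_sketch).
-behaviour(supervisor).
-export([start_link/1, init/1, start_location/1, location/1]).

start_link(Confs) ->
    supervisor:start_link(?MODULE, Confs).

%% supervisor callback: one permanent child per conf entry.
init(Confs) ->
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    Children = [#{id => Id,
                  start => {?MODULE, start_location, [Conf]},
                  restart => permanent}
                || {Id, _} = Conf <- Confs],
    {ok, {SupFlags, Children}}.

%% The start function must link the child and return {ok, Pid}.
start_location(Conf) ->
    {ok, spawn_link(?MODULE, location, [Conf])}.

%% Stand-in for a real location process (hypothetical). A real one
%% would more likely be a gen_server or started through proc_lib.
location(_Conf) ->
    receive stop -> ok end.
```

All the bookkeeping from the hand-written manager (the live list, the exit
matching, the restart) is gone; what remains is a description of what should
be running.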
Please forgive the incomplete, convoluted, and poorly explained nature of the
code and the project -- I'm nowhere near done with it (if only I had the
time!).
-Craig