[erlang-questions] How to identify the process which died and restart it?

Thu Jan 15 09:51:58 CET 2015

On Wednesday, 14 January 2015 05:52:41 Harit Himanshu wrote:
> Yes, because I am not on the chapter which teaches about it yet. My
> understanding is that the same semantics could be reached without using it
> as well.
> 
> Having said that, I do believe supervisor would be the best solution, but
> this problem has to be solved without Supervisor. I guess the exercise is
> given to understand how the semantics of Supervisor work.

I'm working on a (very rough, still very immature) intermediate text that 
starts with a raw, relatively featureless, non-OTP codebase -- which might 
illustrate what you want to know.

First, understand the "genesis" function here:
https://github.com/zxq9/erlmud/blob/master/erlmud-0.1/locman.erl#L33

It uses a list comprehension to pack a list of tuples like {Id, Pid}, 
where Id is the location's id within the game (usually a coordinate, but it 
could be any term) and Pid is the pid of the location process, which is 
linked to the manager. This managing process is trapping exits, so it will 
receive {'EXIT', Pid, Reason} messages when any one of the locations dies.
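
To make the pattern concrete, here is a minimal sketch of the same idea. It is 
not the erlmud code: the module name, init/1, the {Id, Conf} shape of the 
configuration list, and the stand-in location/2 process are all assumptions 
made up for illustration.

```erlang
-module(genesis_sketch).
-export([init/1, genesis/1, location/2]).

%% Hypothetical manager entry point. Trap exits *before* linking to any
%% child, so a dying child arrives as a message instead of killing us.
init(Confs) ->
    process_flag(trap_exit, true),
    genesis(Confs).

%% Build the live list: one {Id, Pid} tuple per configured location.
%% Confs is assumed to be a list of {Id, Conf} tuples.
genesis(Confs) ->
    [{Id, spawn_link(?MODULE, location, [Id, Conf])} || {Id, Conf} <- Confs].

%% Stand-in for a real location process: it only knows how to crash.
location(Id, _Conf) ->
    receive
        crash -> exit({crashed, Id})
    end.
```

With this in place, sending `crash` to any pid in the returned list makes the 
process that called init/1 receive {'EXIT', Pid, {crashed, Id}} in its mailbox.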

Second, check the receive clause here:
https://github.com/zxq9/erlmud/blob/master/erlmud-0.1/locman.erl#L44-L46

Note the receive clause above it, which matches on this process's manager 
(these are really *much* smaller supervisor/worker pairs in the OTP version of 
this, so "manager" equates to a supervisor). If an exit message does not come 
from the parent, then we must check if it is from one of the processes in that 
Live list that genesis/1 packed during initialization. That check happens in 
the handle_exit/3 function here:
https://github.com/zxq9/erlmud/blob/master/erlmud-0.1/locman.erl#L68-L80

First we check if the Pid is present in the list of processes we think are 
live. If it is, we know a game location just died and needs to be restarted. 
We first scrub the list to rid ourselves of the now dead entry in the live 
list, then get the configuration data for that location Id from the list of 
location conf entries (which may have changed since the room was started 
last), restart the room, add its {Id, Pid} tuple back to the list, and 
return to our loop.
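
The steps just described can be sketched roughly as follows. Again, this is 
not the linked code: the argument order of handle_exit/3, the loop/3 state, 
start_location/2, and the {Id, Conf} configuration shape are assumptions 
chosen for the example.

```erlang
-module(restart_sketch).
-export([loop/3, handle_exit/3, start_location/2, location/2]).

%% Hypothetical manager loop. Parent is already bound, so the first clause
%% only matches an exit coming from the parent; any other exit falls through.
loop(Parent, Live, Confs) ->
    receive
        {'EXIT', Parent, Reason} ->
            exit(Reason);
        {'EXIT', Pid, _Reason} ->
            loop(Parent, handle_exit(Pid, Live, Confs), Confs)
    end.

%% Returns the updated live list.
handle_exit(Pid, Live, Confs) ->
    case lists:keyfind(Pid, 2, Live) of
        {Id, Pid} ->
            %% A known location died: drop the stale entry ...
            Scrubbed = lists:keydelete(Pid, 2, Live),
            %% ... look up its (possibly updated) configuration ...
            {Id, Conf} = lists:keyfind(Id, 1, Confs),
            %% ... and restart it under the same Id.
            [{Id, start_location(Id, Conf)} | Scrubbed];
        false ->
            %% Exit from a process we do not track: nothing to restart.
            Live
    end.

start_location(Id, Conf) ->
    spawn_link(?MODULE, location, [Id, Conf]).

%% Stand-in for a real location process.
location(Id, _Conf) ->
    receive
        crash -> exit({crashed, Id})
    end.
```

Note that this sketch restarts unconditionally. It has no restart intensity 
limit, so a location that crashes on startup would be restarted forever.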

So the trick is all in how you interpret the exit message, and how you keep 
track of your linked or monitored processes. A very similar process happens in 
wayman, here:
https://github.com/zxq9/erlmud/blob/master/erlmud-0.1/wayman.erl#L75-L85

But note that wayman is monitoring, not linking. This is because the locations 
are linked to their own ways, so the wayman is more of an ad hoc registry 
(which is completely unnecessary in the OTP version of this project) than a 
real supervisor. Because it maintains monitors, not links, it receives 
'DOWN' messages and checks its registry of live ways from there.
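
For comparison, a monitor-based registry might look something like the sketch 
below. The {Id, Pid, Ref} entry shape and the function names track/2 and 
handle_down/2 are invented for illustration, and I am assuming the registry 
only forgets dead entries rather than restarting them.

```erlang
-module(monitor_sketch).
-export([track/2, handle_down/2]).

%% Start monitoring Pid and remember it under Id.
%% Registry entries are {Id, Pid, MonitorRef} tuples.
track({Id, Pid}, Registry) ->
    Ref = erlang:monitor(process, Pid),
    [{Id, Pid, Ref} | Registry].

%% Handle a {'DOWN', Ref, process, Pid, Reason} message taken from the
%% mailbox: drop the matching entry, or leave the registry untouched if
%% the reference is not one we know about.
handle_down({'DOWN', Ref, process, Pid, _Reason}, Registry) ->
    case lists:keyfind(Ref, 3, Registry) of
        {_Id, Pid, Ref} -> lists:keydelete(Ref, 3, Registry);
        false -> Registry
    end.
```

Unlike the linked case, no trap_exit flag is needed here: a monitor is 
one-directional, so the watcher is only notified and never dragged down with 
the process it watches.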

Hopefully this sheds a bit of light on how this sort of thing can be handled 
manually.

In 99% of cases you really will want to use supervisors and think through how 
you can best restore a stable, known state to things that die based on the 
facilities OTP provides, rather than go to the effort of writing supervisory 
processes by hand all the time. The basic problem is that even if your 
hand-written 
supervisors are perfect all the time, they essentially wind up representing 
boilerplate that is often larger than the part of your code that actually 
solves user problems! Hand-written supervisors are also a tremendously 
tempting place to put "just one more feature" that supervisors shouldn't have 
to begin with -- thus increasing the chance your supervisors will themselves 
crash, potentially threatening your "crash kernel" (or at least bringing real 
disaster one step closer to it).

Please forgive the incomplete, convoluted, and poorly explained nature of the 
code and the project -- I'm nowhere near done with it (if only I had the 
time!).

-Craig