[erlang-questions] Why have a supervisor behaviour?

Fri May 22 16:19:12 CEST 2015

On 05/22, Roger Lipscombe wrote:
>It turns out that I probably don't need a supervisor at all, then.
>
> [project description]
>
>It seems, however, that I *don't* really want a supervisor to handle
>restarting the Squirrel VM; it looks like the host should do it, and I
>might be able to remove my custom supervisor in favour of a standard
>'simple_one_for_one' supervisor to handle crashes in the host process.
>Not sure about that last -- I don't want one process hitting max
>restart intensity to bring down the other host processes.
>

Ah that's interesting. To reason about this, one question to ask is: 
what is it that your system guarantees to its subsequent processes? So 
if you have some form of front-end or client handling the order of 
spawning and restarting a VM (who do you do it on behalf of?), there's 
likely a restricted set of operations you provide, right?

Something like:

- Run task
- Interrupt task
- Get task status or state report
- Has the task completed?

Or possibly, if you're going event-based, the following events are to be 
expected:

- Task accepted
- VM booted
- VM failed
- Task aborted
- Task completion

Those are probably things you expect to provide and should work fine, 
because those are the kinds of failures you do expect all the time from 
the Squirrel VM itself. Furthermore, it's possible you'd eventually add 
in a backpressure mechanism ("only 10 VMs can run at a time for a user") 
or something like that. This means what you might want is the host 
process to always be able to provide that information, and isolate your 
user from the VM process' fickle behaviour.

So what does this tell us? What you guarantee when the supervision tree 
is booted is therefore:

- I can contact the system to know if I can host a VM and run it
- Once I am given a process, there's a manager (the host process) I can 
  talk to or expect to get information from.

There is no guarantee about the Squirrel VM being up and running and 
available; there's a good likelihood it's gonna be there, but in 
reality, it can go terribly bad and we just can't pretend it's not gonna 
take place.

This means that these two types of processes are the ones you want to be 
ready and available as soon as `init/1` has been executed. Whether a VM 
is available or not is not core functionality; what's core is that you 
can ask to get one, and know if it didn't work.
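
As a rough illustration of those two guarantees, here is a minimal 
sketch of a supervision tree, assuming OTP 18 or later for the map-based 
specs. The names `vm_sup`, `vm_host_sup`, `vm_manager` and `vm_host` are 
hypothetical and not taken from the project being discussed. Marking the 
hosts as `temporary` is one possible answer to the restart intensity 
worry, not the only one: a temporary child is never restarted, so its 
crashes do not count against the supervisor's intensity.

```erlang
-module(vm_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, top).

%% Top level: once this has booted, callers can ask for a host and
%% talk to it. Nothing here promises that a Squirrel VM is running.
init(top) ->
    SupFlags = #{strategy => rest_for_one, intensity => 3, period => 10},
    Children = [
        %% Supervisor for the host processes, which are added on demand
        %% with supervisor:start_child/2.
        #{id => vm_host_sup,
          start => {supervisor, start_link,
                    [{local, vm_host_sup}, ?MODULE, hosts]},
          type => supervisor,
          modules => [?MODULE]},
        %% Hypothetical front desk answering "can I host a VM and run it?".
        %% Started second, so rest_for_one restarts it if vm_host_sup dies.
        #{id => vm_manager,
          start => {vm_manager, start_link, []}}
    ],
    {ok, {SupFlags, Children}};
%% Host pool: one hypothetical vm_host process per VM.
init(hosts) ->
    SupFlags = #{strategy => simple_one_for_one,
                 intensity => 10, period => 60},
    Child = #{id => vm_host,
              start => {vm_host, start_link, []},
              %% Assumption: a dead host is reported to its caller
              %% rather than restarted behind its back.
              restart => temporary},
    {ok, {SupFlags, [Child]}}.
```
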

To really help figure this out, simply ask "Can my system still run if X 
is not there?" If it can run without it, then your main recovery 
mechanism should probably not be the supervisor through failed `init/1` 
calls; it's a thing that likely becomes your responsibility as a 
developer because it's a common event. It might need to move to 
`handle_info/2`. If the system can't run without it, encode it in the 
`init/1` function. It's a guarantee you have to make.
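
To make the `init/1` versus `handle_info/2` distinction concrete, here 
is a minimal sketch of a host process whose VM is allowed to be missing. 
Everything in it is an assumption made for illustration: the module name 
`vm_host`, the `StartVM` fun standing in for whatever actually boots the 
Squirrel VM, the `status/1` call, and the one second retry delay.

```erlang
-module(vm_host).
-behaviour(gen_server).

-export([start_link/1, status/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(RETRY_MS, 1000).

%% StartVM is a hypothetical fun() -> {ok, pid()} | {error, term()}.
start_link(StartVM) ->
    gen_server:start_link(?MODULE, StartVM, []).

status(Host) ->
    gen_server:call(Host, status).

init(StartVM) ->
    %% The VM is not booted here: a failed boot would fail init/1 and
    %% hand an everyday event over to the supervisor.
    self() ! boot_vm,
    {ok, #{start_vm => StartVM, vm => undefined}}.

%% The host can always answer, whether or not the VM is there.
handle_call(status, _From, State = #{vm := undefined}) ->
    {reply, vm_down, State};
handle_call(status, _From, State) ->
    {reply, vm_up, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info(boot_vm, State = #{start_vm := StartVM}) ->
    case StartVM() of
        {ok, Pid} ->
            erlang:monitor(process, Pid),
            {noreply, State#{vm := Pid}};
        {error, _Reason} ->
            %% Expected failure: try again later instead of crashing.
            erlang:send_after(?RETRY_MS, self(), boot_vm),
            {noreply, State}
    end;
handle_info({'DOWN', _Ref, process, Pid, _Reason}, State = #{vm := Pid}) ->
    %% The VM died; the host stays up and boots a new one.
    self() ! boot_vm,
    {noreply, State#{vm := undefined}};
handle_info(_Other, State) ->
    {noreply, State}.
```

If the VM were instead something the system cannot run without, the 
call to `StartVM()` would go in `init/1` and a failure there would stop 
the process from starting at all.
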

You'll find out that for some database connections, it's true. For some 
it's not and the DB *needs* to be there for the system to make sense.  
The supervisors then let you encode these requirements in your program 
structure, and their boot and shutdown sequences. Same for anything you 
may depend on.

Does this make sense?

Then to pick the exact supervision strategy and error handling 
mechanism, you can ask yourself what do you do when the host process 
dies. Can a new one take its place seamlessly? If not, then it's 
possible the error needs to bubble up (through a monitor or some 
message) to the caller so *they* decide whether to give up or try again.  
If you can do it transparently or it's a best effort mechanism, then 
yeah, just restarting the worker is enough.
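
For the case where the error has to bubble up, here is a minimal sketch 
of a caller that monitors the host process and makes its own decision. 
The module name `vm_client` and the `{task_done, HostPid, Result}` 
message are hypothetical; only the monitor and the `'DOWN'` message are 
standard Erlang.

```erlang
-module(vm_client).
-export([await/2]).

%% Waits for the (hypothetical) completion message from the host while
%% monitoring it, so that the host dying is reported to the caller
%% instead of leaving it waiting.
await(HostPid, Timeout) ->
    Ref = erlang:monitor(process, HostPid),
    receive
        {task_done, HostPid, Result} ->
            erlang:demonitor(Ref, [flush]),
            {ok, Result};
        {'DOWN', Ref, process, HostPid, Reason} ->
            %% The caller decides whether to give up or try again.
            {error, {host_down, Reason}}
    after Timeout ->
        erlang:demonitor(Ref, [flush]),
        {error, timeout}
    end.
```
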

"Let it crash" is a fun fun way to get going and to grow a system, but 
when it has reached some level of growth, we can't avoid starting to 
really reason about how we want things to fail. It lets us slowly 
discover the properties we want to expose to our users, and after a few 
solid crashes, it's entirely fine to reorganize a few bits of code to 
reflect the real world and its constraints.

What's great is that we've got all the building blocks and tools to 
reason about it and implement the solution properly.

Regards,
Fred.