[erlang-questions] Why have a supervisor behaviour?
Fred Hebert
mononcqc@REDACTED
Fri May 22 16:19:12 CEST 2015
On 05/22, Roger Lipscombe wrote:
>It turns out that I probably don't need a supervisor at all, then.
>
> [project description]
>
>It seems, however, that I *don't* really want a supervisor to handle
>restarting the Squirrel VM; it looks like the host should do it, and I
>might be able to remove my custom supervisor in favour of a standard
>'simple_one_for_one' supervisor to handle crashes in the host process.
>Not sure about that last -- I don't want one process hitting max
>restart intensity to bring down the other host processes.
>
Ah that's interesting. To reason about this, one question to ask is:
what is it that your system guarantees to its subsequent processes? So
if you have some form of front-end or client handling the order of
spawning and restarting a VM (who do you do it on behalf of?), there's
likely a restricted set of operations you provide, right?
Something like:
- Run task
- Interrupt task
- Get task status or state report
- Has the task completed?
Or possibly, if you're going event-based, the following events are to be
expected:
- Task accepted
- VM booted
- VM failed
- Task aborted
- Task completion
Those are probably things you expect to provide and should work fine,
because those are the kinds of failures you do expect all the time from
the Squirrel VM itself. Furthermore, it's possible you'd eventually add
in a backpressure mechanism ("only 10 VMs can run at a time for a user")
or something like that. This means what you might want is the host
process to always be able to provide that information, and isolate your
user from the VM process' fickle behaviour.
So what does this tell us? What you guarantee when the supervision tree
is booted is therefore:
- I can contact the system to know if I can host a VM and run it
- Once I am given a process, there's a manager (the host process) I can
talk to or expect to get information from.
There is no guarantee about the Squirrel VM being up and running and
available; there's a good likelihood it's gonna be there, but in
reality, it can go terribly bad and we just can't pretend it's not gonna
take place.
This means that these two types of processes are those you want to be
ready and available as soon as `init/1` has been executed. Whether a VM is
available or not is not core functionality; what's core is that you can
ask to get one, and know if it didn't work.
To really help figure this out, simply ask "Can my system still run if X
is not there?" If it can run without it, then your main recovery
mechanism should probably not be the supervisor through failed `init/1`
calls; it's a thing that likely becomes your responsibility as a
developer because it's a common event. It might need to move to
`handle_info/2`. If the system can't run without it, encode it in the
`init/1` function. It's a guarantee you have to make.
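To make the `init/1` versus `handle_info/2` split concrete, here is a
minimal sketch of a host process, not a definitive implementation. The
module name `vm_host`, the `boot_vm` message, the `BootFun` argument
(standing in for whatever really starts the Squirrel VM) and the
one-second retry delay are all assumptions made up for illustration.

```erlang
-module(vm_host).
-behaviour(gen_server).

-export([start_link/1, status/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% BootFun is a hypothetical fun() -> {ok, Vm} | {error, Reason}
%% standing in for the code that boots the VM.
start_link(BootFun) ->
    gen_server:start_link(?MODULE, [BootFun], []).

%% The host can always answer this, whether or not the VM is up.
status(Host) ->
    gen_server:call(Host, status).

init([BootFun]) ->
    %% Only guarantee that the host exists. Booting the VM is deferred
    %% to handle_info/2, so a VM failure cannot make init/1 fail.
    self() ! boot_vm,
    {ok, #{boot => BootFun, vm => undefined, status => booting}}.

handle_call(status, _From, State = #{status := Status}) ->
    {reply, Status, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info(boot_vm, State = #{boot := BootFun}) ->
    case BootFun() of
        {ok, Vm} ->
            {noreply, State#{vm := Vm, status := running}};
        {error, _Reason} ->
            %% An expected event, handled by the host itself:
            %% report it through status/1 and try again later.
            erlang:send_after(1000, self(), boot_vm),
            {noreply, State#{status := vm_down}}
    end;
handle_info(_Other, State) ->
    {noreply, State}.
```

If the VM were instead something the system cannot run without, the
`BootFun()` call would move into `init/1` and a failure there would be
reported to the supervisor.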
You'll find out that for some database connections, it's true. For some
it's not and the DB *needs* to be there for the system to make sense.
The supervisors then let you encode these requirements in your program
structure, and their boot and shutdown sequences. Same for anything you
may depend on.
Does this make sense?
Then to pick the exact supervision strategy and error handling
mechanism, you can ask yourself what do you do when the host process
dies. Can a new one take its place seamlessly? If not, then it's
possible the error needs to bubble up (through a monitor or some
message) to the caller so *they* decide whether to give up or try again.
If you can do it transparently or it's a best effort mechanism, then
yeah, just restarting the worker is enough.
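As a rough sketch of the "bubble up to the caller" option, assuming the
hosts are started as `temporary` children of a `simple_one_for_one`
supervisor: temporary children are never restarted, so a crashing host
does not count towards the supervisor's restart intensity and cannot
take its siblings down. The module name `vm_host_sup`, the `start_host/1`
and `start_worker/1` functions and the intensity values are hypothetical,
and the worker is a plain `proc_lib` process standing in for a real host.

```erlang
-module(vm_host_sup).
-behaviour(supervisor).

-export([start_link/0, start_host/1, start_worker/1]).
-export([init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Start a host and hand the caller a monitor, so that *they* get a
%% 'DOWN' message and decide whether to give up or try again.
start_host(Fun) ->
    case supervisor:start_child(?MODULE, [Fun]) of
        {ok, Pid} -> {ok, Pid, erlang:monitor(process, Pid)};
        Error -> Error
    end.

%% Stand-in worker (an assumption): a real host would more likely be
%% a gen_server with its own start_link function.
start_worker(Fun) ->
    {ok, proc_lib:spawn_link(Fun)}.

init([]) ->
    %% At most 5 restarts in 10 seconds; illustrative values only.
    SupFlags = {simple_one_for_one, 5, 10},
    %% 'temporary': the supervisor never restarts this child.
    Child = {host, {?MODULE, start_worker, []},
             temporary, 5000, worker, [?MODULE]},
    {ok, {SupFlags, [Child]}}.
```

A caller would then do something like
`{ok, Pid, Ref} = vm_host_sup:start_host(Fun)` and wait for
`{'DOWN', Ref, process, Pid, Reason}`. If transparent restarts are good
enough, switching `temporary` to `transient` or `permanent` brings back
supervisor restarts, along with the restart intensity concern.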
"Let it crash" is a fun way to get going and to grow a system, but
when it has reached some level of growth, we can't avoid starting to
really reason about how we want things to fail. It lets us slowly
discover the properties we want to expose to our users, and after a few
solid crashes, it's entirely fine to reorganize a few bits of code to
reflect the real world and its constraints.
What's great is that we've got all the building blocks and tools to
reason about it and implement the solution properly.
Regards,
Fred.