[erlang-questions] Architectural quandaries - application supervisor?

Wed Oct 8 16:57:52 CEST 2014

On Wed, Sep 17, 2014 at 11:54 AM, David Welton <davidnwelton@REDACTED> wrote:
> On Tue, Sep 16, 2014 at 12:42 PM, Daniel Abrahamsson
> <daniel.abrahamsson@REDACTED> wrote:
>> To me it seems like what you want is a circuit breaker (like Jesper's fuse
>> [0], or breaky [1]) in front of the hardware module. Construct your Erlang
>> code so that it does not crash on hardware errors, but instead melts the
>> fuse. The supervisor is thus only involved when something strange happens in
>> your code and where a restart will bring you back to known state. We use
>> something similar (managed by [2], the name collision with [0] is purely
>> coincidental) for graceful degradation if our HSMs go down. We also use [2]
>> for managing database connections. In fact, you can use this strategy for
>> dealing with any kind of external service without risking taking down your
>> supervision tree.
>>
>> //Daniel
>>
>> [0] https://github.com/jlouis/fuse
>> [1] https://github.com/mmzeeman/breaky
>> [2] https://github.com/ulfl/fuse-lb
>
> Aha!
>
> Yes, that's probably what I want, or very close to it.  It's a pity
> the concept is not more widely documented, as it's very important for
> dealing with external services that may be down at some point that
> should not, however, pull Erlang down with them.  If it were up to me,
> I'd even put something like it in OTP, because it seems very likely
> that any large enough project will encounter a need like it.

While a circuit breaker might have been a good thing to use while
writing the original code, we didn't use one, and so we have a complete
application, with a reasonably nice supervision tree.

I'm experimenting with some code that manages applications: you tell
it to start an application, and it adds it to a list.  Then it listens
for events using error_logger:add_report_handler(?SERVER), and
dispatches on them.  This could be used to restart the dead
application, or simply to notify the user that things have gone really
badly and perhaps they should contact someone for help.  The system
will not, however, crash and restart as it would if the application
were permanent, so the user could still access some diagnostic tools
and continue to use the system in a limited way.
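
To make the idea concrete, here is a minimal sketch of such a manager,
written as a gen_event handler for error_logger.  The module name
app_watcher, the watch/2 function and the {app_exited, App, Reason}
message are made up for illustration.  The shape of the report it
matches (an info report carrying `application` and `exited` keys) is
an assumption about what the application controller logs when an
application terminates, and should be checked against the OTP release
in use.

```erlang
-module(app_watcher).
-behaviour(gen_event).

-export([watch/2]).
-export([init/1, handle_event/2, handle_call/2, handle_info/2,
         terminate/2, code_change/3]).

%% Install this module as an error_logger report handler, then start
%% each application in Apps as 'temporary', so that its death is
%% reported but does not take the node down.
%% Notify is the pid that receives {app_exited, App, Reason}.
%% Returns a list of {App, StartResult} pairs.
watch(Apps, Notify) when is_list(Apps), is_pid(Notify) ->
    ok = error_logger:add_report_handler(?MODULE, {Apps, Notify}),
    [{App, application:start(App, temporary)} || App <- Apps].

%% Handler state: the watched applications and the pid to notify.
init({Apps, Notify}) ->
    {ok, {Apps, Notify}}.

%% ASSUMPTION: an application exit arrives as an info report whose
%% body is a list containing {application, App} and {exited, Reason}.
%% A plain application:stop/1 is expected to show up here as well,
%% with Reason = stopped.
handle_event({info_report, _GL, {_Pid, std_info, Report}},
             {Apps, Notify} = State) when is_list(Report) ->
    case {lists:keyfind(application, 1, Report),
          lists:keyfind(exited, 1, Report)} of
        {{application, App}, {exited, Reason}} ->
            case lists:member(App, Apps) of
                true  -> Notify ! {app_exited, App, Reason};
                false -> ok
            end;
        _ ->
            ok
    end,
    {ok, State};
handle_event(_Other, State) ->
    %% Every other log event is ignored.
    {ok, State}.

handle_call(_Request, State) -> {ok, ok, State}.
handle_info(_Msg, State)     -> {ok, State}.
terminate(_Arg, _State)      -> ok.
code_change(_Old, State, _)  -> {ok, State}.
```

Something like app_watcher:watch([hw_app], self()) (hw_app being a
placeholder name) would then leave it to the notified process to
decide whether to restart the application or just tell the user.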

In our case, the application would cover a large portion of the system
that interacts with the specialized hardware - without it, the
machine can't do its job.  A circuit breaker could probably be
employed too, but I like the idea of handling "the hardware" as a
unit.  The supervision tree in that application already does a good
job of trying to restart individual pieces (there are a number of
them) in an appropriate way.
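
For comparison, the circuit breaker idea itself can be sketched in a
few lines.  This is not the API of fuse or breaky, just a hand-rolled
illustration with made-up names (breaker_sketch, new/2, melt/2,
ask/2): hardware errors "melt" the breaker instead of crashing a
process, and callers ask it before touching the hardware.  Times are
passed in explicitly, in milliseconds, to keep the sketch pure.

```erlang
-module(breaker_sketch).
-export([new/2, melt/2, ask/2]).

%% A breaker tolerates MaxMelts failures within WindowMs milliseconds;
%% one more failure inside the window blows it.
new(MaxMelts, WindowMs) ->
    {breaker, MaxMelts, WindowMs, []}.

%% Record a hardware failure observed at time NowMs.
%% Failures that have fallen out of the window are dropped.
melt(NowMs, {breaker, Max, Window, Melts}) ->
    {breaker, Max, Window, [NowMs | recent(NowMs, Window, Melts)]}.

%% Returns ok while the hardware may be called, and blown while more
%% than MaxMelts failures lie inside the window ending at NowMs.
ask(NowMs, {breaker, Max, Window, Melts}) ->
    case length(recent(NowMs, Window, Melts)) > Max of
        true  -> blown;
        false -> ok
    end.

%% Timestamps still inside the window ending at NowMs.
recent(NowMs, Window, Melts) ->
    [T || T <- Melts, NowMs - T < Window].
```

The breaker heals on its own once the old failures age out of the
window, whereas handling the application as a unit keeps the decision
about what to do next in one place.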

Thoughts?
-- 
David N. Welton

http://www.welton.it/davidw/

http://www.dedasys.com/