[erlang-questions] handling crash on db connect

Fred Hebert mononcqc@REDACTED
Fri Jun 7 14:54:34 CEST 2013


Ah this is an interesting bit that I wanted to respond to. I'll do so
inline.

On 06/06, Paul Rubin wrote:
> 
> OK, that is another case of my expectations being wrong, but it calls into
> question why I went to the trouble of learning Erlang in the first place
> ;)  Just about all other languages stop the program when something fails,
> but I thought the idea of Erlang was to keep going no matter what!  Of
> course I want it to get my attention, but I thought that it did that by
> writing "CRASH REPORT" into the sasl log in ALL CAPS so I could grep for it
> the next morning, instead of blowing up my entire service when a
> non-essential part of the program hits a glitch.  The whole idea of
> supervision, hot code patches, appups, relups, etc. is to be able to write
> programs that are just NEVER DOWN, even if the code has bugs.  Anything
> that makes the VM crash for reasons short of hardware meltdown seems wrong
> from that point of view.
> 

Erlang's idea is not to keep going no matter what, it's to keep going
even though there are individual failures. This doesn't mean that all
amounts of failure are equal. Would you expect your node to
blindly try to start on its own to infinity without ever booting if the
problem was that the code cannot be found/loaded? What if you didn't
even call a supervisor at the top of your start function for the
application behaviour?

It *is* a useful feature to be able to tell a supervisor to give up
after a while. You can do so by specifying MaxR (the maximum number of
restarts) and MaxT (the time window, in seconds) according to what you
want. If you never want it to give up, you could always use a MaxR of
999999999 over a MaxT of 1 second and hope the code can't fail fast
enough to exceed that. You may still kill the node by logging so much
that the error logger runs out of memory, though.
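
For instance (a minimal sketch; the module and child names are made up
for illustration), such a 'never give up' supervisor could declare its
restart intensity like this:

    -module(stubborn_sup).
    -behaviour(supervisor).
    -export([start_link/0, init/1]).

    start_link() ->
        supervisor:start_link({local, ?MODULE}, ?MODULE, []).

    init([]) ->
        %% allow up to 999999999 restarts (MaxR) per 1 second (MaxT);
        %% in practice the supervisor never shuts itself down.
        MaxR = 999999999,
        MaxT = 1,
        {ok, {{one_for_one, MaxR, MaxT},
              [{db_worker, {db_worker, start_link, []},
                permanent, 5000, worker, [db_worker]}]}}.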

Moreover, once you have divided your system into OTP applications, you
can decide which ones are vital and which are not. Each OTP application
can be started in one of three ways: temporary, transient, or permanent,
either manually with application:start(Name, Type), or in the release
configuration handled by systools or reltool.

- permanent: if the app terminates, the entire system is taken down,
  except when the app is stopped manually with application:stop/1.
- transient: if the app terminates for reason 'normal', that's ok. Any
  other reason for termination shuts down the entire system.
- temporary: the application is allowed to stop for any reason. It will
  be reported, but nothing bad will happen.

Note that OTP will not keep restarting applications indefinitely. It
will just stop running them (restarting the processes inside an app is
its top supervisor's responsibility). If your node can really live
without an app, you may want to make it 'temporary'.
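
As a rough sketch (the application name and version numbers below are
placeholders), the start type can be given either at the call site or in
the release resource file consumed by systools/reltool:

    %% manually:
    application:start(my_db_client, temporary).

    %% or in the .rel file:
    {release, {"my_release", "1"}, {erts, "5.10.1"},
     [{kernel, "2.16.1"},
      {stdlib, "1.19.1"},
      {my_db_client, "0.1.0", temporary}]}.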

Nothing in there says "OTP shuts down the node all the time" nearly as
much as the configuration you give it.

> If you start to look at your Erlang applications as ecosystems of
> independent services, you can start to think about shoring up each
> service to improve its availability, performance, etc. -- just as one
> might in a service oriented architecture. In the case of your Redis
> service, you want something that advocates for your Redis DB
> availability. That advocate (an Erlang process) can deal with
> connections, retries, error handling, etc. as an intelligent facade to
> the underlying Redis client process (which can come and go at any time
> depending on network, Redis server availability, etc.)
> 
> Of course I'd expect to have to shore up the individual services as
> problems with them arose and patterns got identified, but in the meanwhile
> I'd expect a service being down to just mean I couldn't use it.  Like if
> Github is down for some reason, I'd expect to get a connection error if I
> try to access it from the office, but I wouldn't expect Github's outage to
> set off a demolition for the whole building my office is in.  That's sort
> of what start_link does, from what I gather.  The failure path should
> instead be isolated by default, propagating only to the parts of the code
> that actually depend on the things that failed.  So it seems to me, using
> start_link in a client isn't right, unless the fates of the two interacting
> processes are actually entwined in a way that neither one could conceivably
> run without the other.  Other independent processes should be able to keep
> running.

start_link is synchronous for one simple reason: so you can know that
whatever was started in the supervision tree is generally available by
the time start_link returns.

This is useful within a single app for the case where, for example, you
have internal dependencies:

         __________[app_supersup]
        /                |
  [app_ets_sup]      [app_sup]
  |   |   |   |          |
  [ets tables ]      [app_workers]

In this case, synchronous supervisors mean that app_ets_sup is started
and has started its ETS tables when app_sup comes up, and I can safely
assume my resources are there without any tricky code that polls for
their existence in the app_workers' init functions. Similarly, these
guarantees generally extend to other applications (because they also use
supervisors), so I know that when I started 'gproc' or 'poolboy', I
should be able to rely on them internally, without repeatedly looking
for some registered process and hoping it comes up soon. It reduces the
complexity of my own code.
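
As a minimal sketch of the top of such a tree (the supervisor module
names match the diagram above; everything else is assumed):

    -module(app_supersup).
    -behaviour(supervisor).
    -export([start_link/0, init/1]).

    start_link() ->
        supervisor:start_link({local, ?MODULE}, ?MODULE, []).

    init([]) ->
        %% children start synchronously, in order: by the time app_sup
        %% (and therefore app_workers) is started, app_ets_sup has
        %% already set up its ETS tables.
        {ok, {{rest_for_one, 5, 10},
              [{app_ets_sup, {app_ets_sup, start_link, []},
                permanent, infinity, supervisor, [app_ets_sup]},
               {app_sup, {app_sup, start_link, []},
                permanent, infinity, supervisor, [app_sup]}]}}.

With rest_for_one, losing app_ets_sup (and its tables) also restarts
app_sup, while a crash in app_sup alone leaves the tables untouched.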

Now when we look at external dependencies (services, databases, etc.) we
have many ways to design things, and here's part of the problem you had:

Requiring a connection to be established for 'init' to work (and hence
for start_link to return successfully) implicitly means that
applications started after yours can expect the connection to be
there when they boot.

This may make sense if, for example, you're going over a loopback
interface that should not break (UDP, TCP, who cares) because it's
local. These do not need an 'advocate'.

If you expect that the connection can logically be impossible to
establish and that you should expect failures, then by all means do not
return it as part of init as a guarantee. Just guarantee that a manager
for the connection will be available to tell you whether it's online or
not. Yes, this means you'll need an asynchronous start, but you'll most
likely need a 'reconnect' piece of code in your gen_server anyway,
because you expect the connection to fail at some point.

So instead of doing:

    init(Args) ->
        Opts = parse_args(Args),
        %% crashes init (and therefore start_link) if the connection
        %% cannot be established
        {ok, Port} = connect(Opts),
        {ok, #state{sock=Port, opts=Opts}}.

    [...]

    handle_info(reconnect, S = #state{sock=undefined, opts=Opts}) ->
        case connect(Opts) of
            {ok, New} -> {noreply, S#state{sock=New}};
            _ -> self() ! reconnect, {noreply, S}
        end;

You do:

    init(Args) ->
        Opts = parse_args(Args),
        %% you could try connecting here anyway, for a best effort thing
        %% but be ready to not have a connection.
        self() ! reconnect,
        {ok, #state{sock=undefined, opts=Opts}}.

    [...]

    handle_info(reconnect, S = #state{sock=undefined, opts=Opts}) ->
        %% keep retrying until the connection is finally established
        case connect(Opts) of
            {ok, New} -> {noreply, S#state{sock=New}};
            _ -> self() ! reconnect, {noreply, S}
        end;

And otherwise the code doesn't really change, because you expected
failures anyway.

You just changed the kind of guarantees you gave to applications and
processes started after yours from "the connection is available" to "the
connection manager is available". Then document that somewhere, whether
in a README.md, in EDoc, or anywhere else.
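
One way (not the only one) to express that weaker guarantee is to have
the manager answer calls even while it's disconnected; the function name
and return values here are just assumptions for the sake of the example:

    %% public API of the hypothetical manager
    get_connection(Pid) ->
        gen_server:call(Pid, get_connection).

    %% in the gen_server callbacks
    handle_call(get_connection, _From, S = #state{sock=undefined}) ->
        %% the manager is alive, but the connection isn't there (yet)
        {reply, {error, disconnected}, S};
    handle_call(get_connection, _From, S = #state{sock=Sock}) ->
        {reply, {ok, Sock}, S};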

This 'failure path' isolation is yours to do, because you, as the
application developer, are the one who knows which errors are worth
breaking on and which are not. You may, for example, want to crash if
you don't have permission to edit files on disk when you expect to.
Putting more control back into your hands does mean you have more
responsibilities and more ways to make mistakes, but it is a very small
change that basically lets you start your application as permanent
instead of temporary.

Most of the hard decision making, where you and I will generally make
mistakes, happens when code calls your connection manager: Am I able to
get a connection? Am I able to send something over the connection, or did
it fail? Did I get the value I expected out of it? Did the connection
fail after I received part of my answer? Did it just time out? Not
dealing with these correctly has the potential to kill your own app,
even though the connection manager is there and running well, whether or
not it managed to get the connection in its init. That does not
change depending on how supervisors work.
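
As a rough sketch of what that looks like at a call site (the db_manager
module, get_connection/1 and send_query/2 are all made-up names building
on the example above):

    query(Pid, Query) ->
        case db_manager:get_connection(Pid) of
            {ok, Sock} ->
                case send_query(Sock, Query) of
                    {ok, Result}     -> {ok, Result};
                    {error, timeout} -> {error, timeout};    %% retry? give up?
                    {error, closed}  -> {error, unavailable} %% died mid-request
                end;
            {error, disconnected} ->
                %% the manager is up, the database isn't: degrade
                %% gracefully instead of crashing the caller
                {error, unavailable}
        end.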

This answer got to be much longer than I originally anticipated. I hope
this helps, though!

Regards,
Fred.



