[erlang-questions] handling crash on db connect

Garrett Smith g@REDACTED
Thu Jun 6 18:49:29 CEST 2013


On Thu, Jun 6, 2013 at 11:35 AM, Paul Rubin <paul@REDACTED> wrote:
> On Thu, Jun 6, 2013 at 8:58 AM, Garrett Smith <g@REDACTED> wrote:
>>
>> Hi Paul,
>> Sorry I'm not used to seeing questions about my Redis bindings
>> graduate to the main list :)
>
>
> Hey, thanks for the reply.  I asked on the list first because I didn't want
> to bug you on github until I was more confident that I wasn't doing
> something silly.
>
>>
>> redis:connect is a wrapper for the process start_link. The return
>> value is standard for an OTP process: either {ok, Pid} or {error,
>> Reason}.
>
>
> Yes, this is what I expected, but in fact when I say
>     A = redis:connect().
> when the redis server is not running, I get a crash instead of
> {error,Reason}.  Is that what's supposed to happen?  I'm still not clear on
> this, as it conflicts with what you say further down.  What's the point of
> returning {ok, Pid} instead of just Pid in the non-error case, if there's no
> possibility of returning anything without ok, in the error case?

This is an excellent question!

The result {ok, Pid} simply means that the process has started and
initialized. You'll get {error, Reason} if there was a problem
starting the process (e.g. the registered name you asked for is
already taken).

If you're curious about the API, take a look around the gen_server
docs, in particular the start_link call:

http://erlang.org/doc/man/gen_server.html#start_link-3
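
For a concrete sense of the shape of it, this is the usual way to handle
that return value (a sketch with made-up module/registered names, not
code from the Redis client):

    %% start_link either gives you a running, initialized process...
    case gen_server:start_link({local, my_client}, my_client, [], []) of
        {ok, Pid} ->
            Pid;
        %% ...or an error, e.g. the name is already registered, in which
        %% case you could reuse the existing process.
        {error, {already_started, OtherPid}} ->
            OtherPid
    end.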

In the case of this Redis client, the process has started, but then
crashed. Because this is a start_link (emphasis on link), your calling
process also crashes, because of the link. A call to start, by contrast,
would not trigger this -- you'd get an {ok, Pid} return value, and then
the process Pid would immediately terminate.
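
If you want the calling process to see that failure as a value instead
of being killed, it has to trap exits so the exit signal from the link
arrives as a message -- roughly like this (an untested sketch; only
redis:connect/0 is assumed from the library):

    connect_or_error() ->
        process_flag(trap_exit, true),
        case redis:connect() of
            {ok, Pid} ->
                %% If the client dies later, this process now gets
                %% {'EXIT', Pid, Reason} as a message instead of being
                %% taken down by the link.
                {ok, Pid};
            {error, Reason} ->
                {error, Reason}
        end.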

I didn't implement any of the start functions for the client process.
It's arguably bad form not to implement a parallel start function for
each start_link -- but the cases where start_link is *wrong* are quite
rare (none occur to me, actually :)

>> Not a dumb error at all -- but neither a bug. This is by design and is
>> pretty common for OTP processes. In particular, risky code gets
>> executed in the context of the process (for the sake of proper
>> isolation) and calling processes need to trap exit to deal with
>> problems, or just let it all crash and get restarted by the
>> supervisor.
>
>
> OK, I guess I can trap exit, but I had thought of trap_exit itself as being
> risky and generally best left to the OTP supervision libraries except for
> special circumstances, and a database being down is relatively normal.  The
> "let it crash" approach would be
>    {ok, Pid} = redis:connect().
> which would crash with a pattern match failure in the case of an error
> return.

Yes, though it's not the end of the world to handle crashed processes
explicitly -- supervisors have extremely naive recovery policies, and
it's not uncommon to have to implement something fancier. With server
connections, for example, you almost always want to wait for some
period before retrying, lest you DoS your servers :)
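
The shape of that retry is simple enough -- wait, then try again (a
fragment with an arbitrary delay, assuming the caller traps exits so a
failed connect comes back as a value rather than an exit):

    reconnect_after(Delay) ->
        timer:sleep(Delay),
        case redis:connect() of
            {ok, Pid}        -> {ok, Pid};
            {error, _Reason} -> reconnect_after(Delay)
        end.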

>>
>>
>> The question of how to handle connection problems can be tricky. I
>> typically bake this into a "connection handler" type process that
>> indeed traps exits and then figures out what to do -- or simply
>> lets the client process exit propagate up to the supervisor. I'll
>> typically have retry logic that waits for a period of time after
>> failures, logging attempts, errors, etc.
>
>
> Yes, that seems like the right thing, retry every few seconds until the db
> is back.  I just hadn't thought of trapping exit as part of it (as opposed
> to just checking for error value). I actually do have a separate gen_server
> making a persistent connection in its init/1 and holding onto it, and then
> other parts of my program call this gen_server which in turn makes the redis
> call.  When the init crashes, my top supervisor restarts the gen_server
> immediately, this repeats until MaxR runs out, and the whole VM crashes.  It
> came as a shock that a fairly routine error case could cause this to happen.
> I find myself wishing for a general additional OTP supervision strategy
> (one_for_one_delay, say) that on crash would attempt restart no more than
> once per retry period (e.g. 1 second).  I had kind of thought that was what
> MaxT does, but I guess not.
>>
>>
>> I can provide an example of this type of process -- or maybe something
>> like this would be appropriate as a utility within the library
>
>
> This would be great, thanks!

I'll do this -- thanks for pointing this out! This is a pretty central
problem that I glossed over.
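
In the meantime, here's roughly the shape I have in mind -- an untested
sketch, where only redis:connect/0 is assumed from the library and the
module name, registered name and retry interval are all made up:

    -module(redis_conn).
    -behaviour(gen_server).

    -export([start_link/0, connection/0]).
    -export([init/1, handle_call/3, handle_cast/2, handle_info/2,
             terminate/2, code_change/3]).

    -define(RETRY_AFTER, 5000).  %% ms to wait before another connect attempt

    start_link() ->
        gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

    %% Returns the current client pid, or undefined while disconnected.
    connection() ->
        gen_server:call(?MODULE, connection).

    init([]) ->
        process_flag(trap_exit, true),
        %% Don't connect in init/1 -- schedule the first attempt instead,
        %% so a down database can't take the supervision tree with it.
        self() ! connect,
        {ok, undefined}.

    handle_info(connect, undefined) ->
        case redis:connect() of
            {ok, Pid} ->
                {noreply, Pid};
            {error, Reason} ->
                error_logger:info_msg("redis connect failed: ~p~n", [Reason]),
                erlang:send_after(?RETRY_AFTER, self(), connect),
                {noreply, undefined}
        end;
    handle_info({'EXIT', Pid, Reason}, Pid) ->
        %% The linked client died -- log it, wait, then reconnect.
        error_logger:info_msg("redis connection lost: ~p~n", [Reason]),
        erlang:send_after(?RETRY_AFTER, self(), connect),
        {noreply, undefined};
    handle_info(_Other, State) ->
        {noreply, State}.

    handle_call(connection, _From, State) ->
        {reply, State, State}.

    handle_cast(_Msg, State) ->
        {noreply, State}.

    terminate(_Reason, _State) ->
        ok.

    code_change(_OldVsn, State, _Extra) ->
        {ok, State}.

The two points that matter are that init/1 never fails just because the
database is down, and that reconnect attempts are rate limited -- so a
dead database neither exhausts the supervisor's restart intensity nor
gets hammered with connects.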

I'll update the github issue when it's pushed.

Garrett


