[erlang-questions] handling crash on db connect

Fri Jun 7 00:16:14 CEST 2013

Paul,

I made some tweaks, which I think are the Right Thing, thanks to your feedback!

- redis:connect is now a pass-through to the new redis_client:start functions

- Added redis:connect_link, which keeps the old (linking) behavior
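
To make the difference between the two bullets concrete, here is a minimal sketch of how start and start_link behave when the started process dies. The module link_demo and its helper caller_survives/1 are hypothetical names for illustration and are not part of erlang-redis; only the standard gen_server:start/3 and gen_server:start_link/3 calls are assumed.

```erlang
-module(link_demo).
-behaviour(gen_server).

-export([start/0, start_link/0, run/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% Unlinked start: the caller is not tied to the server's fate.
start() -> gen_server:start(?MODULE, [], []).

%% Linked start: if the server dies abnormally, so does the caller
%% (unless the caller traps exits).
start_link() -> gen_server:start_link(?MODULE, [], []).

init([]) -> {ok, nostate}.
handle_call(_Req, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.
handle_info(_Info, State) -> {noreply, State}.

%% Hypothetical helper: spawn a caller that starts a server with
%% StartFun, kill the server, and report whether the caller survived.
caller_survives(StartFun) ->
    Parent = self(),
    Caller = spawn(fun() ->
        {ok, Server} = StartFun(),
        Parent ! {server, Server},
        receive stop -> ok end
    end),
    Ref = erlang:monitor(process, Caller),
    Server = receive {server, S} -> S end,
    exit(Server, kill),
    receive
        {'DOWN', Ref, process, Caller, _Reason} ->
            false
    after 200 ->
        erlang:demonitor(Ref, [flush]),
        Caller ! stop,
        true
    end.

%% Returns {SurvivesWithStart, SurvivesWithStartLink}.
run() ->
    {caller_survives(fun start/0), caller_survives(fun start_link/0)}.
```

Calling link_demo:run() should show the caller surviving the unlinked start and dying with the linked one, which is the behavior that prompted the change.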

Here's an example process, which covers the points I made earlier about
a connection-manager/front-end to an underlying client process:

https://github.com/gar1t/erlang-redis/blob/master/examples/dbservice/src/dbservice.erl

How to use the example is here:

https://github.com/gar1t/erlang-redis/blob/master/examples/dbservice/README.md

On Thu, Jun 6, 2013 at 2:01 PM, Paul Rubin <paul@REDACTED> wrote:
> On Thu, Jun 6, 2013 at 9:49 AM, Garrett Smith <g@REDACTED> wrote:
>>
>> redis:connect is a wrapper for the process start_link....
>> The result {ok, Pid} simply means that the process has started and
>> initialized.
>
>
> Hey, thanks again, and aha, ok, this makes sense in terms of how start_link
> behaves.  It still comes as a surprise that redis:connect creates such a
> link though.  My program (typical for Erlang) accepts 1000's of client
> connections with a separate process for each one.  It uses redis for user
> authentication, so a new connection arrives, there is a redis lookup, and
> then the connection stays open "forever" (weeks, months...) without
> accessing redis again.  If redis is down, I'd expect new authentication
> attempts to fail, but old connections to keep running as long as they don't
> try to talk to redis.  Redis is never supposed to be down so I could imagine
> having this program run for a long time without ever encountering the issue.
> I don't have an automatic test for handling redis being down (I guess I
> should write one).  I noticed this with a manual test, but I thought part of
> the idea of OTP was to handle problems gracefully, including software bugs
> resulting from error conditions that weren't thought of during testing.

But if it *can* be down, it *will* be down, at some point :)

>> I didn't implement any of the start functions for the client process.
>> It's arguably bad form to not implement parallel start functions to
>> each start_link -- but the cases where start_link is *wrong* are quite
>> rare (none occur to me actually :)
>
>
> start_link seems wrong to me in this case, but I still find OTP somewhat
> murky so I can't claim to have that good intuition about what it should or
> shouldn't do.

Good use case for a start function, which we have now :)

>> Yes, though it's not the end of the world to handle crashed processes
>> explicitly -- supervisors have extremely naive recovery policies and
>> it's not uncommon to have to implement something fancier.
>
> I wonder if what's really going on here is a gap in the available
> supervision strategies.  I was a bit surprised to learn from the supervisor
> docs,
>
> "To prevent a supervisor from getting into an infinite loop of child process
> terminations and restarts, a maximum restart frequency is defined using two
> integer values MaxR and MaxT. If more than MaxR restarts occur within MaxT
> seconds, the supervisor terminates all child processes and then itself."
>
> I had somehow thought that if MaxR restarts happened in MaxT seconds, the
> supervisor would just sleep until MaxT seconds had passed, then start
> retrying again (i.e. limit the frequency rather than the absolute count).
> It does seem to me there should be an option for something like that.
> Restarts due to external process failures seem like a common situation to
> want to deal with, so (in my naive expectation) I'd think OTP should
> implement an infinite sleep/retry loop as one of its restart strategies.

There's a general lack-of-appetite to make supervisors any more
complicated than they are -- but this topic comes up once in a while.
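
For reference, the MaxR/MaxT limit quoted above is set in the tuple a supervisor returns from init/1. The sketch below uses a hypothetical module name (demo_sup) and illustrative values of 5 restarts in 10 seconds; the child list is left empty so the module runs on its own.

```erlang
-module(demo_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% Illustrative values: more than MaxR restarts within MaxT seconds
    %% makes the supervisor terminate its children and then itself.
    MaxR = 5,
    MaxT = 10,
    %% Child specs would go in the list; it is empty here so the
    %% sketch does not depend on any other module.
    {ok, {{one_for_one, MaxR, MaxT}, []}}.
```

There is no built-in option here for "sleep and keep retrying", which is why any delay has to live in the supervised process.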

Take a look at the dbservice application and the way a standard
supervisor is used with a retry delay in the supervised process -- I
use this pattern all the time and it's proved pretty effective for
these applications.
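
A minimal sketch of the retry-delay pattern described above, under stated assumptions: this is not a copy of dbservice.erl, and the module name retry_conn, the ConnectFun argument, and the status/1 call are hypothetical. The idea is that init/1 never fails on a connection error, so the supervisor does not burn through its restart limit; the process retries on its own timer instead.

```erlang
-module(retry_conn).
-behaviour(gen_server).

-export([start_link/2, status/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-record(state, {connect, delay, conn}).

%% ConnectFun is a fun() -> {ok, Conn} | {error, Reason} (an assumption
%% standing in for the real client connect call).
%% RetryDelay is in milliseconds.
start_link(ConnectFun, RetryDelay) ->
    gen_server:start_link(?MODULE, {ConnectFun, RetryDelay}, []).

status(Server) ->
    gen_server:call(Server, status).

init({ConnectFun, RetryDelay}) ->
    %% Defer the connect so that init/1 always succeeds.
    self() ! connect,
    {ok, #state{connect = ConnectFun, delay = RetryDelay}}.

%% While disconnected, callers get an error instead of a crash.
handle_call(status, _From, #state{conn = undefined} = S) ->
    {reply, {error, not_connected}, S};
handle_call(status, _From, #state{conn = Conn} = S) ->
    {reply, {ok, Conn}, S}.

handle_cast(_Msg, S) ->
    {noreply, S}.

handle_info(connect, #state{connect = Connect, delay = Delay} = S) ->
    case Connect() of
        {ok, Conn} ->
            {noreply, S#state{conn = Conn}};
        {error, _Reason} ->
            %% Wait before retrying rather than hammering the server.
            erlang:send_after(Delay, self(), connect),
            {noreply, S}
    end;
handle_info(_Info, S) ->
    {noreply, S}.
```

With this shape, a down database means new requests get {error, not_connected} while everything else keeps running. A fuller version would presumably also watch the established connection (for example by monitoring the client process) and go back to the connect step when it drops.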

>> In the case of server connections, e.g. you almost always want to wait
>> for some period before retrying, lest you DoS your servers :)
>
>
> Yes.  This is the common situation referred to above ;)
>>
>>
>> I'll do this -- thanks for pointing this out! This is a pretty central
>> problem that I glossed over.  I'll update the github issue when it's pushed
>
> Great!   Thanks!

Thank you -- great input/feedback!

Garrett