[erlang-questions] handling crash on db connect

Paul Rubin paul@REDACTED
Thu Jun 6 21:01:28 CEST 2013


On Thu, Jun 6, 2013 at 9:49 AM, Garrett Smith <g@REDACTED> wrote:

>  >> redis:connect is a wrapper for the process start_link....
> The result {ok, Pid} simply means that the process has started and
> initialized.


Hey, thanks again, and aha, ok, this makes sense in terms of how start_link
behaves.  It still comes as a surprise that redis:connect creates such a
link, though.  My program (typical for Erlang) accepts thousands of client
connections with a separate process for each one.  It uses redis for user
authentication: when a new connection arrives, there is a redis lookup, and
then the connection stays open "forever" (weeks, months...) without
accessing redis again.  If redis is down, I'd expect new authentication
attempts to fail, but old connections to keep running as long as they don't
try to talk to redis.  Redis is never supposed to be down, so I could
imagine this program running for a long time without ever encountering
the issue.  I don't have an automatic test for handling redis being down (I
guess I should write one).  I noticed this with a manual test, but I
thought part of the idea of OTP was to handle problems gracefully,
including software bugs resulting from error conditions that weren't
thought of during testing.
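
(A sketch of how I've been working around it for now: confine the link to
a throwaway worker, so a crash in the linked redis client can't take down
the long-lived connection process.  check_credentials/3 and redis:connect/0
are stand-ins for my actual lookup code:)

    %% Do the redis lookup in an unlinked worker; if the worker (or the
    %% redis client it links to) dies, we just time out instead of dying.
    authenticate(User, Pass) ->
        Self = self(),
        Ref = make_ref(),
        spawn(fun() ->
                      Result = case redis:connect() of
                                   {ok, C}          -> check_credentials(C, User, Pass);
                                   {error, _} = Err -> Err
                               end,
                      Self ! {Ref, Result}
              end),
        receive
            {Ref, Result} -> Result
        after 5000 ->
            {error, timeout}   %% worker (and redis client) died or hung
        end.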

> I didn't implement any of the start functions for the client process.
> It's arguably bad form to not implement parallel start functions to
> each start_link -- but the cases where start_link is *wrong* are quite
> rare (none occur to me actually :)
>

start_link seems wrong to me in this case, but I still find OTP somewhat
murky, so I can't claim to have particularly good intuition about what it
should or shouldn't do.
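
(My reading of the "parallel start functions" convention is just exposing
both, so callers who don't want their fate tied to the client can opt out
of the link.  A sketch, with stub callbacks to make it a complete module:)

    -module(redis_client).
    -behaviour(gen_server).
    -export([start/1, start_link/1]).
    -export([init/1, handle_call/3, handle_cast/2, handle_info/2,
             terminate/2, code_change/3]).

    start(Opts)      -> gen_server:start(?MODULE, Opts, []).      %% no link
    start_link(Opts) -> gen_server:start_link(?MODULE, Opts, []). %% linked

    init(Opts) -> {ok, Opts}.
    handle_call(_Req, _From, State) -> {reply, ok, State}.
    handle_cast(_Msg, State) -> {noreply, State}.
    handle_info(_Msg, State) -> {noreply, State}.
    terminate(_Reason, _State) -> ok.
    code_change(_OldVsn, State, _Extra) -> {ok, State}.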

>
> Yes, though it's not the end of the world to handle crashed processes
> explicitly -- supervisors have extremely naive recovery policies and
> it's not uncommon to have to implement something fancier.


I wonder if what's really going on here is a gap in the available
supervision strategies.  I was a bit surprised to learn from the supervisor
docs,

"To prevent a supervisor from getting into an infinite loop of child
process terminations and restarts, a *maximum restart frequency* is defined
using two integer values MaxR and MaxT. If more than MaxR restarts occur
within MaxT seconds, the supervisor terminates all child processes and then
itself."

I had somehow thought that if MaxR restarts happened in MaxT seconds, the
supervisor would just sleep until MaxT seconds had passed, then start
retrying again (i.e. limit the frequency rather than the absolute count).
It does seem to me there should be an option for something like that.
Restarts due to external process failures seem like a common situation to
want to deal with, so (in my naive expectation) I'd think OTP should
implement an infinite sleep/retry loop as one of its restart strategies.
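
Something like this is what I picture -- a plain process, not a supervisor,
that just sleeps and retries forever instead of enforcing a MaxR/MaxT cutoff
(a sketch, not production code; StartChild is any fun that start_links a
child, e.g. fun() -> redis_client:start_link([]) end):

    retry_loop(StartChild, DelayMs) ->
        process_flag(trap_exit, true),   %% child exits arrive as messages
        case StartChild() of
            {ok, Pid} ->
                receive
                    {'EXIT', Pid, _Reason} ->
                        timer:sleep(DelayMs),
                        retry_loop(StartChild, DelayMs)
                end;
            {error, _Reason} ->
                timer:sleep(DelayMs),
                retry_loop(StartChild, DelayMs)
        end.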


> In the case of server connections, e.g. you almost always want to wait for
> some
> period before retrying, lest you DoS your servers :)
>

Yes.  This is the common situation referred to above ;)

>
> I'll do this -- thanks for pointing this out! This is a pretty central
> problem that I glossed over.  I'll update the github issue when it's pushed


Great!   Thanks!

(re followup message):

> What you observed I think is a very healthy problem -- a surprising
> catastrophic failure! This is a thing of beauty because it calls
> attention to a serious problem: you're relying on something that
> suddenly isn't working. Rather than lure you into a false sense of
> confidence, Erlang's default answer is to STOP. Now what? Dunno, but
> it got your attention :)

OK, that is another case of my expectations being wrong, but it calls into
question why I went to the trouble of learning Erlang in the first place
;)  Just about all other languages stop the program when something fails,
but I thought the idea of Erlang was to keep going no matter what!  Of
course I want it to get my attention, but I thought that it did that by
writing "CRASH REPORT" into the sasl log in ALL CAPS so I could grep for it
the next morning, instead of blowing up my entire service when a
non-essential part of the program hits a glitch.  The whole idea of
supervision, hot code patches, appups, relups, etc. is to be able to write
programs that are just NEVER DOWN, even if the code has bugs.  Anything
that makes the VM crash for reasons short of hardware meltdown seems wrong
from that point of view.

> If you start to look at your Erlang applications as ecosystems of
> independent services, you can start to think about shoring up each
> service to improve its availability, performance, etc. -- just as one
> might in a service oriented architecture. In the case of your Redis
> service, you want something that advocates for your Redis DB
> availability. That advocate (an Erlang process) can deal with
> connections, retries, error handling, etc. as an intelligent facade to
> the underlying Redis client process (which can come and go at any time
> depending on network, Redis server availability, etc.)
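
That's concrete enough that I can at least picture it.  Something along
these lines, I take it -- where redis:connect/0 and redis:q/2 are stand-ins
for the real client API, and the advocate absorbs the client's crashes
instead of propagating them:

    -module(redis_advocate).
    -behaviour(gen_server).
    -export([start_link/0, q/1]).
    -export([init/1, handle_call/3, handle_cast/2, handle_info/2,
             terminate/2, code_change/3]).

    -define(RETRY_MS, 5000).

    start_link() ->
        gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

    %% Callers get {error, unavailable} while redis is down, not a crash.
    q(Cmd) ->
        gen_server:call(?MODULE, {q, Cmd}).

    init([]) ->
        process_flag(trap_exit, true),   %% the client links to us
        self() ! connect,
        {ok, undefined}.

    handle_call({q, _Cmd}, _From, undefined) ->
        {reply, {error, unavailable}, undefined};
    handle_call({q, Cmd}, _From, Conn) ->
        {reply, redis:q(Conn, Cmd), Conn}.

    handle_cast(_Msg, State) ->
        {noreply, State}.

    handle_info(connect, undefined) ->
        case redis:connect() of
            {ok, Conn} ->
                {noreply, Conn};
            {error, _Reason} ->
                erlang:send_after(?RETRY_MS, self(), connect),
                {noreply, undefined}
        end;
    handle_info({'EXIT', _Pid, _Reason}, _Conn) ->
        %% client died: drop the connection, retry after a delay
        erlang:send_after(?RETRY_MS, self(), connect),
        {noreply, undefined};
    handle_info(_Msg, State) ->
        {noreply, State}.

    terminate(_Reason, _State) -> ok.
    code_change(_OldVsn, State, _Extra) -> {ok, State}.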

Of course I'd expect to have to shore up the individual services as
problems with them arose and patterns got identified, but in the meanwhile
I'd expect a service being down to just mean I couldn't use it.  Like if
Github is down for some reason, I'd expect to get a connection error if I
try to access it from the office, but I wouldn't expect Github's outage to
trigger the demolition of the whole building my office is in.  That's sort
of what start_link does, from what I gather.  The failure path should
instead be isolated by default, propagating only to the parts of the code
that actually depend on the thing that failed.  So it seems to me that using
start_link in a client isn't right unless the fates of the two interacting
processes are actually entwined in a way that neither one could conceivably
run without the other.  Other independent processes should be able to keep
running.
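
Roughly this, I mean -- a monitor instead of a link, so the client's death
is a message I can react to rather than a shared fate (redis_client being a
stand-in module again):

    watch_client() ->
        {ok, Pid} = gen_server:start(redis_client, [], []),  %% no link
        MRef = erlang:monitor(process, Pid),
        receive
            {'DOWN', MRef, process, Pid, Reason} ->
                error_logger:error_msg("redis client died: ~p~n", [Reason]),
                {error, Reason}
        end.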

Regards
Paul