<div dir="ltr">On Thu, Jun 6, 2013 at 9:49 AM, Garrett Smith <span dir="ltr"><<a href="mailto:g@rre.tt" target="_blank">g@rre.tt</a>></span> wrote:<br><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
>> redis:connect is a wrapper for the process start_link....<br></div>
The result {ok, Pid} simply means that the process has started and<br>
initialized. </blockquote><div><br></div><div>Hey, thanks again, and aha, OK, this makes sense in terms of how start_link behaves. It still comes as a surprise that redis:connect creates such a link, though. My program (typical for Erlang) accepts thousands of client connections, with a separate process for each one. It uses redis for user authentication: when a new connection arrives, there is a redis lookup, and then the connection stays open "forever" (weeks, months...) without accessing redis again. If redis is down, I'd expect new authentication attempts to fail, but old connections to keep running as long as they don't try to talk to redis. Redis is never supposed to be down, so I could imagine this program running for a long time without ever encountering the issue. I don't have an automatic test for handling redis being down (I guess I should write one); I noticed this with a manual test. But I thought part of the idea of OTP was to handle problems gracefully, including software bugs resulting from error conditions that weren't thought of during testing.<br>
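For what it's worth, one way a per-connection process can survive the linked client's death is to trap exits around the lookup. A minimal sketch, assuming redis:connect/0 behaves as described in this thread; do_lookup/2 and the return shapes are hypothetical:

```erlang
%% Sketch: an authentication lookup that survives the linked Redis
%% client. Only redis:connect/0 is from the thread; do_lookup/2 and
%% the error tuple are made-up names for illustration.
auth_lookup(User) ->
    process_flag(trap_exit, true),  % an exit from a linked process now
                                    % arrives as {'EXIT', Pid, Reason}
                                    % instead of killing this process
    case redis:connect() of
        {ok, Client} ->
            do_lookup(Client, User);
        {error, Reason} ->
            %% Redis down: only this authentication attempt fails
            {error, {redis_unavailable, Reason}}
    end.
```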
<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
I didn't implement any of the start functions for the client process.<br>
It's arguably bad form to not implement parallel start functions to<br>
each start_link -- but the cases where start_link is *wrong* are quite<br>
rare (none occur to me actually :)<br></blockquote><div><br></div><div>start_link seems wrong to me in this case, but I still find OTP somewhat murky, so I can't claim to have good intuition about what it should or shouldn't do. <br>
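For anyone following along, the "parallel start functions" convention mentioned above usually looks like this for a gen_server-based client (connect_nolink is a hypothetical name):

```erlang
%% Sketch: the conventional pair of entry points for a client process.
%% gen_server:start_link/3 links the new process to the caller, so they
%% share fates; gen_server:start/3 starts it without a link.
connect() ->
    gen_server:start_link(?MODULE, [], []).

connect_nolink() ->
    gen_server:start(?MODULE, [], []).
```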
</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div><br>
</div>Yes, though it's not the end of the world to handle crashed processes<br>
explicitly -- supervisors have extremely naive recovery policies and<br>
it's not uncommon to have to implement something fancier. </blockquote><div><br></div><div>I wonder if what's really going on here is a gap in the available supervision strategies. I was a bit surprised to learn from the supervisor docs,<br>
<br><div style="margin-left:40px">"To prevent a supervisor from getting into an infinite loop of
child process terminations and restarts, a <i>maximum restart frequency</i> is defined using two integer values <code>MaxR</code>
and <code>MaxT</code>. If more than <code>MaxR</code> restarts occur within
<code>MaxT</code> seconds, the supervisor terminates all child
processes and then itself."<br></div><br></div><div>I had somehow thought that if MaxR restarts happened in MaxT seconds, the supervisor would just sleep until MaxT seconds had passed, then start retrying again (i.e. limit the frequency rather than the absolute count). It does seem to me there should be an option for something like that. Restarts due to external process failures seem like a common situation to want to deal with, so (in my naive expectation) I'd think OTP should implement an infinite sleep/retry loop as one of its restart strategies.<br>
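Concretely, the MaxR/MaxT values from the quoted docs appear in a supervisor's init/1. A sketch (redis_client is a hypothetical child module): with these values, a fourth restart within any ten-second window terminates the supervisor itself.

```erlang
%% Sketch of a supervisor init/1 using the quoted MaxR/MaxT mechanism.
%% redis_client is a hypothetical child module name.
init([]) ->
    MaxR = 3,   % at most 3 restarts...
    MaxT = 10,  % ...within any 10-second window, or the supervisor dies
    {ok, {{one_for_one, MaxR, MaxT},
          [{redis_client,                          % child id
            {redis_client, start_link, []},        % {Module, Function, Args}
            permanent, 5000, worker, [redis_client]}]}}.
```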
</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
In the case
of server connections, e.g. you almost always want to wait for some<br>
period before retrying, lest you DoS your servers :)<br></blockquote><div><br></div><div>Yes. This is the common situation referred to above ;)<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>I'll do this -- thanks for pointing this out! This is a pretty central<br>
problem that I glossed over. I'll update the github issue when it's pushed</blockquote><div><br></div><div>Great! Thanks!<br><br></div><div>(re followup message):<br><br><div style="margin-left:40px">
What you observed I think is a very healthy problem -- a surprising<br>
catastrophic failure! This is a thing of beauty because it calls<br>
attention to a serious problem: you're relying on something that<br>
suddenly isn't working. Rather than lure you into a false sense of<br>
confidence, Erlang's default answer is to STOP. Now what? Dunno, but<br>
it got your attention :)<br><br></div>OK, that is another case of my expectations being wrong, but it calls into question why I went to the trouble of learning Erlang in the first place ;) Just about all other languages stop the program when something fails, but I thought the idea of Erlang was to keep going no matter what! Of course I want it to get my attention, but I thought that it did that by writing "CRASH REPORT" into the sasl log in ALL CAPS so I could grep for it the next morning, instead of blowing up my entire service when a non-essential part of the program hits a glitch. The whole idea of supervision, hot code patches, appups, relups, etc. is to be able to write programs that are just NEVER DOWN, even if the code has bugs. Anything that makes the VM crash for reasons short of hardware meltdown seems wrong from that point of view.<br>
<br><div style="margin-left:40px">
If you start to look at your Erlang applications as ecosystems of<br>
independent services, you can start to think about shoring up each<br>
service to improve its availability, performance, etc. -- just as one<br>
might in a service oriented architecture. In the case of your Redis<br>
service, you want something that advocates for your Redis DB<br>
availability. That advocate (an Erlang process) can deal with<br>
connections, retries, error handling, etc. as an intelligent facade to<br>
the underlying Redis client process (which can come and go at any time<br>
depending on network, Redis server availability, etc.)<br><br></div>Of course I'd expect to have to shore up the individual services as problems with them arose and patterns were identified, but in the meantime I'd expect a service being down to just mean I couldn't use it. Like if Github is down for some reason, I'd expect to get a connection error if I try to access it from the office, but I wouldn't expect Github's outage to set off a demolition for the whole building my office is in. That's sort of what start_link does, from what I gather. The failure path should instead be isolated by default, propagating only to the parts of the code that actually depend on the things that failed. So it seems to me that using start_link in a client isn't right, unless the fates of the two interacting processes are actually entwined in a way that neither one could conceivably run without the other. Other independent processes should be able to keep running.<br>
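For the record, here is a sketch of the "advocate" idea from the quoted text, as I understand it (all module and function names beyond redis:connect/0 are hypothetical, including redis:q/2): a gen_server that owns the connection, traps exits instead of dying with the client, answers {error, disconnected} while Redis is down, and retries with a delay so it doesn't DoS the server.

```erlang
%% Sketch of an "advocate"/facade process for a Redis connection.
%% redis:connect/0 is from the thread; redis:q/2 and everything else
%% here are hypothetical names for illustration.
-module(redis_advocate).
-behaviour(gen_server).

-export([start_link/0, q/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

-define(RETRY_MS, 5000).   % wait before reconnecting, to avoid a DoS loop

-record(state, {client :: pid() | undefined}).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Callers get {error, disconnected} instead of crashing when Redis is down.
q(Cmd) ->
    gen_server:call(?MODULE, {q, Cmd}).

init([]) ->
    process_flag(trap_exit, true),   % survive the linked client's exit
    {ok, connect(#state{})}.

connect(State) ->
    case catch redis:connect() of
        {ok, Client} ->
            State#state{client = Client};
        _Error ->
            erlang:send_after(?RETRY_MS, self(), reconnect),
            State#state{client = undefined}
    end.

handle_call({q, _Cmd}, _From, #state{client = undefined} = State) ->
    {reply, {error, disconnected}, State};
handle_call({q, Cmd}, _From, #state{client = Client} = State) ->
    {reply, redis:q(Client, Cmd), State}.   % hypothetical query call

handle_info(reconnect, State) ->
    {noreply, connect(State)};
handle_info({'EXIT', Client, _Reason}, #state{client = Client} = State) ->
    %% The client died: schedule a delayed reconnect, stay alive.
    erlang:send_after(?RETRY_MS, self(), reconnect),
    {noreply, State#state{client = undefined}};
handle_info(_Other, State) ->
    {noreply, State}.

handle_cast(_Msg, State) -> {noreply, State}.
terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.
```

The key design choice is that the advocate owns the fragile link, so the rest of the system only ever sees a stable registered name and an {error, disconnected} return value.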
<br></div><div>Regards<br>Paul<br></div></div></div></div>