[erlang-questions] supervisor with obstinate restart policy — are there any implementations?
Wed Jun 27 21:18:00 CEST 2012
On 06/26/2012 12:43 PM, Max Lapshin wrote:
> I think that many people have run into this problem with OTP supervisors:
> if your supervisor has to work with an external resource and that
> resource is down, the whole system is brought down after a few restarts.
> I think that there are many implementations of trackers that restart
> such jobs and thus reimplement OTP supervisors.
> Has anyone implemented a supervisor that is OTP-compatible and doesn't
> fail on frequent worker restarts,
> but instead restarts less and less frequently?

I did some work previously on adding incremental backoff to the OTP
supervisors, but in the case you describe, I think that what you need
is not a special supervisor but a Circuit Breaker (see
http://en.wikipedia.org/wiki/Circuit_breaker_design_pattern). The idea
behind supervision is that a restart will often fix temporary problems
and glitches by resetting the workers to a known good state. But when
you depend on an external resource, your supervisor cannot restart that
resource - it can only restart your connection to it. If that wasn't
the problem, you're

One way of implementing a circuit breaker in Erlang is as a separate
server that acts as a middleman. Anybody who wants to call the external
service has to make the request via the circuit breaker. The circuit
breaker runs the jobs, tracks their status, detects timeouts, logs
warnings, and can decide to block further requests for a while if the
external resource seems to be misbehaving, so your logs don't get
flooded by a million workers simultaneously discovering that your SMS
provider (or whatever) is unavailable. You should also be able to query
the circuit breaker about the current status of the resources it
monitors, force block/unblock, etc.
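
As a rough illustration of the middleman described above, here is a
minimal sketch written as a gen_server. Everything in it is an
assumption made for the example, not an existing OTP or library API:
the module name `breaker`, the functions run/1, status/0, block/0 and
unblock/0, the one-second job timeout, and the rule "block after
MaxFailures consecutive crashes or timeouts, unblock after CooldownMs".
To stay short it runs one job at a time, which a real breaker would
probably not want.

```erlang
%% Hypothetical module: names, thresholds and timeouts are illustrative
%% assumptions, not an existing OTP or library API.
-module(breaker).
-behaviour(gen_server).

-export([start_link/2, run/1, status/0, block/0, unblock/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(JOB_TIMEOUT_MS, 1000).  % assumed per-job time limit

-record(state, {status = closed, failures = 0, max, cooldown, timer}).

%% MaxFailures: consecutive failures before requests are blocked.
%% CooldownMs: how long to stay blocked before accepting requests again.
start_link(MaxFailures, CooldownMs) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE,
                          {MaxFailures, CooldownMs}, []).

%% Fun is a zero-arity fun that talks to the external resource.
run(Fun)  -> gen_server:call(?MODULE, {run, Fun}, infinity).
status()  -> gen_server:call(?MODULE, status).
block()   -> gen_server:call(?MODULE, block).
unblock() -> gen_server:call(?MODULE, unblock).

init({Max, Cooldown}) ->
    {ok, #state{max = Max, cooldown = Cooldown}}.

%% While open (blocked), reject at once without touching the resource.
handle_call({run, _Fun}, _From, S = #state{status = open}) ->
    {reply, {error, blocked}, S};
handle_call({run, Fun}, _From, S = #state{failures = N, max = Max}) ->
    case run_job(Fun) of
        {ok, _} = Ok ->
            {reply, Ok, S#state{failures = 0}};
        {error, Reason} = Err when N + 1 >= Max ->
            error_logger:warning_msg(
              "breaker: blocking after ~p failures, last: ~p~n",
              [N + 1, Reason]),
            T = erlang:start_timer(S#state.cooldown, self(), reset),
            {reply, Err, S#state{status = open, failures = 0, timer = T}};
        Err ->
            {reply, Err, S#state{failures = N + 1}}
    end;
handle_call(status, _From, S) ->
    {reply, S#state.status, S};
handle_call(block, _From, S0) ->
    %% Forced block: stays open until unblock/0 is called.
    S = cancel_timer(S0),
    {reply, ok, S#state{status = open}};
handle_call(unblock, _From, S0) ->
    S = cancel_timer(S0),
    {reply, ok, S#state{status = closed, failures = 0}}.

handle_cast(_Msg, S) ->
    {noreply, S}.

%% Cooldown expired: accept requests again. Stale timers are ignored
%% because their reference no longer matches the one in the state.
handle_info({timeout, T, reset}, S = #state{timer = T}) ->
    {noreply, S#state{status = closed, failures = 0, timer = undefined}};
handle_info(_Other, S) ->
    {noreply, S}.

cancel_timer(S = #state{timer = undefined}) -> S;
cancel_timer(S = #state{timer = T}) ->
    erlang:cancel_timer(T),
    S#state{timer = undefined}.

%% Run the job in its own monitored process so that a crash or a hang
%% in the job cannot take the breaker down with it.
run_job(Fun) ->
    {Pid, Ref} = spawn_monitor(fun() -> exit({done, Fun()}) end),
    receive
        {'DOWN', Ref, process, Pid, {done, Result}} -> {ok, Result};
        {'DOWN', Ref, process, Pid, Reason}         -> {error, Reason}
    after ?JOB_TIMEOUT_MS ->
        erlang:demonitor(Ref, [flush]),
        exit(Pid, kill),
        {error, timeout}
    end.
```

With this sketch, a caller would write something like
breaker:run(fun() -> send_sms(Msg) end), where send_sms/1 stands in for
whatever talks to the provider. It gets back {ok, Result},
{error, Reason} or, while the breaker is open, {error, blocked}, so
only the breaker logs the outage rather than every worker.
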

A generic circuit breaker would be a nice addition to the Erlang
libraries, but it's not a supervisor - it's a service, which itself
needs to be supervised (because the rest of the system depends on it).