[erlang-questions] Supervisor maximum restart frequency under high load

Mon Sep 24 14:45:43 CEST 2012

On Sep 24, 2012, at 1:27 PM, Ilyushonak Barys <barys_ilyushonak@REDACTED> wrote:
> I kindly ask you to help me clarify how I see the Erlang design pattern for the following issue.
> I have a server Erlang process called A, and many similar client processes, called B. The A process has two states – “ok” and “recovery”. While it is “ok”, the clients can use the A API and do their work. While A is in “recovery”, the clients get an error and are restarted by the application supervisor.
> My problem starts when I have a lot of B processes and the “recovery” state of A takes time. The supervisor of B hits the “Maximum Restart Frequency” and fails.
> Since I am using the application supervisor, this causes my OTP application to restart.
>  
> What is the best way to fix it? Use a separate supervisor for the B processes? Should I handle the reply from A in the “recovery” state manually (what about fail early and often)?

In this case you should handle it, I think.

You have an expected behaviour here: A may say "in recovery", and while A is recovering, the B processes act in a specific way. They may close down, but they should do so with a 'normal' exit, and you should make them transient in the supervisor. A transient child that exits with reason 'normal' is not restarted, so these exits do not push the supervisor over the maximum restart frequency.
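As a minimal sketch of the idea above, a B process could look roughly like this. The names are assumptions made for illustration and are not from the original question: A is assumed to be registered as a_server and to answer a call with either {ok, Result} or {error, recovery}, and child_spec/0 and work/1 are hypothetical helpers.

```erlang
-module(b_client).
-behaviour(gen_server).

-export([start_link/0, child_spec/0, work/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

%% Child specification to hand to B's supervisor (hypothetical helper).
%% 'transient' means: restart only after an abnormal exit.
child_spec() ->
    {?MODULE, {?MODULE, start_link, []}, transient, 5000, worker, [?MODULE]}.

start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% Ask this B process to do one unit of work against A.
work(Pid) ->
    gen_server:cast(Pid, work).

init([]) ->
    {ok, no_state}.

handle_cast(work, State) ->
    %% a_server and the reply shapes are assumptions for this sketch.
    case gen_server:call(a_server, request) of
        {ok, _Result} ->
            {noreply, State};
        {error, recovery} ->
            %% Expected condition: stop with reason 'normal', which a
            %% transient child may do without being restarted.
            {stop, normal, State}
        %% Any other reply is unexpected: no clause matches, the
        %% process crashes, and the supervisor restarts it.
    end.

handle_call(_Request, _From, State) -> {reply, ok, State}.
handle_info(_Msg, State) -> {noreply, State}.
terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.
```

Note that a transient child which exits normally stays down, so in a design like this something else has to start the B processes again once A is back in the "ok" state.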

The key point about failing early and often is that it applies to behaviour which is unexpected in the system. In the unexpected case, you have no way of knowing how to recover the process, so all you can do is crash and let restarts handle the erroneous case. But if the behaviour is expected, then you can choose to handle it explicitly.

Say I have a TCP connection. It may fail in spectacular ways. In the beginning, when I am developing the program, I could just let the system crash whenever such a failure happens. Basically, I have only implemented the "good happy path" through the program. Then I see crashes in the crash log and can begin handling each of them by closing down the socket and exiting gracefully, reconnecting and so on. The idea is that you keep the system deliberately underspecified until you realize what kinds of errors are common. Then you handle those to quiet your crash log. This is important going forward, so that you can tell real errors in the log apart from benign ones.
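To make the TCP example a bit more concrete, here is a rough sketch of what a partly specified client might look like. The module name tcp_fetch, the function fetch/3, the 5 second timeouts and the choice of which errors count as "common" are all assumptions for illustration. Only the errors that have already been seen in the crash log are handled. Everything else still crashes.

```erlang
-module(tcp_fetch).
-export([fetch/3]).

%% Connect, send one request, read one reply, and close the socket.
fetch(Host, Port, Request) ->
    case gen_tcp:connect(Host, Port, [binary, {active, false}], 5000) of
        {ok, Sock} ->
            try
                %% Still on the happy path: a failed send is a badmatch
                %% and crashes the caller.
                ok = gen_tcp:send(Sock, Request),
                recv(Sock)
            after
                gen_tcp:close(Sock)
            end;
        {error, Reason} when Reason =:= econnrefused; Reason =:= timeout ->
            %% Errors assumed to be common and benign: handled explicitly
            %% so they stay out of the crash log.
            {error, {unavailable, Reason}}
        %% Any other connect error matches no clause and crashes.
    end.

recv(Sock) ->
    case gen_tcp:recv(Sock, 0, 5000) of
        {ok, Data}       -> {ok, Data};
        {error, closed}  -> {error, closed};
        {error, timeout} -> {error, timeout}
        %% Other recv errors are left unhandled on purpose.
    end.
```

A caller can then decide what to do with {error, {unavailable, _}}, for example wait and reconnect, while an error nobody anticipated still shows up as a crash report.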

Jesper Louis Andersen
  Erlang Solutions Ltd., Copenhagen