call right after failing cast

Thu Apr 30 00:27:49 CEST 2020

On Wed, Apr 29, 2020 at 10:53 PM Nato <nato@REDACTED> wrote:
>
> Hi list,
>
> I have some sequential code that casts to a
> gen_server, whereby this cast fails, say
> 50% of the time, by design. On the following
> line of code, I make a call to the same
> gen_server, and I'm getting errors (sorry,
> the exact error isn't available as I'm using
> elli webserver without the logging middleware).
>
> When I put a `timer:sleep(300)` between the
> cast and call, I never get errors. This is
> not what I want to do, but it pointed to
> something that I'm confused about.
>
> If a registered gen_server (really, nothing
> fancy going on with it, and its init is
> very straightforward) falls over, how does one deal
> with waiting for it to come back online?
>
> I thought messages would sit in /some/
> mailbox somewhere, not just error out.

The mailbox buffering messages to a process is part of the state for
that process, not a separate object (aka channel).  If the target
process terminates for whatever reason its mailbox disappears.  You
cannot message a non-existing process.
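
As a minimal sketch of what that looks like from the caller's side
(no_such_server is a hypothetical name, assumed not to be registered):
a cast to a missing process returns ok and the message is simply lost,
while a call, which needs a reply, exits with a noproc reason.

```erlang
-module(noproc_demo).
-export([run/0]).

%% no_such_server is a hypothetical name; this sketch assumes that
%% no process is registered under it.
run() ->
    undefined = whereis(no_such_server),
    %% cast is fire-and-forget: it returns ok even though there is
    %% no process (and hence no mailbox) to receive the message.
    ok = gen_server:cast(no_such_server, ping),
    %% call waits for a reply, so the missing process is detected
    %% and the caller exits with a noproc reason.
    try gen_server:call(no_such_server, ping) of
        Reply -> {unexpected_reply, Reply}
    catch
        exit:{noproc, _} -> noproc
    end.
```

This would explain why the cast itself appears to succeed and the
error only shows up on the call that follows it.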

If you expect the gen_server to crash and be restarted by its
supervisor on a regular basis, then you need to add retry logic to its
callers.  You could wrap the call or cast with code to do a few
retries in case of noproc errors, and possibly also in case of calls
getting timeouts (to cater for the case where the server was alive
when the message was sent, but restarted before replying).  This can
all be hidden from client code in the bodies of the API functions
towards the gen_server.
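
A rough sketch of such a wrapper follows.  The module name retry_call,
the retry count and the delay are all made up for illustration; nothing
here is an OTP API apart from gen_server:call/2 and timer:sleep/1.

```erlang
-module(retry_call).
-export([call/2, call/4]).

%% Hypothetical helper, not part of OTP.  The defaults (3 retries,
%% 100 ms apart) are arbitrary illustrative values.
call(Server, Request) ->
    call(Server, Request, 3, 100).

%% Retries: number of further attempts allowed after a failure.
%% DelayMs: pause between attempts, to give the supervisor time to
%% restart the server.
call(Server, Request, Retries, DelayMs) ->
    try
        gen_server:call(Server, Request)
    catch
        %% Server is not registered (yet): it may be mid-restart.
        exit:{noproc, _} when Retries > 0 ->
            retry(Server, Request, Retries, DelayMs);
        %% No reply within the default 5 second call timeout.
        exit:{timeout, _} when Retries > 0 ->
            retry(Server, Request, Retries, DelayMs)
        %% Any other exit, or running out of retries, propagates
        %% to the caller unchanged.
    end.

retry(Server, Request, Retries, DelayMs) ->
    timer:sleep(DelayMs),
    call(Server, Request, Retries - 1, DelayMs).
```

An API function in the server's own module could then use
retry_call:call(?MODULE, Request) instead of calling gen_server:call
directly.  Note that retrying after a timeout may run the request
twice if the server did process the first attempt, so this is only
safe for requests that can be repeated.  Also, if the server dies
while a call is pending, the caller normally exits with the server's
own exit reason rather than with timeout; this sketch does not retry
in that case.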

While it's possible to hide some of the effects of crashes/restarts,
there are costs and risks associated with them, so it's still
preferable to avoid crashing servers that have non-trivial
availability requirements.