[erlang-questions] question how to recover a 'stateful' app when Erlang node crashes?

Tue Jan 17 16:34:35 CET 2012

On Mon, Jan 16, 2012 at 3:52 PM, Roman Shestakov
<romanshestakov@REDACTED> wrote:
> hello,
>
> I have a problem with Erlang VM running out of memory and dying. With this I
> have a question about correct recovery mechanisms which involve multiple
> VMs.
>
> what is the correct way to recover "stateful" Erlang application? In my
> case, the app. which is crashing is a complex hierarchy of fsm_processes
> each containing certain state. I understand how to recover stateless
> processes with supervisors but what is the correct way to recovery stateful
> apps? Clearly in my case I probably need some kind of supervisor 'node' but
> what would be the steps to correctly recover killed processes with their
> states? do I need to use a db and replay the processes from disk on another
> node or can I have a node with identical processes hierarchy?

This is a very good question. I'd love to hear from the seasoned vets
on this one.

My take is, first, this problem highlights the cost of state in the
first place. The best solution is to avoid recovering state
altogether. This usually means pushing more responsibility to your
users. An academic but illustrative example is
the web session state problem: rather than spend incredible energy on
the back end managing client sessions, use client cookies.

Of course, if that's not something you can wriggle out of, you have
this problem.

I generally use dets tables to store process state, which then can be
used for recovery on process startup. I'll typically use one dets file
per process. This is pretty easy -- you just save off your state every
time it's updated.
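
To make the save-on-every-update idea concrete, here is a minimal sketch,
not a definitive implementation. The module name counter_store, the
`state` key, and the integer counter standing in for real process state
are all made up for illustration; only the dets and gen_server calls are
real OTP APIs.

```erlang
-module(counter_store).
-behaviour(gen_server).

-export([start_link/1, increment/0, value/0]).
-export([init/1, handle_call/3, handle_cast/2, terminate/2]).

%% File is the path of the dets file owned by this one process.
start_link(File) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, File, []).

increment() -> gen_server:call(?MODULE, increment).
value()     -> gen_server:call(?MODULE, value).

init(File) ->
    %% Trap exits so terminate/2 runs on an orderly supervisor shutdown.
    process_flag(trap_exit, true),
    {ok, Tab} = dets:open_file(?MODULE, [{file, File}, {type, set}]),
    %% Recover the last saved state, or start fresh on first run.
    Count = case dets:lookup(Tab, state) of
                [{state, Saved}] -> Saved;
                []               -> 0
            end,
    {ok, {Tab, Count}}.

handle_call(increment, _From, {Tab, Count}) ->
    New = Count + 1,
    %% Save the state every time it changes, before replying.
    ok = dets:insert(Tab, {state, New}),
    ok = dets:sync(Tab),
    {reply, New, {Tab, New}};
handle_call(value, _From, {_Tab, Count} = State) ->
    {reply, Count, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

terminate(_Reason, {Tab, _Count}) ->
    dets:close(Tab).
```

Calling dets:sync/1 on every update trades throughput for durability;
dropping it and relying on the table's auto_save interval is the cheaper
option if losing the last few updates is acceptable.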

As for the recovery, I think you have two options: the recovered
process can restore its own state, or another process can restore it.
This depends on how long it takes to restore the state and whether
your app can tolerate the recovered process being unavailable for that
period. If it's too complex or time consuming, use a separate process
to do that work, feeding the recovered process state appropriately.
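
The second option might look like the sketch below, under some
assumptions: the module name lazy_restore, the `recovering` placeholder
state, and the `{restored, Items}` message are hypothetical, and the
"state" is simply every object found in the dets file. The worker starts
immediately and answers `{error, recovering}` until a linked helper
process has read the file and handed the data over.

```erlang
-module(lazy_restore).
-behaviour(gen_server).

-export([start_link/1, get/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link(File) ->
    gen_server:start_link(?MODULE, File, []).

get(Pid) ->
    gen_server:call(Pid, get).

init(File) ->
    Self = self(),
    %% The helper does the slow read; init/1 returns right away.
    %% It is linked, so a failed restore also takes the worker down
    %% and the supervisor can retry.
    spawn_link(fun() -> Self ! {restored, load(File)} end),
    {ok, recovering}.

%% Runs in the helper process: read every object from the dets file.
load(File) ->
    {ok, Tab} = dets:open_file({?MODULE, File}, [{file, File}]),
    Items = dets:foldl(fun(Obj, Acc) -> [Obj | Acc] end, [], Tab),
    ok = dets:close(Tab),
    Items.

handle_call(get, _From, recovering) ->
    {reply, {error, recovering}, recovering};
handle_call(get, _From, {ready, Items} = State) ->
    {reply, {ok, Items}, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% The helper feeds the recovered state to the worker.
handle_info({restored, Items}, recovering) ->
    {noreply, {ready, Items}};
handle_info(_Other, State) ->
    {noreply, State}.
```

If the restore is quick, the same load/1 call can simply be made inside
init/1, which is the "process restores its own state" variant.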

In the interest of separating concerns, it's probably always better to
pair the process with a separate "state serializer" process, though I
often just bake that into the process itself if it's simple/fast
enough.
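
A paired "state serializer" could be sketched roughly as follows. The
module name state_serializer and the save/3 and load/2 functions are
hypothetical names chosen for this example; the owning process would call
save/3 after each state change and load/2 from its own init.

```erlang
-module(state_serializer).
-behaviour(gen_server).

-export([start_link/1, save/3, load/2]).
-export([init/1, handle_call/3, handle_cast/2, terminate/2]).

start_link(File) ->
    gen_server:start_link(?MODULE, File, []).

%% Asynchronous: the owning process does not wait for the disk write.
save(Serializer, Key, State) ->
    gen_server:cast(Serializer, {save, Key, State}).

%% Synchronous: returns {ok, State} or not_found.
load(Serializer, Key) ->
    gen_server:call(Serializer, {load, Key}).

init(File) ->
    process_flag(trap_exit, true),
    {ok, Tab} = dets:open_file({?MODULE, File}, [{file, File}]),
    {ok, Tab}.

handle_call({load, Key}, _From, Tab) ->
    Reply = case dets:lookup(Tab, Key) of
                [{Key, State}] -> {ok, State};
                []             -> not_found
            end,
    {reply, Reply, Tab}.

handle_cast({save, Key, State}, Tab) ->
    ok = dets:insert(Tab, {Key, State}),
    {noreply, Tab}.

terminate(_Reason, Tab) ->
    dets:close(Tab).
```

Because save/3 is a cast, the owner never blocks on disk, but an update
still sitting in the serializer's mailbox is lost if the node dies first.
Switching save/3 to gen_server:call closes that window at the cost of
latency.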

Garrett