[erlang-questions] distributed erlang machine crash problem

Thu Mar 26 14:42:42 CET 2015

Consider n machines (servers), with one Erlang node on each machine and
application X started on every node.
When one machine crashes, we want its X application to be restarted on
another machine (for example in a separate erl process on one of the n-1
remaining machines).
The problem is that we need the cache from the crashed node in order to
restart X in the state it had just before the machine crashed; we don't
want to lose any in-flight data.

We have two ideas for solving this problem:
- write an application (let's call it Y) that stores all the data needed to
restart an X application. There could be several instances of Y on several
machines, each of which has to be kept up to date with the X application's
state so that X can be restarted without loss of data,
- keep an exact mirror of the X application on several machines, with only
one instance responsive and the others only holding processes and state
data. The mirrors would just create mirror processes and keep their state
in sync with the master X application.
If the master X application crashes, one of the mirrors would just "unlock"
and run as master.
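
To make the first idea more concrete, here is a rough sketch of what Y
could look like. This is only an illustration under assumptions, not
something we have built: the module name y_store, the functions
checkpoint/3 and fetch/2, and the integer version counter are all
hypothetical. It assumes X can serialize its state into a single term and
calls checkpoint/3 after each change.

```erlang
%% Hypothetical sketch of application Y: a replicated snapshot store.
%% All names here (y_store, checkpoint/3, fetch/2) are illustrative.
-module(y_store).
-behaviour(gen_server).

-export([start_link/0, checkpoint/3, fetch/2]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Called by X after a state change: push a versioned snapshot to the
%% Y instance on every node in YNodes. abcast is asynchronous, so X is
%% not blocked, but delivery is not acknowledged.
checkpoint(YNodes, Version, State) ->
    gen_server:abcast(YNodes, ?MODULE, {checkpoint, node(), Version, State}),
    ok.

%% Called when restarting X elsewhere: ask the Y instances for the
%% snapshot of CrashedNode and keep the one with the highest version.
fetch(YNodes, CrashedNode) ->
    {Replies, _BadNodes} =
        gen_server:multi_call(YNodes, ?MODULE, {fetch, CrashedNode}, 5000),
    case [Snap || {_Node, {ok, Snap}} <- Replies] of
        []    -> not_found;
        Found -> {ok, lists:max(Found)}   % {Version, State} tuples
    end.

init([]) ->
    {ok, #{}}.   % map: source node => {Version, State}

handle_call({fetch, Node}, _From, Snapshots) ->
    Reply = case maps:find(Node, Snapshots) of
                {ok, Snap} -> {ok, Snap};
                error      -> not_found
            end,
    {reply, Reply, Snapshots}.

handle_cast({checkpoint, Node, Version, State}, Snapshots) ->
    %% Keep only the newest snapshot per source node.
    New = case maps:find(Node, Snapshots) of
              {ok, {Old, _}} when Old >= Version -> Snapshots;
              _ -> maps:put(Node, {Version, State}, Snapshots)
          end,
    {noreply, New}.
```

With the asynchronous cast, any update that has not yet reached a Y
instance when the machine dies would be lost; replacing abcast with a
synchronous multi_call would close that window at the cost of latency on
every update in X.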

In my opinion the second idea is worse than the first, because it requires
every operation to be executed two, three or more times (on the master and
on each mirror), which could cause a large performance loss, although
recovery would be immediate.
The first idea seems good, but the question is how fast a restart could be;
we have no experience with a solution like that.

What do you think about both solutions? Are there any proven solutions to
this problem?