[erlang-questions] distributed erlang machine crash problem
Fri Mar 27 07:51:05 CET 2015
> On 26 Mar 2015, at 14:42, Przemysław Wycisk <p.wycisk@REDACTED> wrote:
> We have two ideas how to solve that problem:
> - write an application (let's call it Y), which can store all the data needed to restart the X application; there could be several instances of Y on several machines, which would have to be kept up to date with the X application's state in order to restart it without loss of data,
> - keep an exact mirror of the X application on several machines, with only one responsive and the others only keeping processes and state data. The mirrors would just create mirror processes and keep their state up to date with the master X application.
> If the master X application crashes, one of the mirrors would just "unlock" and run as master.
What I suggest you do first is to analyze your system requirements from a failure recovery standpoint: what level of service outage can you live with?
Mirroring state in real-time is of course possible, but as you’ve noted, very expensive.
In the old AXD 301, the basic redundancy mechanism was asynchronous stable-state replication, combined with an audit after failure. In that particular system, we had about 15 seconds before protocol timers would start firing and a service outage would become noticeable*. We went through the various data items and decided which data could be reconstituted from other data, and which couldn't. We also assessed the probability of certain failures and calculated the amount of data loss that would be acceptable given the availability level we were aiming for.
In other words, we had neither a hot nor a cold standby, but something in between: a warm standby.
* This was possible because the actual data stream handling was physically separate from the control layer.
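
To make the pattern concrete, here is a minimal sketch of asynchronous stable-state replication with an audit on takeover. It is not the AXD 301 implementation; it is written in Python only to keep it short, and every name in it (Primary, Standby, the "call" records, the by_port index, flush) is a hypothetical stand-in. It assumes that only records that have reached a stable state are worth copying, that the copy is sent without waiting for the standby, and that anything derivable is rebuilt by the standby when it takes over.

```python
from collections import deque


class Standby:
    """Warm standby: holds the last replicated stable state, serves no traffic."""

    def __init__(self):
        self.calls = {}      # call_id -> stable record copied from the primary
        self.by_port = {}    # derived index (hypothetical), rebuilt in the audit
        self.active = False

    def apply(self, call_id, record):
        # A record of None means the call was released on the primary.
        if record is None:
            self.calls.pop(call_id, None)
        else:
            self.calls[call_id] = record

    def take_over(self):
        # Audit after failure: rebuild whatever can be reconstituted from
        # the replicated records, then start serving.
        self.by_port = {}
        for call_id, record in self.calls.items():
            self.by_port.setdefault(record["port"], set()).add(call_id)
        self.active = True


class Primary:
    def __init__(self, standby):
        self.standby = standby
        self.calls = {}
        self.outbox = deque()   # updates not yet shipped to the standby

    def setup_started(self, call_id, port):
        # Transient state: deliberately not replicated.
        self.calls[call_id] = {"port": port, "state": "setting_up"}

    def setup_done(self, call_id):
        # Stable state reached: queue a copy, but do not wait for the standby.
        self.calls[call_id]["state"] = "connected"
        self.outbox.append((call_id, dict(self.calls[call_id])))

    def release(self, call_id):
        record = self.calls.pop(call_id)
        if record["state"] == "connected":
            self.outbox.append((call_id, None))

    def flush(self):
        # Stands in for the asynchronous replication link.
        while self.outbox:
            self.standby.apply(*self.outbox.popleft())
```

If the primary dies between two flushes, the standby comes up with only the stable records it has already received. Calls still being set up, and stable updates still sitting in the outbox, are lost. That window is the data loss you have to weigh against your availability target.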
Ulf Wiger, Co-founder & Developer Advocate, Feuerlabs Inc.