[erlang-questions] Newbie question, finite state Machine failover

Wed Sep 7 16:42:49 CEST 2011

I should also say that some protocols allow for a "recovery window", which further simplifies stable-state replication. I know that some people will overlook this and try for hottest possible redundancy, but in several of the products I've been involved in, this property has been crucial.
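To make "stable-state replication" a little more concrete, here is a minimal sketch, written in Python purely for illustration rather than as Erlang/OTP code. The state names, the `Checkpoint` record, and the `replicate` callback are all hypothetical; the only point is that checkpoints are sent on entry to a stable state, and that the replicated record is an explicit format rather than a copy of the FSM's internal state.

```python
from dataclasses import dataclass, asdict

# Hypothetical call-control states. Only the stable ones are worth replicating;
# "connecting" and "releasing" are volatile transition states.
STABLE_STATES = {"idle", "connected"}
TRANSITIONS = {
    ("idle", "setup"): "connecting",
    ("connecting", "answer"): "connected",
    ("connecting", "reject"): "idle",
    ("connected", "hangup"): "releasing",
    ("releasing", "released"): "idle",
}


@dataclass
class Checkpoint:
    """Explicit replication format (assumed), not a dump of the FSM object."""
    version: int
    session_id: str
    state: str


class SessionFsm:
    def __init__(self, session_id, replicate, state="idle"):
        self.session_id = session_id
        self.state = state
        self._replicate = replicate  # callable that ships a dict to the standby

    def handle(self, event):
        next_state = TRANSITIONS.get((self.state, event))
        if next_state is None:
            raise ValueError(f"unexpected event {event!r} in state {self.state!r}")
        self.state = next_state
        # Replicate only when a stable state has been reached.
        if next_state in STABLE_STATES:
            self._replicate(asdict(Checkpoint(1, self.session_id, next_state)))

    @classmethod
    def recover(cls, checkpoint, replicate):
        # The standby validates the record instead of trusting it blindly.
        if checkpoint.get("version") != 1 or checkpoint.get("state") not in STABLE_STATES:
            raise ValueError("unusable checkpoint")
        return cls(checkpoint["session_id"], replicate, checkpoint["state"])
```

A session that fails while in "connecting" would simply be recovered from its last stable checkpoint (or not at all), which is where a protocol-level recovery window helps.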

If you have resources allocated in hardware (e.g. forwarding engines, DSPs, etc.), you will need to audit the resource management situation after a failure. If session state information is missing at some level, this may force you to release the entire session, even if most of the data is still there. Traditional Telecoms protocols expect sessions to have a data plane component and a control plane component, and will tolerate temporary loss of the control signaling. Normally, you will have 10-15 seconds to get your house in order and respond sanely to status inquiries.
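As a rough illustration of such an audit, the following Python sketch compares what the replicated session state claims against what the hardware reports, and releases anything that does not match. The function name, the two dictionaries, and the release callbacks are assumptions made for the example, and the 10-second window is just the lower end of the figure above.

```python
import time

RECOVERY_WINDOW_S = 10.0  # assumption: lower end of the 10-15 s mentioned above


def audit_after_failover(sessions, allocated, release_session, release_resource,
                         window_s=RECOVERY_WINDOW_S, clock=time.monotonic):
    """Reconcile replicated session state with hardware allocations.

    sessions:  session_id -> set of resource ids the replicated state expects
    allocated: resource_id -> owning session_id, as reported by the hardware
    Returns the ids of the sessions that survived the audit.
    """
    deadline = clock() + window_s
    kept = []
    for session_id, expected in sessions.items():
        if clock() > deadline:
            raise TimeoutError("audit did not finish inside the recovery window")
        owned = {res for res, owner in allocated.items() if owner == session_id}
        if owned == expected:
            kept.append(session_id)
            continue
        # State is incomplete at some level: give up the whole session.
        release_session(session_id)
        for res in sorted(owned):
            release_resource(res)
    # Resources held for sessions the control plane no longer knows about.
    for res, owner in sorted(allocated.items()):
        if owner not in sessions:
            release_resource(res)
    return kept
```

Whether a mismatch should really cost the entire session is a policy decision; the sketch takes the conservative route described above.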

This is actually part of the secret behind the high availability figures achieved in Telecoms. Even the interaction with end users is designed to be fault-tolerant, giving the serving infrastructure some margin for recovery.

The separation of the rather sensitive data path and the complex but fault-tolerant control plane is also important. Data processing units are kept as simple as possible, and are fault-isolated from the control plane. This model is becoming blurred by the trend towards tighter integration, but as a mental model, it is good to recall what the benefits of separation were.

BR,
Ulf W

On 7 Sep 2011, at 16:27, Thomas Elsgaard wrote:

> On Wed, Sep 7, 2011 at 9:51 AM, Ulf Wiger
> <ulf.wiger@REDACTED> wrote:
>> 
>> On 6 Sep 2011, at 23:09, Jon Watte wrote:
>> 
>>> Stateful, as in the fail-over needs to be "hot" and "online" and replicating the state of the first application faithfully?
>>> 
>>> The danger with such approaches is that, if the state becomes corrupt through some chain of events, then the replicated copy may also be corrupt, and the "slave" crashes when the "master" crashes. It still works great in case of hardware failure on the master instance, of course.
>> 
>> You are right. One way to mitigate this is to put some effort into designing a replication format that does not simply mirror the internal state. Not only will this reduce the likelihood of propagating corrupted state; it will also simplify potential future upgrades and extensions, and make it easier to analyse the traffic flowing between nodes.
>> 
>> One should also think through at which points it is at all meaningful to replicate. I like to refer to "stable-state replication", which doesn't really say anything about the frequency of updates, but rather highlights that there are usually discrete points where recovery from error is meaningful. The transition states between these points tend to be volatile, and replicating them may serve little purpose.
>> 
>> BR,
>> Ulf W
>> 
>> Ulf Wiger, CTO, Erlang Solutions, Ltd.
>> http://erlang-solutions.com
>> 
> 
> Hi All
> 
> Thanks for the input, there is no easy way ;-) But I will take it into
> consideration
> 
> ///Thomas
