[erlang-questions] State Management Problem

Sat Dec 19 06:53:41 CET 2015

On Saturday, 19 December 2015 00:42:26 aman mangal wrote:
> Hi everyone,
> 
> I have been reading a few blogs on Erlang lately and some of them strongly
> point out that Erlang solves the reliability problem very nicely for
> distributed systems. But when I really think about it, Erlang solves only
> half of the reliability problem. It creates duplicate actors and handles their
> crashes by linking and supervision, but it does not handle the distributed
> state management problem at all. If I go back and look at the thesis of Joe
> Armstrong, it also talks about everything as an actor model. I am wondering
> what assumptions were made about state management at the time of creation
> of the language as well as what are good ways to handle the other half of
> the reliability problem when it comes to Erlang? I understand that this is
> a hard problem to solve but at the same time, it seems to be a generic
> problem for Distributed Systems. Does/can Erlang provide any generic
> solutions?

Short answer:

No.

tl;dr:

Three fundamental problems exist: consistency, availability, partition tolerance. Pick two. The Rules forbid solving all three at once.

Discussion:

The problems of distributed data are threefold, and only two can be solved at a time unless you happen to know how to either freeze time, open a wormhole or beat the speed of light. This is why there are no generic solutions to distributed data, only solutions that make tradeoffs of various types, and different tradeoffs are best suited to specific situations -- hence the impossibility of genericizing any solution.

The basic problem is described in the CAP theorem. It says a system can have:
- Consistency
- Availability
- Partition tolerance

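but that you can only have two at once. Put more precisely: while a partition is in effect, you have to give up either consistency or availability until it heals.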

That doesn't mean that all parts of your system have to make the same tradeoff with regard to state management, but again, the fact that a tradeoff must be made is an indication that there can never be a truly generic solution to this.
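
To make the tradeoff a bit more concrete, here is a minimal sketch of two write policies for a replicated value. Nothing here comes from OTP; the module name `cap_sketch`, the function `write/3` and the policy atoms `cp` and `ap` are all made up for illustration, and a real system would have to do far more than count reachable replicas.

```erlang
-module(cap_sketch).
-export([write/3]).

%% Hypothetical illustration only: decide what to do with a write when
%% `Reachable` of `Total` replicas (counting ourselves) can be contacted.
%%
%% Policy `cp`: refuse the write unless a strict majority is reachable.
%%   Stays consistent, but the minority side of a partition is unavailable.
%% Policy `ap`: always accept the write locally and reconcile later.
%%   Stays available, but the two sides of a partition may diverge.
-spec write(cp | ap, non_neg_integer(), pos_integer()) ->
          ok | {ok, needs_reconciliation} | {error, no_quorum}.
write(cp, Reachable, Total) when 2 * Reachable > Total ->
    ok;
write(cp, _Reachable, _Total) ->
    {error, no_quorum};
write(ap, Reachable, Total) when Reachable =:= Total ->
    ok;
write(ap, _Reachable, _Total) ->
    {ok, needs_reconciliation}.
```

With five replicas split 3/2, the minority side gets `{error, no_quorum}` from `cap_sketch:write(cp, 2, 5)` and `{ok, needs_reconciliation}` from `cap_sketch:write(ap, 2, 5)`. Neither answer is wrong; they are different tradeoffs, and different parts of one system may reasonably pick different ones.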

What Erlang lets you do is decide *for sure* whether something is running or crashing, instead of handling random faults in ad hoc ways. Tolerance for distributed failures is *also* something Erlang leaves up to the programmer to figure out, because the same CAP problem that exists in distributed state management also applies to the system's view of the state of its own operational capacity. (Does every node know the state of every other node? That's data, too!)
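
As a small sketch of the "for sure" part, a monitor turns a crash into an ordinary message. The module name `watch_sketch` and the one-second timeout are arbitrary choices for this example; `spawn_monitor/1` and the `'DOWN'` message format are standard.

```erlang
-module(watch_sketch).
-export([run/0]).

%% Hypothetical demo: spawn a worker that crashes immediately and learn
%% about it through a monitor instead of guessing.
run() ->
    {Pid, Ref} = spawn_monitor(fun() -> exit(boom) end),
    receive
        %% The runtime delivers exactly one 'DOWN' message per monitor.
        {'DOWN', Ref, process, Pid, Reason} ->
            {worker_down, Reason}
    after 1000 ->
        %% No 'DOWN' message arrived within one second.
        timeout
    end.
```

Calling `watch_sketch:run()` returns `{worker_down, boom}`. For nodes, `net_kernel:monitor_nodes(true)` delivers `{nodeup, Node}` and `{nodedown, Node}` messages in the same spirit, but a `nodedown` only tells you that *this* node lost contact with the other one, not that the other node is actually dead. That is the CAP problem showing up again.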

So this is a hard problem. In the real world *most* systems seem to be designed to start involving humans once partitions occur (though most have the ability to run in a degraded state of service until a sysop fixes things). In the imaginary world where there is a software package to cure every ill, all our theories are correct, software is bug-free and network latency is zero, this is handled automatically by correct implementations of logically flawless leader election algorithms that always work, and a second partition never occurs in the middle of partition resolution. But we don't live in that world.

Partition tolerance is a hard problem, maybe the hardest to code around, so most systems seem to make a tradeoff that sacrifices (some level of) partition tolerance in exchange for (general, but maybe deferred) consistency and (an absolutely insane focus on) availability.

-Craig