[erlang-questions] Why is Erlang what it is?

Sat Dec 16 02:08:38 CET 2006

>>>  Major upgrades in very large applications are often difficult to
>>>  perform using the soft upgrade facilities alone. We tend more and
>>>  more towards doing redundancy upgrades instead. One reason is that we
>>>  have to be able to handle that anyway, and if we can do a hitless
>>>  redundancy upgrade, why bother with lots of other techniques as well?
>
> For one thing, in a system with tens or hundreds of thousands of
> processes, it is very difficult to verify a sequence where each
> process individually converts its state. How do you do equivalence
> checks?

pardon my ignorance, but I've been trying to figure out what you mean
when you talk about redundancy upgrades. I've found this old message

http://www.erlang.org/ml-archive/erlang-questions/200501/msg00249.html

which suggests to me that it means something like this:

1. knock out a node to be upgraded
2. fault-tolerance and system redundancy will kick in to take over the work
3. upgrade the node
4. restart it
5. rebalance workload to re-integrate the node
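
to make sure I have the sequence right, here is a toy simulation of steps 1-5 in Python (not Erlang, and not anyone's actual procedure). everything in it is a made-up illustration: `Node`, `rebalance`, `redundancy_upgrade`, and the even-split load model are my assumptions, and a real system would take over work through its own fault-tolerance mechanisms rather than a function call.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for one node in a redundant cluster."""
    name: str
    version: int
    up: bool = True
    load: int = 0  # work units currently assigned to this node


def rebalance(nodes, total_load):
    """Spread total_load as evenly as possible over the nodes that are up."""
    live = [n for n in nodes if n.up]
    if not live:
        raise RuntimeError("no live nodes left to carry the load")
    for n in nodes:
        n.load = 0
    share, extra = divmod(total_load, len(live))
    for i, n in enumerate(live):
        n.load = share + (1 if i < extra else 0)


def redundancy_upgrade(nodes, new_version, total_load):
    """Upgrade one node at a time, following steps 1-5 of the list."""
    for node in nodes:
        node.up = False               # 1. knock out the node to be upgraded
        rebalance(nodes, total_load)  # 2. the remaining nodes take over
        node.version = new_version    # 3. upgrade the node
        node.up = True                # 4. restart it
        rebalance(nodes, total_load)  # 5. re-integrate the node
    return nodes


cluster = [Node("a", 1), Node("b", 1), Node("c", 1)]
rebalance(cluster, 9)
redundancy_upgrade(cluster, new_version=2, total_load=9)
```

note that between steps 1 and 4 the cluster runs with one node fewer, and that old and new versions coexist until the last node is done; both points come up in the questions below.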

I have two (groups of;-) questions about this:

a) is this really the idea (roughly)? it seems that "shock-testing" the running
    system and letting it recover with fewer resources, as in 2, ought to be
    more expensive (disruptive, risky) than a planned upgrade. or does
    redundancy mean shuttle-style, so that several nodes are doing the
    same work, and one dropping out won't matter (is there even support
    for such "hot" backup in Erlang messaging?)?

    in that older message, Ulf seems to say that this is one of many upgrade
    methods, each with disadvantages, while being unaware of documented
    guidelines as to when to upgrade how, and his recent replies suggest
    that the move towards redundancy upgrade is a question of planning
    costs and doing "update in the large" rather than "per process".

    does that mean that redundancy upgrades are still seen as not as "nice"
    as soft upgrades, but cheaper to implement, and working around
    practical limitations of the supposedly right way?

b) apart from reduced redundancy during 1-4, isn't there a risk that after
    4&5, the system could become unstable if the upgraded node interacts
    badly with the old nodes (although you probably plan for node-level
    redundancy upgrades as you would for process-level soft upgrades)?

    assuming that redundancy means several nodes doing the same work
    (monitoring to shut out deviating nodes rather than waiting for trouble
    and other nodes being prepared to take on more work after one goes
    down), has anyone considered doing upgrades on "shadow" nodes
    rather than the "live" ones? in particular, would it be possible to
    monitor the upgraded "shadow" network for discrepancies wrt
    the remaining "live" network, not permitting the upgraded "shadows"
    to rejoin the "live" network unless they demonstrably cooperate for
    some interval? or is that already part of the idea?
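
    to illustrate what I mean by "demonstrably cooperate", here is a minimal
    sketch in Python (again, not Erlang). it is only an assumption about how
    such a check might look: `shadow_ready`, the handler functions, and the
    choice to measure the "interval" as a number of consecutive matching
    replies rather than as wall-clock time are all hypothetical.

```python
def shadow_ready(requests, live_handler, shadow_handler, required_matches):
    """Feed the same requests to the live and the shadow handler.

    Return True once the shadow has agreed with the live handler on
    `required_matches` consecutive requests. Any discrepancy resets the
    count, so one bad reply delays re-admission.
    """
    streak = 0
    for req in requests:
        if live_handler(req) == shadow_handler(req):
            streak += 1
            if streak >= required_matches:
                return True
        else:
            streak = 0  # discrepancy: start the observation interval over
    return False


def live(x):
    """Hypothetical behaviour of the old, live code."""
    return x * 2


def good_shadow(x):
    """Upgraded code that behaves the same as the live code."""
    return x + x


def bad_shadow(x):
    """Upgraded code that deviates from the live code on one input."""
    return x * 2 if x != 3 else -1
```

    with these toy handlers, `shadow_ready(range(10), live, good_shadow, 5)`
    admits the shadow, while `shadow_ready(range(6), live, bad_shadow, 5)`
    does not, because the mismatch on input 3 restarts the count.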

I'm sure that is all familiar stuff to you, but I'd like to expand my horizon;-)

Claus