[erlang-questions] Why is Erlang what it is?
Claus Reinke
claus.reinke@REDACTED
Sat Dec 16 02:08:38 CET 2006
>>> Major upgrades in very large applications are often difficult to
>>> perform using the soft upgrade facilities alone. We tend more and
>>> more towards doing redundancy upgrades instead. One reason is that we
>>> have to be able to handle that anyway, and if we can do a hitless
>>> redundancy upgrade, why bother with lots of other techniques as well?
>
> For one thing, in a system with tens or hundreds of thousands of
> processes, it is very difficult to verify a sequence where each
> process individually converts its state. How do you do equivalence
> checks?
pardon my ignorance, but I've been trying to figure out what you mean
when you talk about redundancy upgrades. I've found this old message
http://www.erlang.org/ml-archive/erlang-questions/200501/msg00249.html
which suggests to me that it means something like this:
1. knock out a node to be upgraded
2. fault-tolerance and system redundancy will kick in to take over the work
3. upgrade the node
4. restart it
5. rebalance workload to re-integrate the node
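To make the five steps concrete, here is a minimal sketch of how I read them, written in Python purely for illustration. The `Node` class, `rebalance` and `redundancy_upgrade` are hypothetical names of mine, not anything from Erlang/OTP, and the round-robin job placement is an assumption standing in for whatever the real fault-tolerance machinery does.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for one node in the cluster."""
    name: str
    version: int = 1
    up: bool = True
    jobs: list = field(default_factory=list)


def rebalance(nodes, jobs):
    """Spread jobs round-robin over the nodes that are currently up."""
    live = [n for n in nodes if n.up]
    if not live:
        raise RuntimeError("no live nodes left to carry the work")
    for n in nodes:
        n.jobs = []
    for i, job in enumerate(jobs):
        live[i % len(live)].jobs.append(job)


def redundancy_upgrade(nodes, new_version):
    """Upgrade one node at a time while the others carry its load."""
    jobs = [j for n in nodes for j in n.jobs]
    for node in nodes:
        node.up = False              # 1. knock out the node
        rebalance(nodes, jobs)       # 2. the remaining nodes take over
        node.version = new_version   # 3. upgrade it while it is offline
        node.up = True               # 4. restart it
        rebalance(nodes, jobs)       # 5. re-integrate it into the workload
    return nodes


cluster = [Node("a"), Node("b"), Node("c")]
rebalance(cluster, list(range(9)))
redundancy_upgrade(cluster, new_version=2)
```

Note that with a single node the sketch fails at step 2, which is the point: the scheme only works if there is spare capacity to absorb the work of the node being upgraded.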
I have two (groups of;-) questions about this:
a) is this really the idea (roughly)? it seems that "shock-testing" the running
system and letting it recover with fewer resources, as in 2, ought to be
more expensive (disruptive, risky) than a planned upgrade. or does
redundancy mean shuttle-style, so that several nodes are doing the
same work, and one dropping out won't matter (is there even support
for such "hot" backup in Erlang messaging?)?
in that older message, Ulf seems to say that this is one of many upgrade
methods, each with disadvantages, while being unaware of documented
guidelines as to when to upgrade how, and his recent replies suggest
that the move towards redundancy upgrade is a question of planning
costs and doing "update in the large" rather than "per process".
does that mean that redundancy upgrades are still seen as not as "nice"
as soft upgrades, but cheaper to implement, and working around
practical limitations of the supposedly right way?
b) apart from reduced redundancy during 1-4, isn't there a risk that after
4&5, the system could become unstable if the upgraded node interacts
badly with the old nodes (although you probably plan for node-level
redundancy upgrades as you would for process-level soft upgrades)?
assuming that redundancy means several nodes doing the same work
(monitoring to shut out deviating nodes rather than waiting for trouble
and other nodes being prepared to take on more work after one goes
down), has anyone considered doing upgrades on "shadow" nodes
rather than the "live" ones? in particular, would it be possible to
monitor the upgraded "shadow" network for discrepancies wrt
the remaining "live" network, not permitting the upgraded "shadows"
to rejoin the "live" network unless they demonstrably cooperate for
some interval? or is that already part of the idea?
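A rough sketch of the "shadow" check I have in mind, again in Python and purely illustrative: `shadow_may_rejoin` and the three handler functions are hypothetical, and I am assuming that "cooperating for some interval" can be approximated by a run of consecutive requests on which the old and the upgraded code give the same answer.

```python
def shadow_may_rejoin(live, shadow, requests, min_agreeing=100):
    """Replay the same requests on the live and the shadow handler.

    Returns True as soon as the two have agreed on min_agreeing
    consecutive requests; any discrepancy restarts the count.
    Returns False if the requests run out first.
    """
    streak = 0
    for req in requests:
        if live(req) == shadow(req):
            streak += 1
            if streak >= min_agreeing:
                return True
        else:
            streak = 0  # a discrepancy restarts the probation interval
    return False


def old_handler(x):
    """Behaviour of the nodes that are still live (made up for the example)."""
    return x * 2


def good_upgrade(x):
    """Upgraded code that behaves identically to the old code."""
    return x + x


def bad_upgrade(x):
    """Upgraded code that deviates once inputs reach 50."""
    return x * 2 if x < 50 else x * 2 + 1


good_ok = shadow_may_rejoin(old_handler, good_upgrade, range(200))
bad_ok = shadow_may_rejoin(old_handler, bad_upgrade, range(200))
```

Strict equality of replies is the simplest possible comparison; an upgrade that is meant to change observable behaviour would need a looser notion of "cooperating" than this.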
I'm sure that is all familiar stuff to you, but I'd like to expand my horizon;-)
Claus