[erlang-questions] Why is Erlang what it is?

Sat Dec 16 12:18:40 CET 2006

Den 2006-12-16 02:08:38 skrev Claus Reinke <claus.reinke@REDACTED>:

>>>>  Major upgrades in very large applications are often difficult to
>>>>  perform using the soft upgrade facilities alone. We tend more and
>>>>  more towards doing redundancy upgrades instead. One reason is that we
>>>>  have to be able to handle that anyway, and if we can do a hitless
>>>>  redundancy upgrade, why bother with lots of other techniques as well?
>>
>> For one thing, in a system with tens or hundreds of
>> thousands of processes, it is very difficult to verify a
>> sequence where each process individually converts its
>> state. How do you do equivalence checks?
>
> pardon my ignorance, but I've been trying to figure out
> what you mean when you talk about redundancy upgrades.

Yes, apologies for not being clear.

> I've found this old message
>
> http://www.erlang.org/ml-archive/erlang-questions/200501/msg00249.html
>
> which suggests to me that it means something like this:
>
> 1. knock out a node to be upgraded
> 2. fault-tolerance and system redundancy will
>    kick in to take over the work
> 3. upgrade the node
> 4. restart it
> 5. rebalance workload to re-integrate the node

Yes, that's pretty much it. In (1), what we really do is
administratively block the node to be upgraded. Administrative
blocking means that applications are signaled to migrate as
smoothly as possible elsewhere, so that the node can be serviced.
This is a typical telecom thingy, and the smooth migration is
normally referred to as "takeover" (at least by us).
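
To make the block/takeover/upgrade/deblock cycle concrete, here is a
minimal simulation. It is written in Python purely as an illustration;
it is not Erlang and not our actual implementation. The Node and
Cluster classes, the version strings and the round-robin migration
policy are all assumptions made up for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    version: str
    blocked: bool = False
    jobs: list = field(default_factory=list)


class Cluster:
    """Hypothetical cluster model: a list of nodes, each holding jobs."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def _available(self, exclude=None):
        return [n for n in self.nodes if not n.blocked and n is not exclude]

    def block(self, node):
        """Administratively block a node and migrate its jobs (takeover)."""
        targets = self._available(exclude=node)
        if not targets:
            raise RuntimeError("no unblocked node left to take over")
        node.blocked = True
        # Assumed policy: spread the jobs round-robin over the other nodes.
        for i, job in enumerate(node.jobs):
            targets[i % len(targets)].jobs.append(job)
        node.jobs = []

    def deblock(self, node):
        """Re-integrate a serviced node and rebalance the workload."""
        node.blocked = False
        self.rebalance()

    def rebalance(self):
        active = self._available()
        jobs = [job for n in active for job in n.jobs]
        for n in active:
            n.jobs = []
        for i, job in enumerate(jobs):
            active[i % len(active)].jobs.append(job)

    def job_count(self):
        return sum(len(n.jobs) for n in self.nodes)


def redundancy_upgrade(cluster, new_version):
    """Upgrade one node at a time while the others carry the traffic."""
    for node in cluster.nodes:
        before = cluster.job_count()
        cluster.block(node)            # steps 1-2: block, takeover
        node.version = new_version     # steps 3-4: upgrade and restart, idle
        cluster.deblock(node)          # step 5: re-integrate, rebalance
        assert cluster.job_count() == before  # no work lost on the way
```

The point of the sketch is that the node being upgraded carries no
traffic while it is serviced, so the only state that has to survive
is whatever was migrated before the upgrade started.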

> I have two (groups of;-) questions about this:
>
> a) is this really the idea (roughly)? it seems that
>    "shock-testing" the running system and letting it
>    recover with fewer resources, as in 2, ought to be
>    more expensive (disruptive, risky) than a planned
>    upgrade.

It might seem so, but blocking/deblocking is standard
maintenance procedure, and is something we test extensively.
Furthermore, when you study what goes on during a failover/
takeover, it's not that complex, really. Soft upgrade in a
system with 50,000-100,000 concurrent processes, on the other hand,
is a *very* complex procedure - in addition, it is very difficult
to determine whether it really went well.

>    or does redundancy mean shuttle-style, so that several
>    nodes are doing the same work, and one dropping out
>    won't matter (is there even support for such "hot"
>    backup in Erlang messaging?)?

Not for us, but yes, there is such support. Gen_leader is one
type of hot standby behaviour. Pg2 is another, I believe.

>     in that older message, Ulf seems to say that this
>     is one of many upgrade methods, each with
>     disadvantages, while being unaware of documented
>     guidelines as to when to upgrade how, and his
>     recent replies suggest that the move towards
>     redundancy upgrade is a question of planning
>     costs and doing "update in the large" rather
>     than "per process".

Something like that.

>     does that mean that redundancy upgrades are
>     still seen as not as "nice" as soft upgrades,
>     but cheaper to implement, and working around
>     practical limitations of the supposedly right way?

I think soft upgrades are extremely nice, and in the small,
also the cheapest method. Unfortunately, they don't quite
scale to support really large upgrades, where there may
be substantial changes to the whole environment - processes
being added or removed, or perhaps being rearranged for
better supervision characteristics; ets tables deleted, added
or restructured, mnesia tables deleted, added or transformed.
All of this can be done smoothly, but keeping track of it all,
while servicing traffic and honoring response requirements,
is a bear of a task.

> b) apart from reduced redundancy during 1-4, isn't there
> a risk that after 4&5, the system could become unstable
> if the upgraded node interacts badly with the old nodes
> (although you probably plan for node-level redundancy
> upgrades as you would for process-level soft upgrades)?

There is risk involved in all major upgrades. A huge
advantage here with redundancy upgrades is that it is much
easier to verify that the upgrade went well.
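
As a rough illustration of why this is easier: the upgraded node is
idle and blocked, so a handful of node-level checks can be run against
it before it is deblocked, instead of checking the converted state of
every process. The sketch below is Python and purely hypothetical; the
node_state dictionary, the probe names and the verify_node function
are invented for the example and do not describe our real procedure.

```python
def verify_node(node_state, expected_version, probes):
    """Return the names of failed checks; an empty list means 'deblock'."""
    failures = []
    if node_state.get("version") != expected_version:
        failures.append("version")
    for name, probe in probes.items():
        try:
            ok = bool(probe(node_state))
        except Exception:
            # A probe that crashes counts as a failed check.
            ok = False
        if not ok:
            failures.append(name)
    return failures


# Hypothetical usage: two probes against a made-up node description.
state = {"version": "v2", "tables_loaded": 12, "apps_running": ["core"]}
probes = {
    "tables": lambda s: s["tables_loaded"] == 12,
    "apps": lambda s: "core" in s["apps_running"],
}
failed = verify_node(state, "v2", probes)
```

If any check fails, the node simply stays blocked and the rest of the
system keeps running on the old version.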

BR,
Ulf W

-- 
Ulf Wiger