[erlang-questions] Teaching Erlang as part of a paper -- advice sought

Tue Feb 9 10:02:27 CET 2010

On Mon, Feb 8, 2010 at 11:53 PM, Richard O'Keefe <ok@REDACTED> wrote:
>
> On Feb 8, 2010, at 10:18 PM, Joe Armstrong wrote:
>
>> Hi Richard,
>>
>> Error detection and recovery seems to be conspicuously absent from your
>> list.
>
> From the list, yes, from my thoughts, no.
> In fact just after sending that message I looked at it with some
> annoyance and realised I'd left that out.  I even had another
> "essence of Erlang" draft which had it in!
>
> If I were trying sufficiently hard to wriggle out of looking stupid,
> I'd say "I did mention building RELIABLE systems".
> But I'm not that kind of person (:-) (:-).
>
>> I think this point should be hammered home early and often.
>>
>>    "To build a fault-tolerant system you need at least two machines"
>>    "Let it crash"
>>
> There was something bothering me about everything else I looked at,
> including Google's "go".  I couldn't put my finger on it.
>
> THAT's what bothered me.
>
> Yes, Joe, you're right, and this is PRECISELY the kind of advice I
> needed.  This is EXACTLY the thing you don't find in the courses I
> had been looking at.
>
> It's fairly conventional to start a concurrent programming paper with
> very much a "single system" mindset and then move on to distribution
> later.  I was already planning to *start* with distribution as the
> model, and touch on CTHULHU programming later.  And of course doing
> it this way around will make it much much easier for me to follow
> your "this point ... early" advice.
>

Absolutely. One (open) question I ask when I give a lecture is the following:

Suppose you want to design a system of (say) max 200 communicating
nodes (or agents, or whatever).
It will start with a small number of nodes (say 10) and grow with time
as demand increases.

There could be three ways to do this:

1) design for one node and scale it up to 200
2) design for 200 nodes and scale it down to 10
3) design for an infinite (or very very large) number of nodes and
scale it down to 10

Question: Is the design the same?

Answer: No - I don't think so - I don't know - this is a thought
experiment, not a real one.
I don't think you end up with the same architecture. From experience,
building fixed limits into anything
is always wrong - if we design for an address space of zettabytes we
won't go far wrong.

(After I wrote this I checked - I was wrong - Mr. Google told me that
Eric Schmidt had said that
digital data is growing at the rate of about 1 zettabyte/year)

I would design for a colossal address space and scale downwards - this
might be sub-optimal for small problems
but who cares - "small problems are not the problem", i.e. small problems
don't consume massive resources, so there is no point optimizing them.
Big problems do - so design for the worst case and scale down.

<aside>most problems *are* small, which is why there is no point
optimizing them at all - anything that
consumes human keyboard input results in a small amount of data to
process - so *all* programs
like wiki-markup expanders, compilers etc. can be written in an
appallingly inefficient manner
and still do their jobs quickly enough. Non-human input (think video
etc.) produces large amounts of data
and so processing needs to be efficient (today) - but I suspect this
will not be true in the near future
(think 1K cores) - fortunately image processing is easy to parallelize</aside>

The same is true for error recovery. Want a failure probability of
10^-100? Then take 34 *independent*
machines, each with a failure probability of 10^-3 - the chance they all
fail at the same time is 10^-102.
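
A quick way to check that arithmetic is sketched below. It assumes the
machines fail independently and all have the same failure probability;
the function names are made up for illustration, and Python is used only
because exact fractions keep the numbers readable.

```python
from fractions import Fraction


def all_fail_probability(p_single: Fraction, machines: int) -> Fraction:
    """Probability that every one of `machines` independent machines fails at once."""
    return p_single ** machines


def machines_needed(p_single: Fraction, target: Fraction) -> int:
    """Smallest number of independent machines whose joint failure
    probability is at most `target`."""
    n = 1
    while all_fail_probability(p_single, n) > target:
        n += 1
    return n


p = Fraction(1, 10**3)  # one machine fails with probability 10^-3
print(all_fail_probability(p, 34) == Fraction(1, 10**102))  # True
print(machines_needed(p, Fraction(1, 10**100)))             # 34
```

33 machines only get you to 10^-99, which is why the count is 34. The
whole result rests on the word *independent*: machines sharing a power
supply or a switch do not multiply out like this.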

When we start talking in terms of 5 nines reliability we've missed the
plot, we're walking in a forest but we don't see the trees. To make a
system with 5 nines reliability we design for 1000 nines reliability
and scale downwards,
NOT design for 3 nines and scale up.

If you shift your perspective and ask "how can I make things infinitely
reliable and infinitely scalable, and then
scale down?" you will ask the right questions and get a decent design.

Then the question is "how can I make an infinitely reliable or scalable
system" and the answer to both
questions is the same - make the bits independent. If the bits are
independent you can scale them
by replicating, and make them fault-tolerant (by replication). So
architecturally the identification of the
independent components becomes the thing you have to look for to make
a scalable or fault-tolerant system.
Make it very scalable and you (almost) get the fault-tolerance for
free (just add links, pepper, and a bit of
chilli powder, bring to the boil and stir every ten minutes till ready).
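
To make the "independent bits" point concrete, here is a toy sketch in
Python (not Erlang, and not a real design). The names Replica, Pool and
ReplicaDown are invented for the example. The replicas share no state,
so the same pool that spreads load over them can also skip one that has
died.

```python
import itertools


class ReplicaDown(Exception):
    """Raised by a replica that has crashed (hypothetical failure signal)."""


class Replica:
    """An independent worker: it shares no state with the other replicas."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ReplicaDown(self.name)
        return f"{self.name} handled {request}"


class Pool:
    """Spreads requests over replicas (scaling) and skips dead ones (fault tolerance)."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self._next = itertools.cycle(range(len(self.replicas)))

    def call(self, request):
        # Try each replica at most once, in round-robin order.
        for _ in range(len(self.replicas)):
            replica = self.replicas[next(self._next)]
            try:
                return replica.handle(request)
            except ReplicaDown:
                continue  # let it crash; another independent replica takes over
        raise ReplicaDown("all replicas are down")


pool = Pool(Replica(f"r{i}") for i in range(3))
print(pool.call("a"))           # served by r0
pool.replicas[1].alive = False  # r1 crashes
print(pool.call("b"))           # r1 is skipped, r2 serves the request
```

Adding replicas to the pool is the scaling story; the except branch is
the fault-tolerance story. Neither would work if the replicas depended
on each other's state.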

/Joe
