[erlang-questions] Erlang basic doubts about String, message passing and context switching overhead

Wed Jan 11 01:12:01 CET 2017

On 11/01/17 7:58 AM, Bhag Chandra wrote:
> 1) "Strings in Erlang are internally treated as a list of integers of
> each character's ASCII values, this representation of string makes
> operations faster. For example, string concatenation is constant time
> operation in Erlang."  Can someone explain why?

Pretty much all of this is somewhere between misleading and confusing.
Mind you, pretty much everything about strings in general is
somewhere between confusing and misguided, whatever the programming
language, especially since the introduction of Unicode.

The closest thing to the truth is to say that Erlang has no native
string datatype whatsoever.  Instead, there are several general purpose
data types that are pressed into service to fake it.

(1) An atom is a unique string.  There's more to say about them,
but the general rule is DON'T use them for strings.

(2) A list of character codes is stringy.  Once upon a time character
codes were 8-bit quantities, so taking two computer words per character
was sneered at.  Unicode has a 21-bit character set these days.  (Not
only that, it's now racist and sexist, so that you can distinguish
between pink/yellow male/female astronaut, I kid you not.  Why do we
need a character "man in business suit levitating"?)  The astral planes
are increasingly being populated, "Version 9.0 adds exactly 7,500 
characters, for a total of 128,172 characters."

Let X = "1...m" and Y = "1...n".
Then X ++ Y concatenates them in O(m) time, because ++ has to copy X.
NOT O(1).

(3) For many purposes, a *tree* of numbers and pairs can be used to
hold text.  It is *this* that allows O(1) concatenation:
[X|Y] *isn't* the string "1...m1...n" but can in certain contexts
*represent* it.  You can, for example, build up a text of total
size S in O(S) time and then flatten it (or write it or transmit it)
in O(S) time.

Look up the 'iolist()' type in Erlang documentation.
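
To make the difference between (2) and (3) concrete, here is a minimal
sketch.  The module name and the sample strings are made up for
illustration; the functions called (lists:flatten/1, iolist_to_binary/1)
are standard.

```erlang
-module(iolist_sketch).
-export([demo/0]).

%% Illustrative only: module name and sample data are invented.
demo() ->
    X = "hello, ",
    Y = "world",
    %% ++ copies every cell of X, so it costs O(length(X)).
    Flat = X ++ Y,
    %% [X|Y] allocates one cons cell: O(1), whatever the sizes.
    Tree = [X | Y],
    %% The tree is not the string itself...
    false = (Tree =:= Flat),
    %% ...but flattening it recovers the string, in time linear
    %% in the total size.
    Flat = lists:flatten(Tree),
    %% Functions that take an iolist accept the tree directly.
    <<"hello, world">> = iolist_to_binary(Tree),
    ok.
```

In practice you rarely need to flatten at all: most output functions
accept the tree as it is.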

(4) The 'binary' type was originally a byte string and is now a
bit string.  A byte sequence can of course represent a string,
with the encoding of the text being provided by context.  It
used to be common to use Latin 1; it's now common to use UTF-8.
There is library code to help with this, but frankly, Unicode
is so appallingly complex that there will probably never be
enough library support (in any language).

Concatenating byte strings of length m and n costs O(m+n).

(5) A tree of numbers and pairs could also contain binaries.
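
A sketch of (5), assuming a hypothetical greeting/1 helper (not a
library function): binaries, character lists and single character
codes are mixed in one tree, and nothing is concatenated until the
very end.

```erlang
-module(iodata_sketch).
-export([greeting/1]).

%% Hypothetical helper, for illustration only.
greeting(Name) when is_binary(Name) ->
    %% A binary, another binary, a character list, and a lone
    %% character code, all in one tree.
    Parts = [<<"Hello, ">>, Name, "!", $\n],
    %% One pass over the tree produces the final byte string.
    iolist_to_binary(Parts).
```

For example, greeting(<<"Joe">>) gives <<"Hello, Joe!\n">>.  Note that
iolist_to_binary/1 treats integers as bytes (0..255); for text outside
Latin 1 you would want unicode:characters_to_binary/1 instead.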

The important thing about strings is that they are good at
accepting data you don't care to inspect further, storing it,
and giving it back later, and LOUSY at almost anything else,
in EVERY programming language.

One of the ways that they are lousy is that thanks to Unicode's
rules, there is *structure* in strings (which were historically
unstructured), so that you can have a well formed string of
2 characters, split it into two 1-character strings, and find
that at least one of those strings is no longer legal.
(Emoji flags, for example.)
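
A small sketch of the flag case, using a list of code points.  The
module name is made up; the two code points are the regional indicator
symbols N and Z, which a renderer that supports emoji flags shows as
one New Zealand flag.

```erlang
-module(flag_sketch).
-export([demo/0]).

demo() ->
    %% U+1F1F3 and U+1F1FF: regional indicators N and Z.
    Flag = [16#1F1F3, 16#1F1FF],
    %% As a list it has two elements, although it displays as
    %% one flag.
    2 = length(Flag),
    %% Splitting in the middle is perfectly legal list surgery...
    {[N], [Z]} = lists:split(1, Flag),
    %% ...but each half is now a lone regional indicator and no
    %% longer denotes a flag.
    {N, Z}.
```

Nothing at the list level warns you that the split destroyed anything;
the structure lives entirely in Unicode's rules.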

>
> 2) "It makes sense to use Erlang only where system's availability is
> very high".  Is it not a very general requirement of most of the
> systems? Whatsapp to Google to FB to Amazon to Paypal to Barclays etc
> they all are high availability systems, so we can use Erlang in all of them?

I do not think many people in this mailing list would say
"it makes sense to use Erlang ONLY where availability [must be] high."

People use Erlang (and other languages running on the Erlang VM,
such as LFE and Elixir) for all sorts of things.  Yes, Erlang is
not as fast as some other languages.

Let's have some perspective on that, though.
Just today I measured four programs to solve the same problem,
using the same structure, written in four languages.
C          0.04 sec    -- has types
Smalltalk  0.25 sec    -- lacks them
Java       0.65 sec    -- has types
Python     2.98 sec    -- lacks them.
Have figures like that ever stopped anyone using Python?
Have they even encouraged people to switch to Smalltalk?

People use a programming language when it lets them get
the job at hand done to an adequate standard.
These days this is as much about infrastructure such as
profiling, test coverage analysis, test frameworks,
documentation tools, libraries, interface support (JSON,
XML, ASN.1, Protobufs, whatever) as it is about the
language as such.

I find that it makes sense to use Erlang whenever concurrency
is important/useful, because it's so much easier to get
concurrent programs right than in Java or C11 or pretty much
anything else (although Ada comes close).  There are of course
plenty of libraries for other languages that claim to add
Erlang-style abilities to those languages; what they don't
mention is that you cannot remove dangers by adding libraries.
For example, gets() has been deprecated in C; the OpenBSD
linker tells you off if you use it.  There are safer alternatives.
But nothing can stop you writing your own version of gets(),
and the OpenBSD linker has no idea that mygets() is just as
dangerous as gets() ever was.

> 3) "Every message which is sent to a process, goes to the mailbox of
> that process. When process is free, it consumes that message from
> mailbox". So how exactly does process ask from the mailbox for that
> message? Is there a mechanism in a process' memory which keeps polling
> its mailbox. I basically want to understand how message is sent from
> mailbox to my code in process.

"When process is free" should presumably be
"When that process is RUNNING and FEELS LIKE checking its mailbox,
  that message MAY BE consumed if it is one the process WANTS to
  consume."

A process reads from its mailbox using the 'receive' construction,
described in, for example, Learn You Some Erlang For Great Good
(http://learnyousomeerlang.com) which I recommend.  This happens
when the code decides it is time to try to receive a message.

If you know about message passing in UNIX,
Pid ! Msg       is like msgsnd(    Queue, Msg, Msg_Size, Flags)
                      or mq_send(   Queue, Msg, Msg_Size, Prio)
receive ... end is like msgrcv(    Queue, Msg, Msg_Size, Type, Flags)
                      or mq_receive(Queue, Msg, Msg_Size, &Prio)

That is, sending happens when there is something EXPLICIT in the
code to make it happen, and receiving happens when there is
something EXPLICIT in the code to make it happen, and there is
buffering in between so things don't have to be simultaneous.
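
A minimal sketch of that, with a made-up module name and message
shapes.  The child process is sent two messages and chooses to take
them in the opposite order, which shows that the mailbox only buffers:
it is the 'receive' in the code that decides what is consumed and when.

```erlang
-module(mailbox_sketch).
-export([demo/0]).

%% Illustrative only: module name and message formats are invented.
demo() ->
    Parent = self(),
    Child = spawn(fun() ->
        %% Wait for a {second, _} message; anything else stays
        %% in the mailbox untouched.
        A = receive {second, X} -> X end,
        %% Now pick up the {first, _} message that was sent earlier.
        B = receive {first, Y} -> Y end,
        Parent ! {self(), [A, B]}
    end),
    %% Explicit sends; they return at once, the mailbox buffers.
    Child ! {first, 1},
    Child ! {second, 2},
    %% Explicit receive, with a timeout so the caller cannot hang.
    receive
        {Child, Order} -> Order
    after 1000 ->
        timeout
    end.
```

Calling mailbox_sketch:demo() returns [2,1]: the messages were sent in
the order first, second, but consumed in the order the code asked for.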

> 4) We say that a message is passed from process A to process B by simply
> using a bang (!) character, but what happens behind the scenes to pass
> this message? Do both processes establish a tcp connection first and
> then pass message or what?

First off, we have Erlang "nodes" that are Unix or Windows "processes",
and inside them we have Erlang "processes" that are like Unix or Windows
"threads" except much cheaper; the word "fibres" is sometimes used to
describe similar things in those environments.

Erlang processes inside nodes on different machines may well use
TCP connections automatically set up between the nodes.  Or they
could use something else.  They could in principle use Infiniband
or carrier pigeons.

Erlang processes inside different nodes on the same machine
could use any IPC facility provided by the host OS, such as
System V message queues, POSIX message queues, pipes, UNIX sockets,
...

Erlang processes inside the same node will use shared memory,
which might or might not involve copying, depending on the Erlang
version.  Whatever happens, it's likely to be cheaper than TCP.

> 5) At 30:25 in this video ( https://youtu.be/YaUPdgtUYko?t=1825 ) Mr.
> Armstrong is talking about the difference between the context switching
> overhead between OS threads and Erlang processes. He says, thread
> context switching is of order 700 words but Erlang process context
> switching is ... ?

He said that OS thread switching MOVES about 700 words,
while Erlang process switching involves THREE REGISTERS.