[erlang-questions] Erlang basic doubts about String, message passing and context switching overhead

Wed Jan 11 09:03:23 CET 2017

Not to start a holy war, but Unicode is not complex; it just has a lot of
tables. And pretty much any modern language has immensely better Unicode
support built in than Erlang.
Thankfully there's ux.

About processes and mailboxes. Every process has its own memory, which is
used for its mailbox, stack and heap. A send simply appends to the end of
the receiver's mailbox. A receive scans the mailbox from the oldest message
onwards and takes the first one that matches one of its patterns (there is
a temporary area for incoming messages too, so senders are not blocked
while the receiver is scanning). Receiving a message which contains a
freshly created reference is optimized; otherwise a receive potentially
has to search the whole mailbox.
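
To make the reference case concrete, here is a minimal sketch of the usual
request/reply idiom. The module name ref_call and the doubling server are
made up for illustration; only make_ref/0, spawn/1, ! and receive are real.
It assumes the optimization applies when the reference is created just
before the receive in the same function, which is how this is normally
written.

```erlang
-module(ref_call).
-export([start/0, call/2, demo/0]).

%% Spawn a hypothetical server that doubles every number it is sent.
start() ->
    spawn(fun loop/0).

loop() ->
    receive
        {From, Ref, N} when is_pid(From), is_reference(Ref), is_integer(N) ->
            From ! {Ref, N * 2},
            loop();
        stop ->
            ok
    end.

%% Tag the request with a fresh reference and wait only for the reply
%% carrying that same reference. Ref is already bound here, so the
%% pattern {Ref, Reply} cannot match any unrelated message.
call(Server, N) ->
    Ref = make_ref(),
    Server ! {self(), Ref, N},
    receive
        {Ref, Reply} ->
            Reply
    after 1000 ->
        timeout
    end.

demo() ->
    Server = start(),
    %% An unrelated message is already queued; call/2 skips over it.
    self() ! unrelated,
    Reply = call(Server, 21),
    Server ! stop,
    %% The skipped message is still in the mailbox afterwards.
    receive
        unrelated -> {Reply, unrelated_still_queued}
    after 0 ->
        {Reply, lost}
    end.
```

ref_call:demo() should evaluate to {42, unrelated_still_queued}: the
selective receive in call/2 leaves the non-matching message where it was.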

The VM (BEAM) is a single OS process that runs several scheduler threads
(by default one per core) to allow for multi-core execution. The schedulers
share almost nothing, and each one executes one Erlang process at a time.
When a process switch happens, it is almost a matter of changing one
pointer to an internal structure; of course, as always with virtual
machines, the price is more indirection and slower code.

Strings are a mess. Generally there should only be lists of characters
(Unicode code points) and binaries containing UTF-8, but some things work
with lists of bytes and other such bullshit. IO lists, for one, are
generally deep lists of bytes and binaries, but not always. In general, you
have to be either very careful with Unicode wrangling, or just give up like
many others did before you.
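
A small sketch of why the list-of-characters and list-of-bytes views get
confused. The module name text_forms is hypothetical; the calls into the
standard unicode module are real. It only shows one word with one
non-ASCII character, so treat it as an illustration rather than a recipe.

```erlang
-module(text_forms).
-export([demo/0]).

demo() ->
    %% "héllo" as a list of Unicode code points (16#E9 is é).
    Chars = [$h, 16#E9, $l, $l, $o],
    %% The same text as a UTF-8 encoded binary; é takes two bytes.
    Utf8 = unicode:characters_to_binary(Chars),
    %% The UTF-8 bytes as a list, which is NOT the same list as Chars.
    Bytes = binary_to_list(Utf8),
    %% Decoding the binary gives the original code points back.
    Chars = unicode:characters_to_list(Utf8),
    {length(Chars), byte_size(Utf8), Bytes}.
```

text_forms:demo() should give {5, 6, [104,195,169,108,108,111]}: five
characters, six bytes, and a byte list that a careless function would
happily treat as a six-character Latin-1 string.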

On 11 Jan 2017 at 3:12, "Richard A. O'Keefe" <ok@REDACTED> wrote:

>
>
> On 11/01/17 7:58 AM, Bhag Chandra wrote:
>
>> 1) "Strings in Erlang are internally treated as a list of integers of
>> each character's ASCII values, this representation of string makes
>> operations faster. For example, string concatenation is constant time
>> operation in Erlang."  Can someone explain why?
>>
>
> Pretty much all of this is somewhere between misleading and confusing.
> Mind you, pretty much everything about strings in general is
> somewhere between confusing and misguided, whatever the programming
> language, especially since the introduction of Unicode.
>
> The closest thing to the truth is to say that Erlang has no native
> string datatype whatsoever.  Instead, there are several general purpose
> data types that are pressed into service to fake it.
>
> (1) An atom is a unique string.  There's more to say about them,
> but the general rule is DON'T use them for strings.
>
> (2) A list of character codes is stringy.  Once upon a time character
> codes were 8-bit quantities, so taking two computer words per character
> was sneered at.  Unicode has a 21-bit character set these days.  (Not
> only that, it's now racist and sexist, so that you can distinguish
> between pink/yellow male/female astronaut, I kid you not.  Why do we
> need a character "man in business suit levitating"?)  The astral planes
> are increasingly being populated, "Version 9.0 adds exactly 7,500
> characters, for a total of 128,172 characters."
>
> Let X = "1...m" and Y = "1...n".
> Then X ++ Y concatenates them in O(m) time.
> NOT O(1).
>
> (3) For many purposes, a *tree* of numbers and pairs can be used to
> hold text.  It is *this* that allows O(1) concatenation:
> [X|Y] *isn't* the string "1...m1...n" but can in certain contexts
> *represent* it.  You can, for example, build up a text of total
> size S in O(S) time and then flatten it (or write it or transmit it)
> in O(S) time.
>
> Look up the 'iolist()' type in Erlang documentation.
>
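
Points (2) and (3) side by side, as a minimal sketch. The module name
iolist_concat and the sample strings are made up; ++, list consing and
iolist_to_binary/1 are standard Erlang.

```erlang
-module(iolist_concat).
-export([demo/0]).

demo() ->
    X = "hello, ",
    Y = "world",
    %% ++ copies every cell of X, so it costs O(length(X)).
    Flat = X ++ Y,
    %% Consing builds one new list cell and copies nothing: O(1).
    %% The result is a nested structure, not a flat string.
    Tree = [X | Y],
    %% Binaries can be mixed into the same structure.
    Mixed = [Tree, <<"!">>],
    %% Flattening happens once, at the end, in time linear in the total size.
    Flat = binary_to_list(iolist_to_binary(Tree)),
    {Flat, Tree =:= Flat, iolist_to_binary(Mixed)}.
```

iolist_concat:demo() should give {"hello, world", false,
<<"hello, world!">>}: the tree is not equal to the flat string, yet it
flattens to the same text.
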
> (4) The 'binary' type was originally a byte string and is now a
> bit string.  A byte sequence can of course represent a string,
> with the encoding of the text being provided by context.  It
> used to be common to use Latin-1, it's now common to use UTF-8.
> There is library code to help with this, but frankly, Unicode
> is so appallingly complex that there will probably never be
> enough library support (in any language).
>
> Concatenating byte strings of length m and n costs O(m+n).
>
> (5) A tree of numbers and pairs could also contain binaries.
>
> The important thing about strings is that they are good at
> accepting data you don't care to inspect further, storing it,
> and giving it back later, and LOUSY at almost anything else,
> in EVERY programming language.
>
> One of the ways that they are lousy is that thanks to Unicode's
> rules, there is *structure* in strings (which were historically
> unstructured), so that you can have a well formed string of
> 2 characters, split it into two 1-character strings, and find
> that at least one of those strings is no longer legal.
> (Emoji flags, for example.)
>
>
>
>> 2) "It makes sense to use Erlang only where system's availability is
>> very high".  Is it not a very general requirement of most of the
>> systems? Whatsapp to Google to FB to Amazon to Paypal to Barclays etc
>> they all are high availability systems, so we can use Erlang in all of
>> them?
>>
>
> I do not think many people in this mailing list would say
> "it makes sense to use Erlang ONLY where availability [must be] high."
>
> People use Erlang (and other languages running on the Erlang VM,
> such as LFE and Elixir) for all sorts of things.  Yes, Erlang is
> not as fast as other languages.
>
> Let's have some perspective on that, though.
> Just today I measured four programs to solve the same problem,
> using the same structure, written in four languages.
> C          0.04 sec    -- has types
> Smalltalk  0.25 sec    -- lacks them
> Java       0.65 sec    -- has types
> Python     2.98 sec    -- lacks them.
> Have figures like that ever stopped anyone using Python?
> Have they even encouraged people to switch to Smalltalk?
>
> People use a programming language when it lets them get
> the job at hand done to an adequate standard.
> These days this is as much about infrastructure such as
> profiling, test coverage analysis, test frameworks,
> documentation tools, libraries, interface support (JSON,
> XML, ASN.1, Protobufs, whatever) as it is about the
> language as such.
>
> I find that it makes sense to use Erlang whenever concurrency
> is important/useful, because it's so much easier to get
> concurrent programs right than in Java or C11 or pretty much
> anything else (although Ada comes close).  There are of course
> plenty of libraries for other languages that claim to add
> Erlang-style abilities to those languages; what they don't
> mention is that you cannot remove dangers by adding libraries.
> For example, gets() has been deprecated in C; the OpenBSD
> linker tells you off if you use it.  There are safer alternatives.
> But nothing can stop you writing your own version of gets(),
> and the OpenBSD linker has no idea that mygets() is just as
> dangerous as gets() ever was.
>
>> 3) "Every message which is sent to a process, goes to the mailbox of
>> that process. When process is free, it consumes that message from
>> mailbox". So how exactly does process ask from the mailbox for that
>> message? Is there a mechanism in a process' memory which keeps polling
>> its mailbox. I basically want to understand how message is sent from
>> mailbox to my code in process.
>>
>
> "When process is free" should presumably be
> "When that process is RUNNING and FEELS LIKE checking its mailbox,
>  that message MAY BE consumed if it is one the process WANTS to
>  consume."
>
> A process reads from its mailbox using the 'receive' construction,
> described in for example Learn You Some Erlang For Great Good
> (http://learnyousomeerlang.com) which I recommend.  This happens
> when the code decides it is time to try to receive a message.
>
> If you know about message passing in UNIX,
> Pid ! Msg       is like msgsnd(    Queue, Msg, Msg_Size, Flags)
>                      or mq_send(   Queue, Msg, Msg_Size, Prio)
> receive ... end is like msgrcv(    Queue, Msg, Msg_Size, Type, Flags)
>                      or mq_receive(Queue, Msg, Msg_Size, &Prio)
>
> That is, sending happens when there is something EXPLICIT in the
> code to make it happen, and receiving happens when there is
> something EXPLICIT in the code to make it happen, and there is
> buffering in between so things don't have to be simultaneous.
>
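
A minimal sketch of that explicitness and of the buffering in between. The
module name mailbox_demo and the message shapes are invented for the
example; it sends to its own process just to keep everything in one
function.

```erlang
-module(mailbox_demo).
-export([demo/0]).

demo() ->
    Self = self(),
    %% Both sends return immediately; the messages wait in the mailbox
    %% because no receive has been executed yet.
    Self ! {note, first},
    Self ! {urgent, second},
    %% Nothing reaches the code until a receive asks for it, and the
    %% pattern decides which queued message is taken. Here the later
    %% message is picked before the earlier one.
    A = receive {urgent, U} -> U end,
    B = receive {note, N} -> N end,
    {A, B}.
```

mailbox_demo:demo() should give {second, first}. There is no polling
anywhere: each message sits in the mailbox until a receive with a matching
pattern runs.
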
>> 4) We say that a message is passed from process A to process B by simply
>> using a bang (!) character, but what happens behind the scenes to pass
>> this message? Do both processes establish a tcp connection first and
>> then pass message or what?
>>
>
> First off, we have Erlang "nodes" that are Unix or Windows "processes",
> and inside them we have Erlang "processes" that are like Unix or Windows
> "threads" except much cheaper; the word "fibres" is sometimes used to
> describe similar things in those environments.
>
> Erlang processes inside nodes on different machines may well use
> TCP connections automatically set up between the nodes.  Or they
> could use something else.  They could in principle use Infiniband
> or carrier pigeons.
>
> Erlang processes inside different nodes on the same machine
> could use any IPC facility provided by the host OS, such as
> System V message queues, POSIX message queues, pipes, UNIX sockets,
> ...
>
> Erlang processes inside the same node will use shared memory,
> which might or might not involve copying, depending on the Erlang
> version.  Whatever happens, it's likely to be cheaper than TCP.
>
>> 5) At 30:25 in this video ( https://youtu.be/YaUPdgtUYko?t=1825 ) Mr.
>> Armstrong is talking about the difference between the context switching
>> overhead between OS threads and Erlang processes. He says, thread
>> context switching is of order 700 words but Erlang process context
>> switching is ... ?
>>
>
> He said that OS thread switching MOVES about 700 words,
> while Erlang process switching involves THREE REGISTERS.