[erlang-questions] Erlang basic doubts about String, message passing and context switching overhead

Richard A. O'Keefe
Tue Jan 17 07:20:18 CET 2017



On 14/01/17 1:20 AM, Jesper Louis Andersen wrote:
> When I wrote my post, I was--probably incorrectly--assuming the older
> notion of a "String" where the representation is either ASCII or
> something like ISO-8859-15. In this case, a string coincides with a
> stream of bytes.

Just to be picky, an ASCII string is a sequence of 7-bit code points,
not bytes.  I have had the (dis)pleasure of working with an operating
system where for some unfathomable reason all ASCII characters were
coded as 128+official code point.  The OS is dead and they *deserved*
it.  *And* in ASCII, not all sequences are legal.  Technically,
according to the ASCII specification, NUL and DEL are ok on the wire,
but are not allowed inside a string.  (You thought Kernighan and
Ritchie came up with NUL termination all by themselves?  Nope, it was
official well before C: strings should not contain NUL or DEL.)  But
it doesn't end there.  ASCII supported multibyte characters.
Yes, really.  Some of the characters included in ASCII make sense
only when you realise that they were *supposed* to be positionable
diacriticals.  So for example
é = ' BS e
ô = ^ BS o (that's why ^ is in ASCII)
ç = , BS c
ñ = ~ BS n (that's why ~ is in ASCII)
ü = " BS u (so there are three uses for ")
This is for real.  It's not something I made up.  It's one of the
things that CHANGED between ASCII and ISO Latin 1.  ASCII *does*
allow making composite characters by overstriking and Latin 1 *doesn't*.
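
To make the overstrike convention concrete, here is a minimal sketch in
Python that turns the sequences listed above into modern precomposed
characters.  The function name and the choice of Unicode combining marks
are assumptions made for this illustration; nothing in the ASCII
standard prescribes this mapping.

```python
import unicodedata

# ASCII diacritical -> Unicode combining mark (a mapping assumed for this sketch).
COMBINING = {
    "'": "\u0301",  # acute accent
    "^": "\u0302",  # circumflex
    ",": "\u0327",  # cedilla
    "~": "\u0303",  # tilde
    '"': "\u0308",  # diaeresis
}
BS = "\b"  # backspace, the overstrike control character


def compose_overstrikes(text):
    """Rewrite 'diacritical BS letter' triples as single accented characters."""
    out = []
    i = 0
    while i < len(text):
        is_triple = (
            i + 2 < len(text)
            and text[i] in COMBINING
            and text[i + 1] == BS
            and text[i + 2].isalpha()
        )
        if is_triple:
            # Base letter first, then the combining mark, as Unicode expects.
            out.append(text[i + 2] + COMBINING[text[i]])
            i += 3
        else:
            out.append(text[i])
            i += 1
    # NFC folds letter + combining mark into the precomposed character.
    return unicodedata.normalize("NFC", "".join(out))
```

A plain apostrophe that is not followed by BS passes through untouched,
so ordinary English text is unaffected.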

So even ASCII was a much more subtle thing than most people realise.
Anglophone monoglots didn't need accents, so software developed in
the Anglosphere tended to pretend that overstrikes didn't exist,
until the pretence became a de facto reality.

(Another thing the Unix designers have been slammed for is using LF
as line terminator instead of CR+LF.  But in fact "New Line" was one
of the legal readings for LF in the original ASCII design.  Oh
yeah, you could use CR for overstrikes as well...)

Anyone else remember using ASCII on TOPS-10?  Five 7-bit characters
per 36-bit word, so not every sequence of words could be represented
as a sequence of ASCII characters.  And the uses people found for
the extra bit?  Or the programming languages that let you specify
the byte size for an input file from 1 to 36 bits?  The Good Old
Days are *now*.
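
For anyone who never saw it, the arithmetic is 5 x 7 = 35, which leaves
one bit of each 36-bit word spare.  A rough sketch in Python follows.
It assumes the characters are left-justified and the spare bit is the
low-order one, which is how I recall the usual layout; the function
names are made up for the illustration.

```python
def pack_word(chars, spare=0):
    """Pack five 7-bit characters plus one spare bit into a 36-bit integer."""
    if len(chars) != 5:
        raise ValueError("exactly five characters per word")
    word = 0
    for ch in chars:
        code = ord(ch)
        if code > 0x7F:
            raise ValueError("not a 7-bit character: %r" % ch)
        word = (word << 7) | code
    # Assumed layout: the 35 character bits sit above the single spare bit.
    return (word << 1) | (spare & 1)


def unpack_word(word):
    """Return (five_characters, spare_bit) for a 36-bit integer."""
    spare = word & 1
    bits = word >> 1
    chars = [chr((bits >> shift) & 0x7F) for shift in (28, 21, 14, 7, 0)]
    return "".join(chars), spare
```

Two words that differ only in the spare bit unpack to the same five
characters, which is exactly why a sequence of words carries more
information than a sequence of characters can.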

> Data needs parsing. A lot of data comes in as some kind of stringy
> representation: UTF-8, byte array (binary), and so on.

And some of it comes in fragments which need reassembly, and
some of it comes encrypted and/or compressed.

I am currently struggling with a class of 3rd year students
who have the ingrained belief, resulting from years of
abuse -3dw exposure to Java, that
  - it is a good idea to turn anything that should be the key
    of a hash table into a string
  - strings are cheap
  - building very long strings by repeated concatenation of
    small strings is a wonderful idea
  ...
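
A small sketch of what goes wrong, written in Python rather than Java
to keep it short.  The function names are invented for the example.
The first function is the habit I am complaining about; the other two
show a structured key and a single join in place of repeated
concatenation.

```python
def string_key(x, y):
    # The habit in question: flatten a compound key into a string.
    # Without a separator, distinct keys can collide.
    return str(x) + str(y)


def tuple_key(x, y):
    # A structured key keeps the components apart and needs no formatting.
    return (x, y)


def build_report(lines):
    # One join over all the pieces, instead of growing a string
    # piece by piece inside a loop.
    return "\n".join(lines)


counts = {}
counts[tuple_key(1, 23)] = "first"
counts[tuple_key(12, 3)] = "second"   # a different key, as intended
```

With string_key, both (1, 23) and (12, 3) become "123", so the second
entry would silently overwrite the first.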

I'm trying to get across to them the idea that
  - there are data representations you use at the BOUNDARIES
    of your program for input/output
  - there are data representations you use in your
    program for PROCESSING
  - there are different design forces involved, so
  - good representations for one purpose are seldom good
    representations for the other purpose.

For example, if you're going to stick with strings,
if someone hands you a "UTF-8" string and expects you
to give it back later, you should almost certainly
give it back VERBATIM, absolutely unchanged, but if
you want to process it, you probably want to convert
your working copy to some canonical form (which might
or might not be a string).
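
One way to picture this, as a sketch only: keep the bytes you were
handed and derive a separate working copy from them.  The class name is
hypothetical, and NFC is just one possible choice of canonical form.

```python
import unicodedata


class ReceivedText:
    """Holds the bytes exactly as received plus a canonical working copy."""

    def __init__(self, raw):
        # Boundary representation: never modified, handed back verbatim.
        self.raw = raw
        # Processing representation: decoded and normalised (NFC assumed here).
        self.canonical = unicodedata.normalize("NFC", raw.decode("utf-8"))

    def give_back(self):
        return self.raw


# Two different byte sequences that both display as "é".
decomposed = ReceivedText("e\u0301".encode("utf-8"))
precomposed = ReceivedText("\u00e9".encode("utf-8"))
```

The two objects compare equal on their canonical field for processing
purposes, yet each still gives back its own distinct original bytes.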


