[erlang-questions] strings vs binaries

Fred Hebert mononcqc@REDACTED
Wed Aug 19 15:48:44 CEST 2015


On 08/19, zxq9 wrote:
>The protocols involved in this case are often textual and it's somewhat
>rare to find a textual protocol that isn't 100% ASCII.

This is less and less true, and I wouldn't count on it for long. Hell,
for the sake of argument, I'd say almost no protocol is 100% ASCII
anymore.

Just think of all the protocols that include some form of JSON or XML;
they most likely require Unicode by specification or by default.
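
To make the JSON point concrete, here is a small sketch in Python (picked
only for illustration; the key and value are made up). Even when the bytes
on the wire are escaped down to pure ASCII, the decoded value is still
non-ASCII text that the receiving code has to handle:

```python
import json

# Hypothetical payload; "\u00eb" is e with diaeresis.
doc = {"name": "Zo\u00eb"}

# Default json.dumps escapes non-ASCII characters, so the wire form
# happens to be pure ASCII...
escaped = json.dumps(doc)
wire_ascii = escaped.encode("ascii")  # would raise if any non-ASCII remained

# ...but the same document can also travel as raw UTF-8.
wire_utf8 = json.dumps(doc, ensure_ascii=False).encode("utf-8")

# Either way, the receiver ends up with the same Unicode string.
decoded_from_ascii = json.loads(wire_ascii.decode("ascii"))
decoded_from_utf8 = json.loads(wire_utf8.decode("utf-8"))
```

Both decoded documents are equal, so treating the protocol as ASCII-only
because one sender escapes everything would be an accident of that sender,
not a property of the format.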

HTTP has mostly ASCII header names, but if I recall, header values
(other than those defined in the RFC) have their encoding unspecified,
and people will stick Unicode stuff in there. But most stuff in
protocols that piggyback on HTTP ends up being in the payloads, where
again, JSON or XML are common.

Then think of other protocols: they may use a binary format that is
nicer to work with (in Erlang) than most text protocols: the lengths are
known (and hopefully specified in bytes, not characters), values are
tagged, and so on. You get candidates like Thrift (expects UTF-8 by
default in strings iirc), Protocol Buffers (lets you define types), BERT
(has binaries and lists, so the encoding is undefined), BSON (strings
are UTF-8), etc.
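
The bytes-versus-characters distinction can be sketched as follows. This
is an illustration in Python, not the framing of any of the formats named
above; the 4-byte big-endian length prefix and the function names are
assumptions made up for the example:

```python
import struct
from typing import Tuple

def encode_frame(text: str) -> bytes:
    """Prefix the UTF-8 payload with its length in bytes (not characters)."""
    payload = text.encode("utf-8")
    return struct.pack(">I", len(payload)) + payload

def decode_frame(data: bytes) -> Tuple[str, bytes]:
    """Read one frame; return the decoded text and any leftover bytes."""
    (size,) = struct.unpack(">I", data[:4])
    payload, rest = data[4:4 + size], data[4 + size:]
    return payload.decode("utf-8"), rest

word = "na\u00efve"           # 5 characters, one of them non-ASCII
frame = encode_frame(word)    # the length prefix says 6, the byte count
text, rest = decode_frame(frame + b"next")
```

Had the prefix carried the character count (5) instead, the reader would
have cut the payload one byte short, in the middle of a multi-byte
character, and the rest of the stream would be misaligned.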

Even in most of these, Unicode encodings are very common. Now you may
get away with not caring about encoding for a while, but that does not
mean you're implementing the protocol correctly, or that non-ASCII data
will never show up.

DNS is a fun example where yes, the protocol is ASCII only, but they
ended up having to put in place IDNA encoding (Punycode) to allow
Unicode through.
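
For a quick look at what that means in practice, here is a sketch using
the codecs that ship with Python's standard library. Note that the
built-in "idna" codec follows the older IDNA 2003 rules, and the host
name is just an example:

```python
# "\u00fc" is u with diaeresis.
label = "b\u00fccher"

# Raw Punycode of a single label: ASCII letters first, then the encoded rest.
punycode = label.encode("punycode")

# The IDNA form adds the "xn--" prefix to each non-ASCII label,
# which is what actually goes into a DNS query.
hostname = "b\u00fccher.example".encode("idna")

# Decoding gives back the Unicode name a user would expect to see.
roundtrip = hostname.decode("idna")

print(punycode)   # b'bcher-kva'
print(hostname)   # b'xn--bcher-kva.example'
```

So the wire protocol stays ASCII, but anything that displays, compares,
or validates host names still has to deal with Unicode at the edges.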

The days of ASCII-only are almost over*, and the sun is setting really
fast. There are just too many languages and cultures out there to want
to be held back by ASCII.

* of course there are legacy systems that nothing will kill or bring 
forward in time, but what can you do.


