[erlang-questions] strings vs binaries
Fred Hebert
mononcqc@REDACTED
Wed Aug 19 15:48:44 CEST 2015
On 08/19, zxq9 wrote:
>The protocols involved in this case are often textual and it's somewhat
>rare to find a textual protocol that isn't 100% ASCII.
This is less and less true and I wouldn't count on it for long. Hell, for
the sake of argument, I'd say almost no protocol is 100% ASCII
anymore.
Just think of all the protocols that include some form of JSON or XML;
they most likely require unicode by specification or default.
HTTP has mostly ASCII header names, but if I recall, header values
(other than those in the RFC) have their encoding unspecified and people
will stick unicode stuff in there. But most stuff in protocols that
piggyback over HTTP ends up being in the payloads, where again, JSON or
XML are common.
Then think of other protocols: they may use a binary format that is nicer
to work with (in Erlang) than most text protocols: the lengths are known
(and hopefully specified in bytes, not characters), values are tagged,
and so on. You get candidates like Thrift (expects utf8 by default in
strings iirc), protocol buffers (lets you define types), BERT (has
binaries and lists, so undefined), BSON (strings are utf8), etc.
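To make the bytes-versus-characters point concrete, here is a minimal sketch using only the standard `unicode` module. The module and function names (`utf8_len`, `byte_and_char_len/1`) are made up for illustration, and "characters" here means Unicode code points, not grapheme clusters.

```erlang
-module(utf8_len).
-export([byte_and_char_len/1, demo/0]).

%% Returns {ok, Bytes, CodePoints} for a valid UTF-8 binary,
%% or {error, invalid_utf8} if the binary does not decode cleanly.
byte_and_char_len(Bin) when is_binary(Bin) ->
    case unicode:characters_to_list(Bin, utf8) of
        Chars when is_list(Chars) ->
            {ok, byte_size(Bin), length(Chars)};
        _ErrorOrIncomplete ->
            {error, invalid_utf8}
    end.

demo() ->
    %% "hello" with U+00E9 in place of the "e", written as code points
    %% so the example does not depend on the source file encoding.
    Bin = unicode:characters_to_binary([$h, 16#E9, $l, $l, $o]),
    byte_and_char_len(Bin).   %% {ok, 6, 5}: 6 bytes, 5 code points
```

A length prefix of 6 and a length prefix of 5 describe the same string here, which is why it matters that the protocol says which one it means.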
Even in most of these, unicode encodings are very common. Now you may
get away with not caring about encoding for a while, but it does not
mean you're implementing the protocol correctly or that it will never
happen.
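As a rough illustration of how ignoring the encoding can bite later, the sketch below truncates a UTF-8 binary at a fixed byte offset, which works for ASCII input and silently breaks otherwise. The names (`utf8_truncate`, `is_valid_utf8/1`) are hypothetical; only `unicode` and `binary` from the standard library are used.

```erlang
-module(utf8_truncate).
-export([is_valid_utf8/1, demo/0]).

%% unicode:characters_to_list/2 returns a list on success and an
%% {error, ...} or {incomplete, ...} tuple when the bytes do not decode.
is_valid_utf8(Bin) when is_binary(Bin) ->
    is_list(unicode:characters_to_list(Bin, utf8)).

demo() ->
    %% "h", U+00E9, "llo": the second character takes two bytes in UTF-8.
    Bin = unicode:characters_to_binary([$h, 16#E9, $l, $l, $o]),
    %% Naive "first two characters" done as "first two bytes":
    %% this cuts the two-byte character in half.
    Naive = binary:part(Bin, 0, 2),
    {is_valid_utf8(Bin), is_valid_utf8(Naive)}.   %% {true, false}
```

With pure ASCII input both checks pass, so code like this can look correct for a long time before the first non-ASCII value shows up.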
DNS is a fun example where yes, the protocol is ASCII only, but they
ended up having to put in place IDNA encoding (punycode) to allow
unicode through.
The days of ASCII-only are almost over*, and the sun is setting really
fast. There are just too many languages and cultures out there to want to
be held back by ASCII.
* of course there are legacy systems that nothing will kill or bring
forward in time, but what can you do.