[erlang-questions] unicode: what about printing terms?

Patrik Nyblom pan@REDACTED
Tue Oct 30 16:16:26 CET 2012


On 10/22/2012 05:01 PM, Vlad Dumitrescu wrote:
> Hi!
>
> On Mon, Oct 22, 2012 at 4:42 PM, Patrik Nyblom <pan@REDACTED> wrote:
>> On 10/22/2012 03:28 PM, Vlad Dumitrescu wrote:
>> How is printing going to work when atoms/variables can be unicode?
>>
>> Well, when reading a binary, you need to know if it's UTF-8 or latin1, but
>> you know that a "string" (a list interpreted as text) or an atom in R18
>> always contains "Unicode" (latin1 codepoints are a subset of Unicode
>> codepoints). The io module translates things to the Erlang Unicode
>> representation if needed and sends it to the io_server. The io_server in
>> turn decides how to output this: either in UTF-8 if it's a Unicode-capable
>> terminal (or a Werl window, where the driver for the window then converts it
>> further to 16-bit calls... *shudder*) or in any encoding set for a file. If
>> the file is restricted to latin1, Unicode characters > 255 cannot be output
>> (exception error:no_translation); if it's an eight-bit terminal they will be
>> output as \{...}. The need for ~ts is solely for how to interpret the
>> *input* data; the io_server is responsible for translating it to the output
>> device.
>>
>> Maybe the two documents in the stdlib user's guide:
>> http://www.erlang.org/doc/apps/stdlib/users_guide.html
>> can help clear up the things I seem to be unable to explain properly.
> I used http://www.erlang.org/doc/man/io.html#fwrite-1 as reference and
> there ~s and ~ts are documented as options for output... I think the
> problem is that we're talking about slightly different things :-)
>
> So it means that for files, the encoding is defined when opening them
> and for the console it is whatever the environment sets it to (and
> good luck if there's a mismatch with the sent data)? When debugging a
> live telecom node one often has to go through several gateways, and
> not all of them have new OS versions with UTF-8 support; I hope that
> they just pass the data as-is and not mangle it.
You can set the encoding of the terminal yourself with io:setopts/{1,2}
if you want something other than what the environment states. You can also
always run a latin1 terminal if you want to. If the lines are not
eight-bit clean, you run into the same trouble with characters > 127
regardless of encoding.
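
To make the ~s/~ts distinction and the io:setopts point concrete, here is a
minimal sketch. The module name ts_demo and all function names in it are made
up for illustration; only the io, io_lib, unicode and proplists calls are
real library functions. It shows that ~s and ~ts differ in how the *input*
binary is interpreted, and how one might inspect or override the terminal
encoding.

```erlang
-module(ts_demo).
-export([as_latin1/1, as_utf8/1, fits_latin1/1,
         terminal_encoding/0, force_latin1/0]).

%% ~s treats every byte of a binary as one latin1 character.
as_latin1(Bin) when is_binary(Bin) ->
    lists:flatten(io_lib:format("~s", [Bin])).

%% ~ts decodes the binary as UTF-8 into Unicode code points.
as_utf8(Bin) when is_binary(Bin) ->
    lists:flatten(io_lib:format("~ts", [Bin])).

%% True if the character data can be represented in latin1, i.e. if a
%% device restricted to latin1 would be able to output it.
fits_latin1(Chars) ->
    is_binary(unicode:characters_to_binary(Chars, unicode, latin1)).

%% Encoding the standard_io device currently uses (latin1 or unicode).
terminal_encoding() ->
    case io:getopts(standard_io) of
        Opts when is_list(Opts) -> proplists:get_value(encoding, Opts);
        {error, _} = Error -> Error
    end.

%% Override whatever the environment suggested (hypothetical helper).
force_latin1() ->
    io:setopts(standard_io, [{encoding, latin1}]).
```

For example, <<208,150>> is the UTF-8 encoding of U+0416, so
ts_demo:as_utf8(<<208,150>>) gives the single code point [1046], while
ts_demo:as_latin1(<<208,150>>) gives the two characters [208,150].
ts_demo:fits_latin1([1046]) is false, which is the situation where a
latin1-restricted file cannot output the character.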
>
> And when encoding terms to external format, how will atom names be
> encoded? We must be able to read them from external programs too (Java
> nodes, C nodes, etc) and from older versions of Erlang.
We will of course update jinterface, erl_interface/ei and IC, as we
always do when we extend the external format or the distribution
protocol. There is already an unused tag for "large" atoms in the
external format, so it is even technically simple.
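
If you want to see which atom tag a particular release actually emits, a
small sketch like the following can help. The module name atom_tag and its
functions are invented for illustration. The tag numbers are the ones listed
in the External Term Format documentation, not something stated in this
mail, and which of them term_to_binary/1 picks depends on the OTP version,
so the code only reports the tag instead of assuming one.

```erlang
-module(atom_tag).
-export([tag_of/1, roundtrip/1]).

%% Name the external-format tag that term_to_binary/1 uses for Atom.
%% 131 is the version byte that starts every external term.
tag_of(Atom) when is_atom(Atom) ->
    <<131, Tag, _/binary>> = term_to_binary(Atom),
    case Tag of
        100 -> atom_ext;             %% latin1 name, 2-byte length
        115 -> small_atom_ext;       %% latin1 name, 1-byte length
        118 -> atom_utf8_ext;        %% UTF-8 name, 2-byte length
        119 -> small_atom_utf8_ext;  %% UTF-8 name, 1-byte length
        _   -> {unknown_tag, Tag}
    end.

%% An atom must survive encoding and decoding on the same node.
roundtrip(Atom) when is_atom(Atom) ->
    Atom =:= binary_to_term(term_to_binary(Atom)).
```

An external program (Java node, C node) has to understand whichever tag
tag_of/1 reports, which is why the interface libraries need updating
together with the emulator.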
> regards,
> Vlad
Cheers,
/Patrik