[erlang-questions] Atom Unicode Support

José Valim jose.valim@REDACTED
Sat Jan 30 21:04:19 CET 2016


Hello everyone,

Back in 2012, the OTP team decided to improve Unicode Support in source
files
<http://webcache.googleusercontent.com/search?q=cache:35XTts_7luUJ:www.erlang.org/news/year/2012+&cd=2&hl=en&ct=clnk&gl=pl>.
Since I couldn't find a non-cached link (sorry), I will paste some bits
here for convenience:

> The default file encoding will be ISO-Latin-1 in R16, but will be changed
> to UTF-8 in R17. (...) Source code will need no change in R16, but adding a
> comment denoting ISO-Latin-1 encoding will ensure that the code can be
> compiled with the R17 compiler. Adding a comment denoting UTF-8 encoding
> will allow for Unicode characters with code points > 255 in string and
> character literals in R16. The same comment will allow for atoms containing
> any Unicode code point in R18. From this follows that function names also
> can contain any Unicode code point in R18.

There was a lot of progress, and most of those changes were implemented in
R16 and R17, which is excellent! From my understanding, there is also
runtime support for UTF-8 encoded atoms:

1> erlang:binary_to_term(<<131,100,0,4,"josé">>).
josé
2> erlang:binary_to_term(<<131,118,0,5,"josé"/utf8>>).
josé

Even more interesting, those two atoms are equal because the VM translates
the UTF-8 encoded one to Latin-1 when possible:

3> v(1) == v(2).
true
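
The two byte layouts used in the session above can be built by hand. The sketch below is only an illustration: the module and function names are made up for this example, and it assumes the tag values shown in the session (131 as the version byte, 100 for a Latin-1 atom, 118 for a UTF-8 atom, each followed by a 16-bit length).

```erlang
-module(atom_ext_sketch).
-export([latin1_atom_ext/1, utf8_atom_ext/1, same_atom/1]).

%% Hypothetical helpers, not part of OTP.
%% 131 is the external term format version byte.
%% 100 tags a Latin-1 encoded atom, 118 tags a UTF-8 encoded atom.
%% Both tags are followed by a 16-bit big-endian byte length.

%% CodePoints is a list of integers in 0..255,
%% e.g. [106,111,115,233] for "josé".
latin1_atom_ext(CodePoints) ->
    Name = list_to_binary(CodePoints),
    <<131, 100, (byte_size(Name)):16, Name/binary>>.

%% CodePoints may be any list of Unicode code points.
utf8_atom_ext(CodePoints) ->
    Name = unicode:characters_to_binary(CodePoints, unicode, utf8),
    <<131, 118, (byte_size(Name)):16, Name/binary>>.

%% Decodes both encodings and compares the resulting atoms.
same_atom(CodePoints) ->
    binary_to_term(latin1_atom_ext(CodePoints)) =:=
        binary_to_term(utf8_atom_ext(CodePoints)).
```

Calling atom_ext_sketch:same_atom([106,111,115,233]) mirrors the comparison made in step 3 of the session.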


Atoms that cannot be translated to Latin-1 also work:

4> erlang:binary_to_term(<<131,118,0,9,227,131,142,227,130,175,227,130,185>>).
'ノクス'
5> erlang:atom_to_binary(v(4), utf8).
<<227,131,142,227,130,175,227,130,185>>


So while most of the runtime support just works™,
erlang:binary_to_atom(Binary, utf8) still fails on UTF-8 binaries that
contain code points above 255 (atoms created this way are always encoded
as Latin-1). I have also tried to compile code using UTF-8 encoded atoms,
like the ones above, but the compiler toolchain complained.
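
A quick way to probe this restriction on a given release is to wrap the call and see whether it raises. This is a minimal sketch with a made-up module name. It assumes the failure surfaces as a badarg error, and the result will differ between releases.

```erlang
-module(atom_limit_sketch).
-export([try_binary_to_atom/1]).

%% Hypothetical helper, not part of OTP.
%% Returns {ok, Atom} when the conversion succeeds and
%% {error, badarg} when the runtime rejects the binary.
try_binary_to_atom(Utf8Bin) when is_binary(Utf8Bin) ->
    try
        {ok, erlang:binary_to_atom(Utf8Bin, utf8)}
    catch
        error:badarg -> {error, badarg}
    end.
```

With the restriction described above in place, atom_limit_sketch:try_binary_to_atom(<<227,131,142,227,130,175,227,130,185>>) would be expected to return {error, badarg}, while a plain ASCII binary such as <<"jose">> converts normally.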

Anthony Ramine also noticed that the current BEAM file format does not
allow UTF-8 encoded atoms, so we would need to introduce a new chunk to
host them (which wouldn't break older BEAM files).
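
For anyone who wants to look at how atoms are stored today, the chunks of a compiled module can be listed with beam_lib. The sketch below uses a made-up module name and assumes the module under inspection was loaded from a .beam file on disk, so that code:which/1 returns a path.

```erlang
-module(beam_chunks_sketch).
-export([chunk_ids/1, atom_table/1]).

%% Hypothetical helpers, not part of OTP.

%% Lists the chunk identifiers found in the module's .beam file.
chunk_ids(Module) ->
    Path = code:which(Module),
    Info = beam_lib:info(Path),
    {chunks, Chunks} = lists:keyfind(chunks, 1, Info),
    [Id || {Id, _Start, _Size} <- Chunks].

%% Returns the module's atom table as a list of {Index, Atom} pairs.
atom_table(Module) ->
    Path = code:which(Module),
    {ok, {Module, [{atoms, Atoms}]}} = beam_lib:chunks(Path, [atoms]),
    Atoms.
```

For example, beam_chunks_sketch:chunk_ids(lists) shows which chunk currently holds the atom table, which is where a new chunk for UTF-8 atoms would sit alongside the others.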

With all that said, are there any plans to support UTF-8 encoded atoms
in Erlang R19? If the feature is desired but not planned, I would love to
contribute the compiler and bytecode changes above, although I will likely
need some guidance. If that is an option, I would love to get in touch.

Thank you!

*José Valim*
www.plataformatec.com.br
Skype: jv.ptec
Founder and Director of R&D