[erlang-questions] cookbook entry #1 - unicode/UTF-8 strings

Angel J. Alvarez Miguel clist@REDACTED
Mon Oct 24 16:51:21 CEST 2011


On Friday, 21 October 2011 21:28:39 Joe Armstrong wrote:
> On Fri, Oct 21, 2011 at 2:35 PM, Richard Carlsson <carlsson.richard@REDACTED> wrote:
> > On 10/21/2011 10:41 AM, Angel J. Alvarez Miguel wrote:
> >> (I'm using Kate on openSUSE 11.4 x64 and Erlang/OTP R14B04 (erts-5.8.5),
> >> and my sources are in UTF-8)
> > 
> > No, don't make this mistake. To the Erlang compiler, your sources are in
> > Latin-1, plain and simple. As far as the compiler knows, you have
> > actually written "Ã³ Ã± Ã¼" and nothing else. When you print the string
> > with io:format, you are printing the Latin-1 text "Ã³ Ã± Ã¼" (the bytes
> > [195, 179, 32, 195, 177, 32, 195, 188]) to the standard output. That
> > your console re-interprets these bytes as "ó ñ ü" just means that you
> > have managed to fool the system for this particular use case.
> > 
> > (By the way, those characters are already in the Latin-1 charset, so you
> > don't *need* UTF-8 at all unless you have some additional characters you
> > want to use that are above 255 in Unicode.)
> > 
> > If/when Erlang supports other encodings in source code (this will
> > probably require adding a compiler flag for specifying the input
> > encoding), a string literal such as "ᚱ" should be equivalent to [5809],
> > not [225,154,177], just like your "óñü" should be equivalent to
> > [243,241,252] (which is what you would have got if your editor had been
> > set to Latin-1 to begin with).
> > 
> > One can think about it like this: taking an existing, working, Latin-1
> > source file, converting it to UTF-8 (or any other encoding), and
> > compiling it with a flag that informs the compiler what the input
> > encoding is, should not change the semantics of the program in any
> > respect whatsoever compared to compiling the original source file. Thus,
> > a string literal that today contains "ß" ([223]) in a plain old Latin-1
> > encoded Erlang source file must *always* mean [223] no matter what you
> > change the input encoding to.
> > 
> >>> Will "erlc foo.erl" automatically detect that foo.erl is unicode
> >>> encoded and do the right thing when scanning and tokenising strings?
> > 
> > No. Erlang source code is (currently) Latin-1 by definition. No matter
> > what your editor thinks it is using, the compiler will interpret the
> > bytes as Latin-1.
> 
> I hate to say this - but just about the only thing XML got right was
> the declaration
> 
>    <?xml version="1.0" encoding="UTF-8" standalone="no" ?>
> 
> Should we have
> 
>    -erlang("1.0","UTF-8","no"). :-)

Good idea!

Maybe we could just write our strings the classical way and, provided we specify how our files are encoded, epp could translate those plain strings into a new type such as

{string,[utf8,plain],"this is a new string type"}

(perhaps a proplist instead of just [utf8,plain]; note that a bare utf-8 is not a valid atom, so I write utf8 here)

The classical string functions would then pattern match on this new type, allowing correct transformations and property calculations in the presence of UTF-8, UTF-16, ...

Now you could write:
...
-erlang(...).

myfunc(Args) ->
	...
	Count = string:len("El tamaño no importa!"),
	...



Provided I had written your -erlang(...) declaration, epp would transform this into

	Count = string:len({string,[utf8,es_ES,plain],"El tamaño no importa!"}),

retaining Latin-1 compatibility in the compiler.
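
As a rough sketch of how a string function might dispatch on such a tagged value: everything here is made up for illustration (the module name tagged_string, the function len/1 and the utf8 property are my assumptions, nothing like this exists in OTP). The only real library calls are lists:member/2, list_to_binary/1 and unicode:characters_to_list/2. I assume the tuple carries the raw bytes exactly as the Latin-1 compiler read them from the file.

```erlang
-module(tagged_string).
-export([len/1]).

%% Hypothetical tagged string: {string, Props, Bytes}, where Bytes holds
%% the raw bytes exactly as the Latin-1 compiler read them from the file.
len({string, Props, Bytes}) when is_list(Props), is_list(Bytes) ->
    case lists:member(utf8, Props) of
        true ->
            %% Decode the UTF-8 bytes into code points before counting.
            %% Invalid UTF-8 makes characters_to_list/2 return an error
            %% tuple, so this sketch simply crashes in that case.
            length(unicode:characters_to_list(list_to_binary(Bytes), utf8));
        false ->
            length(Bytes)
    end;
%% Untagged (legacy) strings keep their current meaning.
len(Plain) when is_list(Plain) ->
    length(Plain).
```

With the ñ spelled out as its two UTF-8 bytes, tagged_string:len({string,[utf8,plain],"El tama" ++ [195,177] ++ "o no importa!"}) counts 21 characters, while the same list passed untagged counts 22 bytes.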

That might let us strengthen the basic string type without disturbing existing solutions
and legacy code, so you end up with a panoply of string handling solutions, from fastest/simplest to slowest/most complex:

binaries --> plain strings (lists of integers) --> complex strings (tagged values).
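
To make the three forms concrete, here is a small sketch that builds the same text in each of them. The module and function names (string_forms, forms/0) are hypothetical, and the tagged tuple is only the shape proposed in this mail, not an existing OTP type. The ñ is written as explicit bytes so the example does not depend on how the source file itself is encoded.

```erlang
-module(string_forms).
-export([forms/0]).

forms() ->
    %% 1. Binary: the UTF-8 bytes as stored on disk (ñ is 195,177).
    Bin = <<"El tama", 195, 177, "o no importa!">>,
    %% 2. Plain string: a list of code points (ñ becomes 241).
    Plain = unicode:characters_to_list(Bin, utf8),
    %% 3. Tagged value: the hypothetical tuple proposed in this mail,
    %%    carrying the raw bytes plus a description of their encoding.
    Tagged = {string, [utf8, plain], binary_to_list(Bin)},
    {Bin, Plain, Tagged}.
```

The binary and the tagged payload both hold 22 bytes, while the plain string holds 21 code points; only the tagged form says which of the two readings is intended.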

/angel


> 
> as the first line :-)
> 
> (( I have argued in vain for a version for years - to allow for
> incompatible changes to the syntax ))
> 
> /Joe
> 
> >   /ᚱᛁᚴᚼᛅᚱᛏ
> > _______________________________________________
> > erlang-questions mailing list
> > erlang-questions@REDACTED
> > http://erlang.org/mailman/listinfo/erlang-questions
