[erlang-questions] Fwd: String encoding and character set

Romain Lenglet <>
Wed Jan 17 02:36:45 CET 2007


As Robert explained, the current convention for representing strings in 
Erlang is a flat list of Unicode code-points as integers. Every element 
in such a list is a character, represented by its Unicode code-point 
integer value. The 11th character of a string is the 11th element in the 
list. If you want to encode such a string, you are free to do so, and 
that is relatively easy. But the current convention is to represent 
strings *unencoded*, as such lists of Unicode code points.
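
To make this concrete with dda's example string, here is a minimal sketch. The module and function names (str_demo, sample, middle, utf8) are made up for illustration. It assumes the source file is read as UTF-8, and it uses the unicode module, which is only available in OTP releases newer than this message.

```erlang
%% coding: utf-8
-module(str_demo).
-export([sample/0, middle/0, utf8/0]).

%% dda's example string. With UTF-8 source encoding this literal is a
%% flat list of code points, one integer per character.
sample() ->
    "유니코드는ISO엔코딩보다 훨씬 좋다.".

%% Characters 6 to 11: start at position 6 and take 6 elements.
middle() ->
    lists:sublist(sample(), 6, 6).

%% Encode only at the boundary, here into a UTF-8 binary.
utf8() ->
    unicode:characters_to_binary(sample()).
```

str_demo:middle() returns the code points of ISO엔코딩. length(str_demo:sample()) counts characters, while byte_size(str_demo:utf8()) counts encoded bytes, so the two numbers differ for this string.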

One drawback of this convention is that the standard external 
representation (used e.g. when you send a term in an Erlang message) is 
inefficient for strings containing characters whose code point is > 255.
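
The size difference can be observed with term_to_binary/1. The external term format has a compact form for lists whose elements all fit in a byte. Any other list is written element by element, with a tag in front of each integer. The sketch below uses made-up names (ext_demo, sizes) and compares two three-character strings.

```erlang
-module(ext_demo).
-export([sizes/0]).

%% Returns {NarrowBytes, WideBytes}: the external-format sizes of two
%% strings that both contain three characters.
sizes() ->
    Narrow = "abc",                 % every code point is =< 255
    Wide   = [16#C720, $b, $c],     % first code point is U+C720, > 255
    {byte_size(term_to_binary(Narrow)),
     byte_size(term_to_binary(Wide))}.
```

The second size is larger than the first, although only one character changed. If that matters, one option is to encode the string into a binary yourself before sending it.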

Please read the long discussion that we had last summer:
http://www.erlang.org/pipermail/erlang-questions/2006-June/021168.html
http://www.erlang.org/pipermail/erlang-questions/2006-June/021214.html
http://www.erlang.org/pipermail/erlang-questions/2006-June/021215.html
etc. etc.

dda wrote:
> Nope. Let's take for instance a utf-8 string. As an Erlang list,
> there's no way, in the language, to extract safely one character or
> more from the string. You cannot extract, in say "유니코드는ISO엔코딩보다 훨씬
> 좋다." [that's Korean if you're wondering] the 6th to 11th characters –
> ISO엔코딩 – without doing more contortions than a circus artist. A list
> is not a string, it's raw data left for us to muck with.
> 
> --
> dda
> 
> On 1/17/07, Robert Virding <> wrote:
>> We do actually, in fact we have something much much better, a list.
>> Using a list you don't have to worry about encodings but can use the
>> unicode value directly in the string/list. This makes all processing
>> much easier. Then when you are done you can convert it to whatever
>> encoding you want.
>>
>> I don't really understand why anyone would want to process data in an
>> unnecessarily complex format instead of a simple one.
>>
>> Robert
>>
>> dda wrote:
>>> String types – at least well-implemented ones – don't just store a
>>> string, but also encoding information. They are/should be geared
>>> towards pain-free manipulation of text data, and by text I mean things
>>> outside ASCII-land. Encodings-aware string manipulation functions
>>> don't function on bytes, but on characters, a quite different notion.
>>> We don't have this in Erlang.



