[erlang-questions] cookbook entry #1 - unicode/UTF-8 strings
Richard O'Keefe
ok@REDACTED
Tue Oct 25 06:26:26 CEST 2011
>> Cookbook Question:
>>
>> I have often seen the words "UTF-8 string" used in sentences like
>> "Java has UTF-8 strings". What does this mean when applied to Erlang?
Minor note: it means very little when applied to Java and less of that
is actually true.
- Java *source* code, including strings, may be encoded in various
ways, including UTF-8.
- The String *class* in Java is *NOT* based on UTF-8 but on UTF-16.
- There is no reason in principle why Java _couldn't_ have a UTF-8
string type, but there doesn't happen to be one in java.lang.*
or java.util.* (I rolled my own Latin1 string class for some
tasks.)
>> In Erlang strings are syntactic sugar for "lists of integers"
In Java strings are syntactic sugar for "slices of arrays of 16-bit integers".
>>
>> Imagine the string "10(Euro)" - (Euro) is the glyph representing the
>> Euro currency symbol.
>>
>> The term "UTF-8 string" representing "10(Euro)" in Erlang could
>> mean one of two things:
>>
>> Either a) [49,48,8364] (i.e. it's a list of three Unicode
>> integers)
For "integer" read "codepoint".
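
To make the codepoint/encoding distinction concrete, here is a minimal sketch using the standard `unicode` module. The module name `euro_demo` and its function names are made up for illustration; only the `unicode:characters_to_binary/3` and `unicode:characters_to_list/2` calls come from OTP.

```erlang
-module(euro_demo).
-export([codepoints/0, utf8_bytes/0, roundtrip/0]).

%% "10" followed by the Euro sign (U+20AC), as a list of Unicode codepoints.
codepoints() -> [49, 48, 8364].

%% The same text encoded as UTF-8: a binary of five bytes,
%% because the Euro sign takes three bytes in UTF-8.
utf8_bytes() -> unicode:characters_to_binary(codepoints(), unicode, utf8).

%% Decoding the UTF-8 bytes gives back the three codepoints.
roundtrip() -> unicode:characters_to_list(utf8_bytes(), utf8).
```

Here `euro_demo:utf8_bytes()` should give `<<49,48,226,130,172>>`, and `euro_demo:roundtrip()` should give back `[49,48,8364]`: three codepoints, five bytes.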
Dealing with Unicode *codepoints* in Erlang is tolerably straightforward;
the difficulties are
- dealing with external *encodings*
- dealing with the *semantics* of Unicode, like the way that
different sequences of codepoints may represent the same
sequence of characters, so that list equality and string equality
are arguably different things.
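
The second difficulty can be sketched with a single accented letter. This assumes an OTP release that provides `unicode:characters_to_nfc_list/1` (OTP 20 or later, so not available when this message was written); the module name `nfc_demo` and its function names are hypothetical.

```erlang
-module(nfc_demo).
-export([composed/0, decomposed/0, same_list/0, same_text/0]).

%% "e with acute accent" as one precomposed codepoint (U+00E9).
composed() -> [16#E9].

%% The same character as "e" (U+0065) plus a combining acute accent (U+0301).
decomposed() -> [16#65, 16#301].

%% Plain list equality sees two different lists.
same_list() -> composed() =:= decomposed().

%% Comparing after NFC normalization treats them as the same text.
same_text() ->
    unicode:characters_to_nfc_list(composed()) =:=
        unicode:characters_to_nfc_list(decomposed()).
```

`nfc_demo:same_list()` is `false` while `nfc_demo:same_text()` is `true`, which is the sense in which list equality and string equality can differ. Whether NFC is the right normalization form depends on the application; it is used here only as one plausible choice.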