[erlang-questions] unicode in string literals

Thu Aug 2 03:50:40 CEST 2012

On 1/08/2012, at 7:30 PM, Vlad Dumitrescu wrote:

> 
>> but why should a module written by someone who wants
>> comments in Māori (note the macron? Latin-4 or Unicode needed)
>> use a module written by someone who wants comments in Swedish?
> 
> Maybe not in the long run, but there will be a (long) transition
> period where legacy code will still be used by new code.

Sorry, my typing mistake here.
What I *meant* to write was "why should a [Māori] module
*NOT* use a [Swedish] one"?  You were saying, or so I thought,
that there should be one project = one encoding, and I was saying
I thought that was too restrictive in practice.
> 
>> The whole point of an -encoding directive is that it is something
>> that syntaxtools should handle; by the time your code gets an AST
>> or a token list, encodings are entirely a thing of the past.
> 
> Yes, but I am one of the guys that is going to write some of the tools
> that will handle this conversion, so I do care about the details.

And by the time it gets to you, there won't *be* any details to care about.
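
A minimal sketch of that division of labour, in Python for brevity. The directive syntax `-encoding("...").`, the function name `decode_source`, the 256-byte search window, and the Latin-1 default are all assumptions made for illustration, not settled Erlang syntax or an existing API. The point is only that the reader decodes the bytes once, and everything downstream sees code points.

```python
import re

# Hypothetical directive form, e.g.  -encoding("utf-8").
# The directive is pure ASCII, so it can be located in the raw bytes
# before the file's encoding is known (ASCII-compatible encodings only).
DIRECTIVE = re.compile(rb'-encoding\("([A-Za-z0-9_.-]+)"\)\.')


def decode_source(raw: bytes, default: str = "latin-1") -> str:
    """Decode source bytes to Unicode text, honouring a leading directive."""
    match = DIRECTIVE.search(raw[:256])  # only look near the top of the file
    encoding = match.group(1).decode("ascii") if match else default
    return raw.decode(encoding)


# Two modules in different encodings decode to ordinary Unicode strings.
maori_module = '-encoding("utf-8").\n%% M\u0101ori comments\n'.encode("utf-8")
swedish_module = "%% R\u00e4ksm\u00f6rg\u00e5s\n".encode("latin-1")

maori_text = decode_source(maori_module)
swedish_text = decode_source(swedish_module)
```

After this step a tokenizer working on `maori_text` and `swedish_text` has no way to tell, and no need to know, which encoding each file was stored in.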
> 
>> SWI Prolog actually lets you change the encoding within a file,
>> which sounds crazy but maybe Jan wanted the machinery to be there
>> in case someone wanted ISO 2022 support.  (Because that's basically
>> what 2022 *is*: switching encoding aspects on the fly.)
> 
> Are there any editors that can load/save a file with mixed encodings like that?

I have no idea.  There are a number of editors that claim to support
ISO 2022, which does mid-stream code switching, so they could presumably
be extended to support this.  See for example
	Yutaka Kataoka, Masato Morisaki, Hiroshi Kuribayashi, and Hiroyoshi Ohara,
	"A model for input and output of multilingual text in a windowing environment",
	ACM Transactions on Information Systems (TOIS), Volume 10, Issue 4, Oct. 1992.
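
To make "mid-stream code switching" concrete, here is a small sketch using Python's standard `iso2022_jp` codec (ISO-2022-JP is one profile of ISO 2022). The sample string is made up for illustration. The encoded bytes start in ASCII, switch to JIS X 0208 with an escape sequence, and switch back again.

```python
# "Erlang 日本語 strings", written with escapes so the snippet is ASCII-only.
text = "Erlang \u65e5\u672c\u8a9e strings"

encoded = text.encode("iso2022_jp")

# Escape sequences defined by ISO-2022-JP:
to_kanji = b"\x1b$B"   # ESC $ B : switch to JIS X 0208
to_ascii = b"\x1b(B"   # ESC ( B : switch back to ASCII

print(encoded)

# Decoding follows the same escapes and recovers the original text.
decoded = encoded.decode("iso2022_jp")
```

An editor that already tracks this kind of escape state while loading a file is at least doing the bookkeeping that per-region encodings would need.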
> 
> I am still a little worried about two things:
> - debugging a remote system that has a different locale
> - reading logs created by modules that have different encodings (some
> modules might be legacy and not be aware that the world is not Latin-1
> anymore).

Ouch.  And then there are all those documents that lie about the
encoding they're using.  (Web pages that claim Latin 1 but are really
CP 1252 are only one example of this.)
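
For anyone who has not been bitten by that particular lie, a short Python sketch of what goes wrong. The sample bytes are invented for illustration. Latin 1 and CP 1252 agree everywhere except the range 0x80-0x9F, where Latin 1 has C1 control characters and CP 1252 has printable punctuation.

```python
# Bytes as they might arrive from a page labelled Latin 1
# but actually written on a CP 1252 system.
raw = b"\x93quoted\x94 \x80 5"

as_latin1 = raw.decode("latin-1")   # trusts the label
as_cp1252 = raw.decode("cp1252")    # what the author actually meant

# Under Latin 1 the bytes 0x93, 0x94, 0x80 become invisible C1 controls;
# under CP 1252 they are curly quotes and the euro sign.
print(repr(as_latin1))
print(repr(as_cp1252))
```

Both decodes succeed without error, which is exactly why the mislabelling tends to go unnoticed until someone reads the log.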