Language change proposal

Sun Nov 2 02:20:59 CET 2003

>>> 1) If somebody gives me software to maintain, I might hit a, say, 
>>> Chinese glyph somewhere. I'd have to download the proper font just 
>>> to be able to look at the sources.

If you're lucky the fonts may already be included with the OS, but even 
if they are not, I assume that a decent editor has at least some kind of 
fallback method to display such characters, say as their Unicode integer 
code, e.g. \2345, which is probably just as (in)comprehensible as the 
Chinese glyph.
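
Such a fallback is easy to sketch. The following is only an illustration, 
not how any particular editor works: the module name glyph_fallback, the 
function escape/1 and the \x{...} output format are all assumptions made 
for the example. It takes a string as a list of Unicode code points and 
replaces everything outside 7-bit ASCII with its hexadecimal code.

```erlang
-module(glyph_fallback).
-export([escape/1]).

%% Hypothetical helper: show every non-ASCII code point as \x{HEX}.
%% Input is a list of Unicode code points (an ordinary Erlang string).
escape(Chars) ->
    lists:flatmap(fun escape_char/1, Chars).

%% 7-bit ASCII is passed through unchanged.
escape_char(C) when C >= 0, C =< 127 ->
    [C];
%% Anything else is replaced by its code point in hexadecimal.
escape_char(C) when is_integer(C) ->
    "\\x{" ++ integer_to_list(C, 16) ++ "}".
```

For example, glyph_fallback:escape([16#4E2D, $a]) gives the string 
\x{4E2D}a, while a pure ASCII string comes back unchanged.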

>> It might also be just a bit tricky to figure out how to write the 
>> glyph/s, if it's something like Japanese, Chinese or Korean.
>
> The software that displays Unicode is supposed to do that for you.

I don't think that's really going to help if you want to figure out how 
to write, say, a Japanese or Chinese character on a regular keyboard - 
you can of course always copy and paste the text.

> Actually there are issues that I haven't seen properly handled yet; 
> for example, one Far-East script (Indonesian IIRC) has glyphs that /go 
> around/ their neighbouring glyph.
> Human writing is indeed a strange, aesthetically wonderful but 
> technically over-complicated beast - and Unicode is designed for 
> aesthetics and completeness, not for making life easy on the programs 
> that use it.
>
>>> Unicode also has issues with letter case.
>> Isn't this really a kind of design error/bug/feature in Erlang?
>> While I personally would prefer code to be written in English, I don't 
>> see any real problems with using Unicode.
>
> I don't either - but why use Unicode if you're writing in English 
> anyway? Even 7-bit ASCII is enough. Heck, even the common subset of 
> EBCDIC and ASCII would be enough!

Well, some additional symbol (non-letter) characters (like +, -, @ and ^) 
might be nice and come in handy as operators in different kinds of 
scientific notation.

>
>> The simplest way would 
>> probably be to introduce some kind of standard upper case marker 
>> (character) in the case that there is no upper case version of a 
>> character. Another somewhat more confusing choice would be to require 
>> that functions can only start with upper case Unicode letters 
>> (possibly only the characters supplied in the current Erlang 
>> character set).
>
> Too complicated, too much of a burden on the programmer to remember 
> correctly, too much of a burden on the maintainer to interpret 
> correctly.
>
> At least that was my initial reaction. Seeing a concrete example of 
> how this is done elegantly in practice, I might reconsider :-)

(Note: I'm assuming we only want non-English identifiers, strings and 
atoms.)

In Erlang we could require all variables/functions to start with an 
upper case ASCII character (for compatibility with the current OTP 
libs) if the identifier is pure ASCII, and otherwise to start with some 
special character (to mark them as upper case), say @ or some other 
character that isn't used in the context in which function and variable 
identifiers are used. It could look something like this:

%% multiply Ä by 2
@gånger_två(@Ä) -> @Ä * 2.

%% how to call @gånger_två/1
test() ->
	io:format("~p~n", [ @gånger_två(3) ]).
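
To make the leading-character rule concrete, here is a minimal sketch of 
how a scanner might classify identifiers under it. It reflects only one 
reading of the proposal: the module name ident_class, the function 
classify/1 and the result atoms are hypothetical, and it is written in 
current Erlang rather than in the proposed syntax.

```erlang
-module(ident_class).
-export([classify/1]).

%% Hypothetical classifier for the first character of an identifier,
%% given as a non-empty list of Unicode code points.
%%   marked      - starts with the proposed @ upper case marker
%%   ascii_upper - starts with an upper case ASCII letter (today's rule)
%%   other       - anything else
classify([$@ | _]) ->
    marked;
classify([C | _]) when C >= $A, C =< $Z ->
    ascii_upper;
classify([_ | _]) ->
    other.
```

With this, classify("Foo") returns ascii_upper, classify([$@, 16#C4]) 
(that is, @ followed by Ä) returns marked, and classify("test") returns 
other. Note that Ä on its own, without the marker, is classified as 
other, which is exactly why the marker would be needed.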

Strings could no longer be considered to be lists of bytes. I don't 
know if it would be a problem to generalize them to lists of integers, 
but it may be wiser to add a proper (Unicode) string type, which would 
probably require some way to distinguish such strings from the current 
ones, maybe something like: @"my unicode string".
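
The difference between "list of bytes" and "list of integers" can be 
shown with a small sketch. It relies on the unicode module found in later 
OTP releases, which did not exist when this was discussed; the module 
name str_repr and the function demo/0 are made up for the example.

```erlang
-module(str_repr).
-export([demo/0]).

%% Returns {NumberOfCharacters, NumberOfUtf8Bytes, RoundTripOk}.
demo() ->
    %% Two code points: U+00E5 (a with ring) and U+4E2D (a CJK character).
    Chars = [16#E5, 16#4E2D],
    %% Encode the code points as UTF-8 bytes.
    Utf8 = unicode:characters_to_binary(Chars),
    %% Decode again to check that nothing was lost.
    Back = unicode:characters_to_list(Utf8),
    {length(Chars), byte_size(Utf8), Back =:= Chars}.
```

str_repr:demo() returns {2, 5, true}: two characters as a list of 
integers, but five bytes once encoded, so code that assumes one list 
element per byte would no longer hold.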

Atoms may require a similar solution if they can contain Unicode. 
Functions like atom_to_list/1 will probably still work OK if strings 
remain lists of integers, but if a real string type is used it will 
obviously be preferable to have an atom_to_string/1 that returns a 
string. We would still need to support atom_to_list/1 (at least 
transitionally), at least for the ASCII subset of atom identifiers.
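
As a rough sketch of what such an atom_to_string/1 could look like, 
assuming (purely for illustration) that the "real string type" is a 
UTF-8 encoded binary: the wrapper below is hypothetical, while 
atom_to_binary/2 and atom_to_list/1 are existing BIFs in later OTP 
releases.

```erlang
-module(atom_str).
-export([atom_to_string/1]).

%% Hypothetical atom_to_string/1: returns the atom's name as a UTF-8
%% binary, standing in for a dedicated string type.
atom_to_string(Atom) when is_atom(Atom) ->
    atom_to_binary(Atom, utf8).
```

For an ASCII atom both routes agree on the characters: 
atom_str:atom_to_string(hello) gives <<"hello">> and atom_to_list(hello) 
gives "hello".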

>
>>> With one exception: it would be very nice if the language allowed 
>>> Unicode within string literals. That's more a question of how to 
>>> integrate binary data into source code well.
>> It might also be useful in comments, if they aren't written in 
>> English - Japanese, Russian and other languages that have completely 
>> different character sets will be rather tedious to encode in some 
>> kind of ASCII/Latin-1 version.
>
> Agreed.
> Though the Russians tend to manage somehow - I've been seeing a lot of 
> Russian software lately.

I guess Russian isn't that hard, as they mostly use different symbols 
for the same letters ("sounds"), so it should be a simple mapping to 
Latin letters.

> Actually, all the non-Western languages have ways of transliterating 
> to Western script. AFAIK there are even several schemes to choose from 
> for any such language.

The problem with transliterations is that they tend to be lossy.
Japanese (which I know a little bit) suffers from this, for example. 
While the actual transliteration is straightforward, it gets somewhat 
more difficult to actually read and understand the latinized 
(transliterated) version, as Japanese has a fair number of words that 
sound the same but are written with different Kanji (Chinese 
characters), which makes them easily distinguishable in Japanese 
writing but not in the Latin character set.
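
A tiny sketch of that loss, using a handful of common homophones. The 
table is just an illustrative sample with its usual romanization, the 
module and function names are made up, and the source file is assumed to 
be UTF-8 (the default in later OTP releases).

```erlang
-module(romaji_demo).
-export([romanize/1, words_for/1]).

%% Illustrative sample only: a few Kanji words and their romanization.
table() ->
    [{"橋", "hashi"},   % bridge
     {"箸", "hashi"},   % chopsticks
     {"端", "hashi"},   % edge
     {"雨", "ame"},     % rain
     {"飴", "ame"}].    % candy

%% Kanji word -> romanized form, or undefined if not in the sample.
romanize(Word) ->
    case lists:keyfind(Word, 1, table()) of
        {_, Romaji} -> Romaji;
        false -> undefined
    end.

%% Romanized form -> every Kanji word in the sample that maps to it.
words_for(Romaji) ->
    [Word || {Word, R} <- table(), R =:= Romaji].
```

Going from Kanji to Latin letters is a plain lookup, but going back is 
ambiguous: words_for("hashi") returns three different words, and only 
the surrounding context can tell which one was meant.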

> Re comment usage: In my book, comments are an integral part of the 
> source code. If a comment isn't necessary to understand the code, it's 
> redundant and should be removed, if it's necessary, it should be 
> written in the same language as the source code.
> From this point of view, there's no need for extra allowances in 
> comments.

This kind of assumes a reasonable fluency in English grammar and 
vocabulary. So while I too prefer the comments in English, or at 
least in the same language as the code (so that technical terminology 
doesn't get confused), there may be cases where it is wiser to let the 
programmer write at least the comments in a language other than English 
so that they can clearly express what they intended with the code.