[eeps] Re: [erlang-questions] [eeps] EEP 35 "Binary string modules" -- locales (fwd)

Mon Nov 15 15:49:48 CET 2010

Hi,

When I speak about locales, I do not mean locales as they are
implemented in the C standard library, where an environment variable
magically changes the behaviour of text-oriented routines.

I suppose your example points out the dreadful behaviour of having locales 
in the environment, combined with programmers assuming there is only one 
locale that matters, namely theirs...

If you had routines where you actually had to spell out the locale,
instead of a locale being set in the environment and routines changing
behaviour magically, such errors would not happen. I do not suggest
adopting the C standard library's use of a LANG environment variable; that
kind of magic change in functionality is not "Erlangish". I mean that
Erlang should have a notion of locales where you could pass a
locale specification to a routine and make it behave accordingly.
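
A minimal sketch of what such an explicit-locale routine could look like.
The module name locale_sketch, the function to_float/2 and the locale
atoms c, de_DE and sv_SE are made up for illustration; nothing like this
exists in OTP, and only the decimal separator is modelled.

```erlang
-module(locale_sketch).
-export([to_float/2]).

%% Hypothetical locale table: the locale is always an explicit argument,
%% never something read from the environment.
decimal_separator(c)     -> $.;
decimal_separator(de_DE) -> $,;
decimal_separator(sv_SE) -> $,.

%% Parse a float written with the decimal separator of the given locale.
%% Returns {ok, Float} or {error, no_float}.
to_float(String, Locale) ->
    Sep = decimal_separator(Locale),
    %% Split at the first occurrence of the locale's separator.
    {Int, Rest} = lists:splitwith(fun(C) -> C =/= Sep end, String),
    case Rest of
        [Sep | Frac] when Int =/= [], Frac =/= [] ->
            try {ok, list_to_float(Int ++ "." ++ Frac)}
            catch error:badarg -> {error, no_float}
            end;
        _ ->
            {error, no_float}
    end.
```

With this shape, locale_sketch:to_float("3,14", de_DE) and
locale_sketch:to_float("3.14", c) both yield {ok, 3.14}, while
locale_sketch:to_float("3.14", de_DE) is rejected instead of being
silently misread.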

The Erlang programming language itself has an implicit locale, and I'm
unsure which language, country or region it actually maps to. It's not the
C locale, as iso-8859-1 characters are allowed; it's also not a Swedish
locale (which might be expected), as it uses a decimal point. It's
simply the "Erlang" locale. That's OK as long as it's program code, but now
we are dealing with program data, and I don't think the "Erlang" or "C"
locales provide a good default.

Furthermore, we have no notion of locales in the Erlang standard library, 
not even a consistent default one. Each function dealing with language- or 
locale-specific matters has invented its own (sometimes not even
well-defined) semantics.

For example, the to_upper and to_lower routines don't handle sharp s,
rendering them somewhat less useful for German. The numeric conversion
routines only handle a decimal point, although a lot of countries using the
iso-latin-1 character set also use a decimal comma. I'm not saying those
functions are not useful as they are; they are just not as useful as they
might be given locale support. They are primarily useful when parsing
Erlang code; maybe they deal with the "Erlang" locale...
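
To make the sharp s case concrete, here is a rough sketch of an
upper-casing routine that takes the locale as an argument. It is not how
the standard library does it: the module upper_sketch, the function
to_upper/2 and the locale atoms are hypothetical, and the input is assumed
to be a list of iso-latin-1 code points.

```erlang
-module(upper_sketch).
-export([to_upper/2]).

%% Hypothetical locale-aware upper-casing over iso-latin-1 code points.
%% Note that the result may be longer than the input.
to_upper(String, Locale) ->
    lists:flatmap(fun(C) -> upper_char(C, Locale) end, String).

%% ASCII a-z is upper-cased in every locale.
upper_char(C, _) when C >= $a, C =< $z -> [C - 32];
%% Sharp s (16#DF) has no single upper-case code point in iso-latin-1;
%% in the assumed de_DE locale it expands to "SS".
upper_char(16#DF, de_DE) -> "SS";
%% Remaining iso-latin-1 lower-case letters (16#F7 is the division sign).
upper_char(C, de_DE) when C >= 16#E0, C =< 16#FE, C =/= 16#F7 -> [C - 32];
%% Everything else, including sharp s in the c locale, is left alone.
upper_char(C, _) -> [C].
```

Here upper_sketch:to_upper([$F,$u,16#DF,$b,$a,$l,$l], de_DE) gives
"FUSSBALL", whereas the same call with the c locale leaves the sharp s
untouched.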

I think we should differentiate between
A) Processing of bytes, a sequence of octets that may or may not
represent some kind of text. Typical routines for this are found
in 'file', 'binary', port communication (mostly in module 'erlang') etc.
B) Processing of strings, a sequence of characters representing written
text in any language (or, in the case of the limited subset of Unicode that
represents iso-latin-1, text in Western European or Anglo-Saxon
languages). Typical modules are 'unicode', 're', 'io' and 'string' (except
for the small part of string that belongs to the next group).
C) Processing of text, a sequence of characters with a meaning in a specific
language. To work with these, locale information is required. Typical 
examples are of course to_upper and to_lower, but also conversion between 
text and numbers. Other examples are date formatting routines, splitting 
text into words etc.

So, my point is - let's not sweep the domain of locales under the carpet
by providing limited functionality in the wrong module. Let's instead do
it right. That is why I suggested a specific EEP for the locale-specific
routines, in a text-oriented module.

That's the long story of why I don't suggest letting to_upper and friends
go into the bstring module.

Cheers,
/Patrik

On Mon, 15 Nov 2010, Patrik Nyblom wrote:

> Forwarded from erlang-questions mailing list.
> ---------- Forwarded message ----------
> Date: Fri, 12 Nov 2010 10:11:44 -0400
> From: Christian von Roques <roques@REDACTED>
> To: erlang-questions@REDACTED
> Subject: Re: [erlang-questions] [eeps] EEP 35 "Binary string modules" -- 
> locales
>
> Not all text is meant for human consumption.  I'd even venture so far as
> to say that the overwhelming mass of program generated text is not for
> human consumption, its intended consumers are other programs.  The most
> common locale programs "speak" is the default "C" (also called "POSIX")
> locale.  It is complicated to solve the general problem of supporting
> all human locales.  It is much simpler to just support a default locale.
> Even programs intended to create/consume text for humans often have to
> create/consume text in the C locale as well.
>
> I've been told the anecdote that in the 70s a delegation of IBM compiler
> engineers flew to Germany to proudly demonstrate their new optimizing
> Fortran compiler and all it did was spew gibberish and crash because it
> used the standard routines for reading/writing numbers, which in Germany
> used commas for dots and dots for commas due to the then new locale
> awareness of the OS.  Since then I've been convinced that it is a good
> thing to have two separate sets of functions, one small, simple, and
> fast handling only the default locale and another one huge, complicated,
> and not so fast trying to handle all the intricacies of as many locales
> as feasible.
>
> Therefore I'd like to see to_integer and to_float in bstring, grokking
> numbers in the C locale.  to_lower and to_upper too as long as it's
> documented on which characters they are working on.  They wouldn't even
> need to know if the bstring was iso8859-1 or utf-8 encoded as long as
> they only touch ASCII characters.
>
> I don't think it's practical to see bstring as locale independent.
> Rather bstring should be seen as operating in the default locale.  One
> being able to imagine a locale dependent variant of a function should
> not be grounds for omitting the function from bstring.  I can even
> imagine concat(<<"Fuß">>, <<"Ball">>) being expected to result in
> <<"Fussball">> in the de_DE locale.
>
> 	Christian.