[eeps] Re: [erlang-questions] [eeps] EEP 35 "Binary string modules" -- locales (fwd)
Patrik Nyblom
pan@REDACTED
Mon Nov 15 15:49:48 CET 2010
Hi,
When I speak about locales, I do not mean locales as they are
implemented in the C standard library, where an environment variable
magically changes the behaviour of text-oriented routines.
I suppose your example points out the dreadful behaviour of having locales
in the environment, combined with programmers assuming there is only one
locale that matters, namely theirs...
If you had routines where you actually had to spell out the locale,
instead of a locale being set in the environment and routines changing
behaviour magically, such errors would not happen. I do not suggest
adopting the C standard library's use of a LANG environment variable; that
kind of magic change of functionality is not "Erlangish". I mean that
Erlang should have a notion of locales where you could pass a
locale specification to a routine and make it behave accordingly.
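To illustrate the idea of passing the locale explicitly, here is a minimal
sketch. The module name locale_sketch, the function to_float/2 and the
locale atoms c and sv_se are all hypothetical and not part of OTP; only
string:to_float/1 is an existing library call.

```erlang
-module(locale_sketch).
-export([to_float/2]).

%% Hypothetical API (not in OTP): the caller names the locale explicitly,
%% nothing is read from the environment.
to_float(String, Locale) ->
    Sep = decimal_separator(Locale),
    %% In a decimal-comma locale, a "." in the input is not a valid separator.
    case Sep =/= $. andalso lists:member($., String) of
        true ->
            {error, no_float};
        false ->
            %% Map the locale's separator to "." and reuse the stdlib parser.
            Normalised = [case C of Sep -> $.; _ -> C end || C <- String],
            case string:to_float(Normalised) of
                {Float, []} when is_float(Float) -> {ok, Float};
                _ -> {error, no_float}
            end
    end.

%% Only two illustrative locales; a real design would need far more.
decimal_separator(c) -> $.;
decimal_separator(sv_se) -> $,.
```

With this sketch, locale_sketch:to_float("1,5", sv_se) succeeds, while the
same input with the c locale is rejected. The behaviour depends only on the
argument, never on the environment.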
The Erlang programming language itself has an implicit locale, and I'm
unsure which language, country or region it actually maps to. It's not the
C locale, as iso-8859-1 characters are allowed; it's also not a Swedish
locale (which might be expected), as it uses a decimal point. It's
simply the "Erlang" locale. That's OK as long as it's program code, but now
we are dealing with program data, and I don't think the "Erlang" or "C"
locales provide a good default.
Furthermore, we have no notion of locales in the Erlang standard library,
not even a consistent default one. Each function dealing with language- or
locale-specific matters has invented its own (sometimes not even
well-defined) semantics.
For example, the to_upper and to_lower routines don't handle sharp s,
rendering them somewhat less useful for German. The numeric conversion
routines only handle a decimal point, although a lot of countries using the
iso-latin-1 character set also use a decimal comma. I'm not saying those
functions are not useful as they are; they are just not as useful as they
might be given locale support. They are primarily useful when parsing
Erlang code; they deal with the "Erlang" locale, maybe...
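The two limitations can be observed with a small sketch like the one
below. It assumes an OTP release in which the Latin-1 based
string:to_upper/1 is still available; the module name stdlib_locale_demo is
made up for the example.

```erlang
-module(stdlib_locale_demo).
-export([demo/0]).

demo() ->
    %% "strasse" spelled with a sharp s, written as Latin-1 code point 16#DF.
    Word = "stra\x{DF}e",
    %% to_upper/1 upper-cases the other letters but leaves the sharp s as is.
    Upper = string:to_upper(Word),
    %% A decimal comma is not accepted by the numeric conversion ...
    Comma = string:to_float("3,14"),
    %% ... while a decimal point is.
    Point = string:to_float("3.14"),
    {Upper, Comma, Point}.
```

Calling stdlib_locale_demo:demo() returns the upper-cased word with the
sharp s untouched, an error tuple for the decimal-comma input, and a parsed
float for the decimal-point input.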
I think we should differentiate between
A) Processing of bytes, a sequence of octets that may or may not
represent some kind of text, typically routines for this are found
in 'file', 'binary', port communication (mostly in module 'erlang') etc.
B) Processing of strings, a sequence of characters representing written
text in any language (or, in the case of the limited subset of Unicode that
represents iso-latin-1, text in Western European or Anglo-Saxon
languages). Typical modules are 'unicode', 're', 'io' and 'string' (except
for the small part of string that belongs to the next group).
C) Processing of text, a sequence of characters with a meaning in a specific
language. To work with these, locale information is required. Typical
examples are of course to_upper and to_lower, but also conversion between
text and numbers. Other examples are date formatting routines, splitting
text into words etc.
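The difference between levels A and B can be shown with a short sketch
(the module name bytes_vs_chars is invented for the example; byte_size/1
and unicode:characters_to_list/2 are existing calls). Level C is
deliberately absent, since it would need locale information that the
standard library does not provide.

```erlang
-module(bytes_vs_chars).
-export([demo/0]).

demo() ->
    %% The UTF-8 encoding of "Fu" followed by a sharp s (bytes 16#C3, 16#9F).
    Bin = <<"Fu", 16#C3, 16#9F>>,
    %% (A) Bytes: the binary is just a sequence of octets.
    Bytes = byte_size(Bin),
    %% (B) Strings: decoding as UTF-8 yields a list of characters.
    Chars = length(unicode:characters_to_list(Bin, utf8)),
    {Bytes, Chars}.
```

Here the same data is four octets at level A but three characters at
level B. Deciding how to upper-case the last character is a level C
question.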
So, my point is: let's not sweep the domain of locales under the carpet
by providing limited functionality in the wrong module. Let's instead do
it right. Therefore I suggested a specific EEP for the locale-specific
routines, in a text oriented module.
That's the long story of why I don't suggest letting to_upper and friends
go into the bstring module.
Cheers,
/Patrik
On Mon, 15 Nov 2010, Patrik Nyblom wrote:
> Forwarded from erlang-questions mailing list.
> ---------- Forwarded message ----------
> Date: Fri, 12 Nov 2010 10:11:44 -0400
> From: Christian von Roques <roques@REDACTED>
> To: erlang-questions@REDACTED
> Subject: Re: [erlang-questions] [eeps] EEP 35 "Binary string modules" --
> locales
>
> Not all text is meant for human consumption. I'd even venture so far as
> to say that the overwhelming mass of program generated text is not for
> human consumption, its intended consumers are other programs. The most
> common locale programs "speak" is the default "C" (also called "POSIX")
> locale. It is complicated to solve the general problem of supporting
> all human locales. It is much simpler to just support a default locale.
> Even programs intended to create/consume text for humans often have to
> create/consume text in the C locale as well.
>
> I've been told the anecdote that in the 70s a delegation of IBM compiler
> engineers flew to Germany to proudly demonstrate their new optimizing
> Fortran compiler and all it did was spew gibberish and crash because it
> used the standard routines for reading/writing numbers, which in Germany
> used commas for dots and dots for commas due to the then new locale
> awareness of the OS. Since then I've been convinced that it is a good
> thing to have two separate sets of functions, one small, simple, and
> fast handling only the default locale and another one huge, complicated,
> and not so fast trying to handle all the intricacies of as many locales
> as feasible.
>
> Therefore I'd like to see to_integer and to_float in bstring, grokking
> numbers in the C locale. to_lower and to_upper too as long as it's
> documented on which characters they are working on. They wouldn't even
> need to know if the bstring was iso8859-1 or utf-8 encoded as long as
> they only touch ASCII characters.
>
> I don't think it's practical to see bstring as locale independent.
> Rather bstring should be seen as operating in the default locale. One
> being able to imagine a locale dependent variant of a function should
> not be ground for omitting the function from bstring. I can even
> imagine concat(<<"Fuß">>, <<"Ball">>) being expected to result in
> <<"Fussball">> in the DE_de locale.
>
> Christian.