[erlang-questions] Atom Unicode Support

Pierre Fenoll <>
Wed Feb 3 14:52:15 CET 2016


What about re.erl character classes?

I believe the regular expression [\s] does not match Unicode spaces, even when the unicode option is passed to re.erl functions.

Unicode also defines other character classes that it would be great for re.erl to support.
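
To make the distinction concrete, here is a minimal sketch in Python (used purely for illustration; it says nothing about what re.erl actually does). It looks up the Unicode general category of a few space characters and checks whether \s matches them in ASCII-only and in Unicode-aware mode. The helper name space_report is hypothetical.

```python
import re
import unicodedata

NBSP = "\u00a0"         # NO-BREAK SPACE
NARROW_NBSP = "\u202f"  # NARROW NO-BREAK SPACE


def space_report(ch):
    """Return (general category, ASCII-only \\s matches?, Unicode-aware \\s matches?)."""
    return (
        unicodedata.category(ch),
        re.fullmatch(r"\s", ch, re.ASCII) is not None,  # \s limited to [ \t\n\r\f\v]
        re.fullmatch(r"\s", ch) is not None,            # \s covers Unicode whitespace
    )


for ch in (" ", NBSP, NARROW_NBSP):
    print(f"U+{ord(ch):04X}", space_report(ch))
```

All three characters are in the Zs category, but only the Unicode-aware \s treats the two non-breaking spaces as whitespace. That is the behaviour I would like to be able to get from re.erl.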

> On 03 Feb 2016, at 14:20, Fred Hebert <> wrote:
> 
>> On 02/03, Max Lapshin wrote:
>> This is one single atom:    my package
>> It is not 2 words, it is a single word that has non-breakable space inside
>> it. Good luck for debugging =)
> 
> I believe the non-breakable space character is still in the "Separator, space" [Zs] general category and should ideally be handled as such by a UTF-8-aware compiler. So the same way you'd need to type 'my package' (with a regular space), you'd need to type 'my package' (with an nbsp).
> 
> If this isn't respected, I'd probably expect this to be a language problem, not a Unicode issue.
> 
> I believe Erlang 17 and earlier would complain about invalid syntax there. Starting in 18, such characters are seen as valid spaces in a program and just go through directly the way a regular space does.
> 
>> 
>> Of course you may tell me: hire programmer that makes such things. Ok, no
>> problems. But what to do with copy-paste from Skype/Slack, where such
>> symbols are translated into nice UTF-8 automatically?
> 
> Because I am a French-speaking user and non-breakable spaces have their place in regular usage. For example, ':' takes a leading narrow non-breakable space, and that space must be there while keeping the punctuation mark on the same line as its leading word.
> 
> I have my editor set to highlight such leading spaces with a special character:
> 
>   set list listchars=tab:»·,trail:·,nbsp:·
> 
> All tabs, trailing spaces, and non-breakable white space characters will show up in text. So 'my package' actually shows up as 'my·package' here, with some highlight color to make sure it's not just the literal '·'.
> 
> So assuming the code does not work properly, and that you are one of these programmers working with these characters on a day-to-day basis, there are still ways to work around it without confusion.
> 
> That of course ignores specially crafted code built with the sole intention of confusing people (such as using the Greek Α rather than the [whatever your locale] A in function or variable identifiers).
> 
>> It is very good that we all have about 80-90 symbols to write code that
>> other people understand, but I really don't understand what is the profit
>> of adding ability to make code non-understandable by people from other
>> cultures.
> 
> You make the assumption that without Unicode, Japanese programmers would write code in English rather than transliterating it into a Latin alphabet (say with ISO 3602), for example. This doesn't happen if the programmer does not know English, or if their target audience (coworkers, for example) does not speak English. They just find a very annoying workaround to get their meaning across in the language they feel they should use.
> 
> The reality is that if people feel like writing code in their own language, they will do so. If I'm writing code about an ATM in French, I might use the word 'guichet_automatique' or 'gab' instead of 'atm'. You would still be lost with a Latin alphabet and lose all meaning. And the comments may very well be in French too, since they'd be intended for French speakers.
> 
> So the benefit is that people can write code in their own native language unhindered, and it won't change anything about your comprehension of the code, because you likely wouldn't understand it anyway, or wouldn't be its target audience and would in all likelihood never see it in the first place.
> 
> Regards,
> Fred.

