[erlang-questions] Leex And Character Encodings

Mon Aug 23 00:41:11 CEST 2010

On Aug 21, 2010, at 8:37 PM, Gordon Guthrie wrote:
> The problem comes when I put spaces in the white space:
> =  1 +     2                  "=Â  1 +Â Â Â Â  2"                  =  1 +
>  2                #ERROR!
> 
> The expression round trips fine but (unlike the previous examples) the
> server-side expression returns an error for the value because the expression
> doesn't match any valid syntax.
> 
> Tabs are expanded to white spaces so the only problem (I think) is with
> multiple white spaces - which is why I think just adding a lexical token to
> make Â the same as 2 spaces would work.

It's not clear to me what precisely is mangling the spaces.
What _is_ clear is that "Â " is precisely what you see when
the Latin-1 No-Break-Space is first converted to UTF-8 and
then displayed by something expecting Latin-1.
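
To make that concrete, here is a small Python sketch (Python is used only for illustration; nothing in the thread depends on it). It encodes a no-break space as UTF-8 and then decodes the resulting bytes as if they were Latin-1. The variable names are mine.

```python
# U+00A0 is NO-BREAK SPACE; it is a single byte (16#A0) in Latin-1.
NBSP = "\u00a0"

# Step 1: something converts the character to UTF-8, giving two bytes.
utf8_bytes = NBSP.encode("utf-8")

# Step 2: something else reads those two bytes as two Latin-1 characters.
misread = utf8_bytes.decode("latin-1")

print(utf8_bytes.hex())   # c2a0
print(repr(misread))      # 'Â\xa0'
```

The first byte, 16#C2, is "Â" in Latin-1, and the second, 16#A0, is the no-break space again, so one character comes out looking like "Â" followed by a space.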

1.  How do no-break-space characters turn up?
2.  What is it that is rendering them as if they were encoded
    in Latin-1 rather than UTF-8?
3.  In any case, if you are going to hack it, you should make
    the 16#C2,16#A0 sequence equivalent to ONE space, not two.
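
If you do hack it, the idea in point 3 might look roughly like the Python sketch below. It assumes the input has already been mis-decoded, so that each no-break space (UTF-8 bytes C2 A0) arrives as the two characters "Â" plus a no-break space. The helper name normalise_spaces and the sample string are hypothetical, made up for this illustration; they are not taken from Leex or from the application under discussion.

```python
def normalise_spaces(text: str) -> str:
    """Map each no-break space, mangled or not, to ONE ordinary space."""
    # Handle the mis-decoded pair first, so its trailing no-break space
    # is not converted separately and counted as a second space.
    text = text.replace("\u00c2\u00a0", " ")
    # Then handle any correctly decoded no-break spaces.
    return text.replace("\u00a0", " ")

# Assumed reconstruction of the mangled expression quoted above.
mangled = "=\u00c2\u00a0 1 +\u00c2\u00a0\u00c2\u00a0\u00c2\u00a0\u00c2\u00a0 2"
print(normalise_spaces(mangled))
```

This puts back the spacing of the original expression, but it would also clobber a genuine "Â" that happens to be followed by a no-break space. Decoding the bytes as UTF-8 in the first place avoids that problem.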