[erlang-questions] Leex And Character Encodings

Sat Aug 21 10:37:20 CEST 2010

Robby

The front end supports full unicode and the inputs are expressions like in
Excel. The valid expressions consist of:
* functions (hundreds with unicode names)
* operators (+ - * / &)
* types ("quoted unicode strings" 1 true #ERRVAL! {date, time}) - the dates
are matched out of strings
* cell addresses and paths (a1 ./a1 ../../some/path/bob33 /a/path/a2
!a!path!a3)

The paths are lower-cased unicode and can't contain spaces and certain
characters - the cell addresses are latin alphabet based...

The front end sends everything utf-8 so some characters are encoded with 2,
3 or 4 bytes...
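
As a small illustration of those byte counts, here is a minimal Python sketch. It is not code from the thread, and the sample characters are arbitrary picks for the example.

```python
# Count how many bytes UTF-8 needs for a few sample characters.
# The sample characters are illustrative assumptions, not data from the thread.
samples = ["a", "å", "€", "😀"]

utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in utf8_lengths.items():
    print(f"U+{ord(ch):04X} takes {n} byte(s) in UTF-8")
```

Anything that walks such input one byte at a time sees each accented letter as two or more separate "characters".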

Best to just do a round trip set of examples. Stick an expression in the
front-end, store it in the db, and it returns both the original expression
and the result of evaluating it to the front-end. This table shows the results:

Front-End         Back-End                       Expr Returned     Val Returned
åäãâáà            Ã¥Ã¤Ã£Ã¢Ã¡Ã                    åäãâáà            åäãâáà
="åäãâá "         " =\"Ã¥Ã¤Ã£Ã¢Ã¡ \""           ="åäãâá "         åäãâáà
="åä " & "ãâá"    =\"Ã¥Ã¤ \" & \"Ã£Ã¢Ã¡\""       ="åä " & "ãâá"    åäãâáà
=1+2              =1+2                           =1+2              3
= 1 + 2           = 1 + 2                        = 1 + 2           3

The problem comes when I put multiple spaces in the white space (same columns
as the table above):
=  1 +     2      "=Â  1 +Â Â Â Â  2"            =  1 +     2      #ERROR!

The expression round trips fine but (unlike the previous examples) the
server-side expression returns an error for the value because the expression
doesn't match any valid syntax.
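
The Back-End column looks like UTF-8 bytes shown one byte per character. Assuming that is what is happening (the thread does not confirm it), the stray Â would be the first byte of a no-break space, U+00A0, rather than a mangled pair of ordinary spaces. A minimal Python sketch of that reading, where `misread_as_latin1` is a hypothetical helper name:

```python
def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes as Latin-1.

    This mimics a consumer that treats every byte as one character.
    """
    return text.encode("utf-8").decode("latin-1")


accented = misread_as_latin1("åäãâáà")
nbsp = misread_as_latin1("\u00a0")  # NO-BREAK SPACE

print(accented)    # compare with the Back-End column of the table
print(repr(nbsp))  # repr makes the invisible second character visible
```

If this assumption holds, the lexer is receiving a character that is not in its whitespace class, which would explain the syntax error.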

Tabs are expanded to white spaces so the only problem (I think) is with
multiple white spaces - which is why I think just adding a lexical token to
make Â the same as 2 spaces would work.

The problem is that it is just fugly :(
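
One possible alternative to a dedicated token is to normalise the input before it reaches the lexer. The Python sketch below only illustrates the idea. It assumes the stray characters are no-break spaces (U+00A0), and `normalise_spaces` is a hypothetical helper, not part of leex or of the code discussed here.

```python
NBSP = "\u00a0"  # NO-BREAK SPACE, assumed to be the source of the stray "Â"


def normalise_spaces(raw: bytes) -> str:
    """Decode UTF-8 input and replace no-break spaces with plain spaces."""
    return raw.decode("utf-8").replace(NBSP, " ")


# Reconstruction of the failing expression under the assumption above.
raw = "=\u00a0 1 +\u00a0\u00a0\u00a0\u00a0 2".encode("utf-8")
cleaned = normalise_spaces(raw)
print(repr(cleaned))
```

The same substitution could be done on the Erlang side before tokenising. It keeps the grammar free of encoding details, at the cost of also rewriting any no-break spaces inside quoted strings unless those are handled separately.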

Gordon

On 20 August 2010 10:12, Robert Raschke <rtrlists@REDACTED> wrote:

> Hi Gordon,
>
>
> On Fri, Aug 20, 2010 at 8:30 AM, Gordon Guthrie <gordon@REDACTED> wrote:
>
>> I'm hitting some problems with the character encoding for Leex.
>>
>> I have a front end which is submitting proper unicode in utf-8 format and
>> the utf-8 is round-tripping correctly - I submit it from the webpage in a
>> jQuery post, it is processed on the back end and then returned to the
>> front end in utf-8 where it displays correctly...
>>
>> During the back end processing I need to feed it through leex to generate
>> user actions - certain posts contain a domain specific language.
>>
>> The DSL is fairly well specified and strings in it are quoted so they just
>> pass through the lexer in utf-8 and are fine and dandy and we do some
>> processing on them in unicode - by running language parsers over the
>> lexical token stream. The utf-8 just streams through the parser as a
>> single character stream and we don't care...
>>
>> The problem is that the white space elements of the DSL get knocked about,
>> so two spaces are turned into Â
>>
>> It seems to me that I can't expect lex to work with utf-8 natively and I
>> just have to suck it up and create a whitespace lexical token that matches
>> Â
>>
>> Or am I just being a fool?
>>
>> Gordon
>>
>> --
>> Gordon Guthrie
>> CEO hypernumbers
>>
>> http://hypernumbers.com
>> t: hypernumbers
>> +44 7776 251669
>>
>
> That doesn't sound right. Which whitespaces are getting trashed? Ones
> between your DSL elements, or ones inside your quoted strings?
>
> I'm assuming that just your quoted strings contain non-7bit-ascii. Or does
> your DSL have elements outside that?
>
> If only the strings have utf-8, then what are you doing to "process" them?
>
> Questions, questions,
> Robby
>
> PS You can buy me a pint some evening and bring your code, if you like :-))
>
>

-- 
Gordon Guthrie
CEO hypernumbers

http://hypernumbers.com
t: hypernumbers
+44 7776 251669