[erlang-questions] json to map

Richard A. O'Keefe ok@REDACTED
Fri Aug 28 11:02:12 CEST 2015


On 28/08/2015, at 8:31 pm, Bengt Kleberg <bengt.kleberg@REDACTED> wrote:

> On 08/28/2015 09:45 AM, Roelof Wobben wrote:
>> I will take the challenge but im stuck at the types part.

By far the easiest way to convert my Haskell sample code
to Erlang is to throw the types completely away, or just
leave them as comments.
>> 
>> so far I have this :
>> 
>> -module(time_parser).
>> 
>> -export([]).
>> 
>> -type token :: tInt()
>>             | tWord()
>>             | tSlash()
>>             | tDash()
>>             | tComma().
>> 
>> -type tint()   :: integer().
>> -type tword()  :: binary().
>> -type tSlash() :: binary().
>> -type tDash()  :: binary().
>> -type tComma() :: binary().

Leaving the omitted () aside, this isn't even CLOSE to a
good translation of the Haskell data type.
It makes tword -- which should have been tWord, and in
idiomatic Erlang would be t_word -- and tSlash and tDash
and tComma all the *same* type, binary().  But the whole
point is to make them DIFFERENT.

-type token()
   :: {int,integer()}
    | {word,string()}  %% NOT binary!
    | '/'
    | '-'
    | ','.

The alternatives MUST be such that they cannot be
confused with one another.
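A hypothetical consumer of these tokens shows why: because each
alternative carries a distinct tag (or is a distinct atom), no clause
can accidentally match another's token.  The function and module names
here are mine, not from the post:

```erlang
-module(token_demo).
-export([describe/1]).

%% Each alternative of token() matches exactly one clause.
describe({int, N}) when is_integer(N) -> {integer, N};
describe({word, W}) when is_list(W)   -> {name, W};
describe('/')                         -> slash;
describe('-')                         -> dash;
describe(',')                         -> comma.
```

Had all five alternatives been binary(), no such dispatch by pattern
would be possible.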

tokens([])                           -> [];
tokens([C|Cs]) when C =< 32          -> tokens(Cs);
tokens([C|Cs]) when $0 =< C, C =< $9 -> digits(Cs, C-$0);
tokens([C|Cs]) when $a =< C, C =< $z -> word(Cs, [C]);
tokens([C|Cs]) when $A =< C, C =< $Z -> word(Cs, [C]);
tokens("/"++Cs)                      -> ['/' | tokens(Cs)];
tokens("-"++Cs)                      -> ['-' | tokens(Cs)];
tokens(","++Cs)                      -> [',' | tokens(Cs)].
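The digits/2 and word/2 helpers are called above but not shown.  One
plausible completion (the module name, the helper bodies, and the
ASCII lower-casing are my guesses, not the original code) is:

```erlang
-module(tok).
-export([tokens/1]).

%% tokens/1 repeated from above so the module is self-contained.
tokens([])                           -> [];
tokens([C|Cs]) when C =< 32          -> tokens(Cs);
tokens([C|Cs]) when $0 =< C, C =< $9 -> digits(Cs, C-$0);
tokens([C|Cs]) when $a =< C, C =< $z -> word(Cs, [C]);
tokens([C|Cs]) when $A =< C, C =< $Z -> word(Cs, [C]);
tokens("/"++Cs)                      -> ['/' | tokens(Cs)];
tokens("-"++Cs)                      -> ['-' | tokens(Cs)];
tokens(","++Cs)                      -> [',' | tokens(Cs)].

%% accumulate a run of digits into a single integer token
digits([C|Cs], N) when $0 =< C, C =< $9 -> digits(Cs, N*10 + C - $0);
digits(Cs, N)                           -> [{int,N} | tokens(Cs)].

%% accumulate a run of letters, then lower-case them (ASCII only)
word([C|Cs], Acc) when $a =< C, C =< $z; $A =< C, C =< $Z ->
    word(Cs, [C|Acc]);
word(Cs, Acc) ->
    [{word, [C bor 32 || C <- lists:reverse(Acc)]} | tokens(Cs)].
```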

Of course this wants to convert the letters to lower case,
and it would be really nice to have standard
is_digit(Codepoint[, Base])
is_lower(Codepoint)
is_upper(Codepoint)
is_alpha(Codepoint)
is_space(Codepoint)
guards.  No, macros are NOT good enough;
-define(is_alpha(C), (($a =< ((C) bor 32)) andalso (((C) bor 32) =< $z))).
was fine for ASCII, but failed dramatically for Latin 1,
and these are the days of Unicode.  It's nearly 9pm, time
to go home.  Maybe I should write an EEP about this.

You might say, well, use regular expressions.
Match letters using the POSIX '[[:alpha:]]' construction.
But what does _that_ rely on, eh?
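For the record, the regex route is available in the stdlib: the re
module is PCRE-based and does accept the POSIX class syntax.  A sketch
(the function and module name are mine):

```erlang
-module(alpha_demo).
-export([is_alpha_str/1]).

%% Classification via the regex engine rather than language guards;
%% correctness now depends on what the PCRE library considers alpha.
is_alpha_str(S) ->
    re:run(S, "^[[:alpha:]]+$") =/= nomatch.
```

Which is exactly the point: the language itself still has no answer,
it just delegates the question to the regex library.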

I have completed the translation of the tokeniser from
Haskell to Erlang, and it is pretty much line for line,
and it works.

4> t:t("Jan 26, 1942").
[{word,"jan"},{int,26},',',{int,1942}]
5> t:t("You need gumboots").
[{word,"you"},{word,"need"},{word,"gumboots"}]
6> t:t("Can you dance the Watusi?").
** exception error: no function clause matching t:tokens("?") (t.erl, line 7)
     in function  t:word/2 (t.erl, line 22)
     in call from t:word/2 (t.erl, line 22)

Here's a curious thought.

The use of [Token | tokens(Cs)] means that the stack builds up
a tower of tokens/1 calls, one per token.  By passing the list
of tokens so far through as an accumulator, these can all be
tail calls.

tokens(Cs) ->
    lists:reverse(tokens(Cs, [])).

...
tokens("/"++Cs, Ts) -> tokens(Cs, ['/'|Ts]);
...
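Filled out, the accumulator version might look like this (a sketch:
the module name, the three-argument helpers, and the ASCII
lower-casing are my assumptions, not code from the post):

```erlang
-module(t2).
-export([tokens/1]).

tokens(Cs) ->
    lists:reverse(tokens(Cs, [])).

%% Every recursive call is now a tail call: the token list is built
%% in reverse in the accumulator Ts and flipped once at the end.
tokens([], Ts)                           -> Ts;
tokens([C|Cs], Ts) when C =< 32          -> tokens(Cs, Ts);
tokens([C|Cs], Ts) when $0 =< C, C =< $9 -> digits(Cs, C-$0, Ts);
tokens([C|Cs], Ts) when $a =< C, C =< $z; $A =< C, C =< $Z ->
    word(Cs, [C], Ts);
tokens("/"++Cs, Ts) -> tokens(Cs, ['/'|Ts]);
tokens("-"++Cs, Ts) -> tokens(Cs, ['-'|Ts]);
tokens(","++Cs, Ts) -> tokens(Cs, [','|Ts]).

digits([C|Cs], N, Ts) when $0 =< C, C =< $9 -> digits(Cs, N*10 + C - $0, Ts);
digits(Cs, N, Ts)                           -> tokens(Cs, [{int,N}|Ts]).

word([C|Cs], Acc, Ts) when $a =< C, C =< $z; $A =< C, C =< $Z ->
    word(Cs, [C|Acc], Ts);
word(Cs, Acc, Ts) ->
    tokens(Cs, [{word, [C bor 32 || C <- lists:reverse(Acc)]} | Ts]).
```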

But then there's that reversal step.  Not a big deal, BUT
it's no harder to parse JSON backwards than it is to parse
JSON forwards!  (Even if you allow JavaScript comments,
they disappear in tokenising, so the *tokens* can be parsed
backwards easily.)  This is a peculiarity of JSON.  I think
you can pull the same trick with XML: lex it forwards, parse
the token sequence backwards.

