[erlang-questions] json to map

Fri Aug 28 07:21:57 CEST 2015

On 27/08/2015, at 11:04 pm, Roelof Wobben <r.wobben@REDACTED> wrote:

> Thanks,
> 
> Can this be a way to solve the challenge : http://www.evanmiller.org/write-a-template-compiler-for-erlang.html

That link starts by making three claims:

 • Erlang is hard to refactor

   I don't find manual refactoring harder in Erlang than in
   any other language (not excluding Java and Smalltalk).
   I haven't tried Wrangler or RefactorErl (see
   http://plc.inf.elte.hu/erlang/) yet, but they look good.

 • There is no built-in syntax for hash maps

   This is no longer true.

 • String manipulation is hard

   That's a puzzler.  I've found string manipulation using
   lists *easier* in Erlang than in almost anything but SNOBOL
   or Prolog.  I would certainly ***MUCH*** rather write a
   string -> JSON parser in Erlang than in say Java or even
   Lisp.  (Of course the Bigloo implementation of Scheme has
   special support for lexers and parsers built in, which does
   change the picture.)

   The question is always "compared with WHAT?"  In many cases
   the key trick for manipulating strings is DON'T.  My JSON
   parser in Smalltalk, for example, is only concerned with
   strings to the extent that they are a nasty problem posed
   by JSON that it has to solve; they are not something that
   it uses for its own purposes.  The tokeniser converts a
   stream of characters to a stream of tokens, and the parser
   works with tokens, not characters.  (Yes, I know about
   scannerless parsers, but the factoring has always helped me
   to get a parser working.  A separate tokeniser is something
   that I can *TEST* without having to have the rest of the
   parser working.)

Then it turns out that the web page is really about writing
a compiler from "Django Template Language" to Erlang.
"It helps to get a hold of a language specification if there
is one. I am implementing the Django Template Language.  There's
not really a spec, but there is an official implementation in Python,"

OUCH!  What *IS* it about this industry?  Why do we get notations
that become popular where there is no spec (like Markdown,
originally, or JSON, ditto -- it had syntax but no semantics)
or the spec is confused (like XML, where they muddled up
syntax and semantics so that we ended up with several different
semantics for XML, or the first version of RDF, where they
meant to define it in terms of XML semantics, but there wasn't
really one, so they defined it in terms of XML syntax *by mistake*)?

That page talks about writing a scanner with an argument to
say what the state is.  This is almost always a bad idea.
Each state should be modelled by a separate Erlang function.

Let's see an example of this.
Let's consider dates written in one of four ways:
    dd/mm/yyyy
    dd MON yyyy
    MON dd[,] yyyy
    yyyy-mm-dd

(By the way, we give matching and cleaning up data that's just
a little bit more complex than this as an exercise to 3rd year
students.  Thinking in Java makes it *impossible* for them to
get something like this right in a 2-hour lab session. 
Regular expressions are a royal road to ruin.)

I'll do this in Haskell.

import Data.Char (isAlpha, isDigit, isSpace, ord, toLower)

data Token
   = TInt Int
   | TWord String
   | TSlash
   | TDash
   | TComma
   deriving (Eq, Show)

tokens :: [Char] -> [Token]

tokens [] = []
tokens (c:cs) | isSpace c = tokens cs
tokens (c:cs) | isDigit c = digits cs (ord c - ord '0')
tokens (c:cs) | isAlpha c = word   cs [toLower c]
tokens ('/':cs) = TSlash : tokens cs
tokens ('-':cs) = TDash  : tokens cs
tokens (',':cs) = TComma : tokens cs
-- anything else will crash

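-- digits cs n: n is the value of the digits read so far
digits :: [Char] -> Int -> [Token]
digits (c:cs) n | isDigit c = digits cs (ord c - ord '0' + n*10)
digits cs     n             = TInt n : tokens cs

-- word cs w: w holds the lower-cased letters read so far, reversed
word :: [Char] -> [Char] -> [Token]
word (c:cs) w | isAlpha c = word cs (toLower c : w)
word cs     w             = TWord (reverse w) : tokens cs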

Converting the tokeniser to Erlang is a trivial exercise for
the reader.

valid_month :: String -> Int
valid_month "jan"      = 1
valid_month "january"  = 1
...
valid_month "december" = 12
-- anything else will crash

string_to_date :: [Char] -> (Int,Int,Int)

string_to_date cs =
  case tokens cs of
    [TInt d,TSlash,TInt m,TSlash,TInt y] -> check y m d
    [TInt y,TDash, TInt m,TDash, TInt d] -> check y m d
    [TInt d,TWord m,TInt y]              -> check y (valid_month m) d
    [TWord m,TInt d,TComma,TInt y]       -> check y (valid_month m) d
    [TWord m,TInt d,       TInt y]       -> check y (valid_month m) d
-- anything else will crash

check :: Int -> Int -> Int -> (Int,Int,Int)
-- left as a boring exercise for the reader.

Converting this to Erlang is also a trivial exercise for the reader.
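
For what it is worth, here is a minimal sketch of what `check` might
look like.  It is only one possibility: the helper names isLeapYear,
daysInMonth and validDate are made up for this illustration, and the
choice of Gregorian rules with years >= 1 is an assumption, not
something stated above.  Like the other functions, it crashes on
bad input.

```haskell
-- Hypothetical helpers; none of these names come from the original post.

-- Gregorian leap-year rule (assumed).
isLeapYear :: Int -> Bool
isLeapYear y = (y `mod` 4 == 0 && y `mod` 100 /= 0) || y `mod` 400 == 0

-- Number of days in month m of year y; m is assumed to be in 1..12.
daysInMonth :: Int -> Int -> Int
daysInMonth y m
  | m == 2                 = if isLeapYear y then 29 else 28
  | m `elem` [4, 6, 9, 11] = 30
  | otherwise              = 31

-- True when (y, m, d) names a real calendar date.
validDate :: Int -> Int -> Int -> Bool
validDate y m d =
  y >= 1 && m >= 1 && m <= 12 && d >= 1 && d <= daysInMonth y m

-- Returns the date unchanged if it is valid.
check :: Int -> Int -> Int -> (Int, Int, Int)
check y m d
  | validDate y m d = (y, m, d)
  | otherwise       = error "check: not a valid date"
```

Keeping the test (validDate) apart from the crash (check) means the
validity rule can be exercised on bad dates without having to catch
an exception.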

You will notice that there are multiple scanning functions and
no 'what state am I in?' parameter.  Your scanner should KNOW
what state it is in because it knows what function is running.

Yecc is a great tool, but for something like this there's no
real point in it, and even for something like JSON I would
rather not use it.

One thing that Leex and Yecc can do for you
is to help you track source position for reporting
errors.  For a configuration file, it may be sufficient to
just say "Can't parse configuration file X as JSON."
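
If more than that is wanted from a hand-written tokeniser, one
possible approach is to thread a column number through the scanning
functions.  The sketch below is an illustration only: the names
tokensFrom and digitsFrom, the cut-down Token type, and the choice of
1-based columns are all assumptions made for the example.

```haskell
import Data.Char (isDigit, isSpace, ord)

-- A cut-down token type, just enough for the illustration.
data Token = TInt Int | TSlash | TDash
   deriving (Eq, Show)

-- tokensFrom col cs: each token is paired with the column where it starts.
tokensFrom :: Int -> [Char] -> [(Int, Token)]
tokensFrom _   [] = []
tokensFrom col (c:cs) | isSpace c = tokensFrom (col + 1) cs
tokensFrom col (c:cs) | isDigit c = digitsFrom col (col + 1) cs (ord c - ord '0')
tokensFrom col ('/':cs) = (col, TSlash) : tokensFrom (col + 1) cs
tokensFrom col ('-':cs) = (col, TDash)  : tokensFrom (col + 1) cs
-- anything else still crashes, but now it says where
tokensFrom col (c:_) =
   error ("unexpected " ++ show c ++ " at column " ++ show col)

-- digitsFrom start col cs n: start is the column of the first digit,
-- col is the column of the next unread character, n the value so far.
digitsFrom :: Int -> Int -> [Char] -> Int -> [(Int, Token)]
digitsFrom start col (c:cs) n
   | isDigit c = digitsFrom start (col + 1) cs (ord c - ord '0' + n*10)
digitsFrom start col cs n = (start, TInt n) : tokensFrom col cs
```

The column is just data carried along; which function is running
still says what state the scanner is in.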

OK, the technique I used above is "recursive descent",
which works brilliantly for LL(k) languages with small k.
But you knew that.
Oh yes, this does mean that writing a parser is just like
writing a lexical analyser, except that you get to use
general recursion.  Again, you typically have (at least)
one function per non-terminal symbol, plus (if your
original specification used extended BNF) one function
per repetition.

Heck.
s expression
  = word
  | "(", [s expression+, [".", s expression]], ")".

data SExpr
   = Word String
   | Cons SExpr SExpr
   | Nil

-- Token is assumed to be extended with TLp, TRp and TDot
-- for "(", ")" and ".".
s_expression :: [Token] -> (SExpr, [Token])

s_expression (TWord w : ts) = (Word w, ts)
s_expression (TLp : TRp : ts) = (Nil, ts)
s_expression (TLp : ts) = s_expr_body ts

s_expr_body :: [Token] -> (SExpr, [Token])
s_expr_body (TRp : ts) = (Nil, ts)
s_expr_body (TDot : ts) =
   let (e, TRp : ts') = s_expression ts
    in (e, ts')
s_expr_body ts =
   let (f, ts')  = s_expression ts
       (r, ts'') = s_expr_body ts'
    in (Cons f r, ts'')

This is so close to JSON that handling JSON without
"objects" should now be straightforward.  And it makes
a good development step.
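
As a rough sketch of that development step, here is how a parser for
JSON values without objects might look in the same style.  The JToken
and JSON types and the function names json_value and json_elements are
invented for this illustration; the tokeniser that would produce the
JToken list is not shown.

```haskell
-- Hypothetical token type for JSON; objects are left out on purpose.
data JToken
   = JTNum Double
   | JTStr String
   | JTTrue
   | JTFalse
   | JTNull
   | JTLb            -- "["
   | JTRb            -- "]"
   | JTComma
   deriving (Eq, Show)

data JSON
   = JNum Double
   | JStr String
   | JBool Bool
   | JNull
   | JArray [JSON]
   deriving (Eq, Show)

-- One function for the non-terminal "value".
json_value :: [JToken] -> (JSON, [JToken])
json_value (JTNum n : ts) = (JNum n, ts)
json_value (JTStr s : ts) = (JStr s, ts)
json_value (JTTrue  : ts) = (JBool True, ts)
json_value (JTFalse : ts) = (JBool False, ts)
json_value (JTNull  : ts) = (JNull, ts)
json_value (JTLb : JTRb : ts) = (JArray [], ts)
json_value (JTLb : ts) =
   let (es, ts') = json_elements ts
    in (JArray es, ts')
-- anything else will crash

-- One function for the repetition: value {"," value} "]"
json_elements :: [JToken] -> ([JSON], [JToken])
json_elements ts =
   case json_value ts of
     (e, JTComma : ts') ->
        let (es, ts'') = json_elements ts'
         in (e : es, ts'')
     (e, JTRb : ts') -> ([e], ts')
-- anything else will crash
```

Adding objects later would mean two more tokens for the braces, one
for the colon, one more constructor, and one more repetition function
for the members.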