[erlang-questions] json to map
Richard A. O'Keefe
ok@REDACTED
Fri Aug 28 07:21:57 CEST 2015
On 27/08/2015, at 11:04 pm, Roelof Wobben <r.wobben@REDACTED> wrote:
> Thanks,
>
> Can this be a way to solve the challenge : http://www.evanmiller.org/write-a-template-compiler-for-erlang.html
That link starts by making three claims:
• Erlang is hard to refactor
I don't find manual refactoring harder in Erlang than in
any other language (not excluding Java and Smalltalk).
I haven't tried Wrangler or RefactorErl (see
http://plc.inf.elte.hu/erlang/) yet, but they look good.
• There is no built-in syntax for hash maps
This is no longer true.
• String manipulation is hard
That's a puzzler. I've found string manipulation using
lists *easier* in Erlang than in almost anything but SNOBOL
or Prolog. I would certainly ***MUCH*** rather write a
string -> JSON parser in Erlang than in say Java or even
Lisp. (Of course the Bigloo implementation of Scheme has
special support for lexers and parsers built in, which does
change the picture.)
The question is always "compared with WHAT?" In many cases
the key trick for manipulating strings is DON'T. My JSON
parser in Smalltalk, for example, is only concerned with
strings to the extent that they are a nasty problem posed
by JSON that it has to solve; they are not something that
it uses for its own purposes. The tokeniser converts a
stream of characters to a stream of tokens, and the parser
works with tokens, not characters. (Yes, I know about
scannerless parsers, but the factoring has always helped me
to get a parser working. A separate tokeniser is something
that I can *TEST* without having to have the rest of the
parser working.)
Then it turns out that the web page is really about writing
a compiler from "Django Template Language" to Erlang.
"It helps to get a hold of a language specification if there
is one. I am implementing the Django Template Language. There's
not really a spec, but there is an official implementation in Python,"
OUCH! What *IS* it about this industry? Why do we get notations
that become popular where there is no spec (like Markdown,
originally, or JSON, ditto -- it had syntax but no semantics)
or the spec is confused (like XML, where they muddled up
syntax and semantics so that we ended up with several different
semantics for XML, or the first version of RDF, where they
meant to define it in terms of XML semantics, but there wasn't
really one, so they defined it in terms of XML syntax *by mistake*).
That page talks about writing a scanner with an argument to
say what the state is. This is almost always a bad idea.
Each state should be modelled by a separate Erlang function.
Let's see an example of this.
Let's consider dates written in one of four ways:
    dd/mm/yyyy
    dd MON yyyy
    MON dd[,] yyyy
    yyyy-mm-dd
(By the way, we give matching and cleaning up data that's just
a little bit more complex than this as an exercise to 3rd year
students. Thinking in Java makes it *impossible* for them to
get something like this right in a 2-hour lab session.
Regular expressions are a royal road to ruin.)
I'll do this in Haskell.
import Data.Char (isAlpha, isDigit, isSpace, ord, toLower)

data Token
   = TInt Int
   | TWord String
   | TSlash
   | TDash
   | TComma

tokens :: [Char] -> [Token]
tokens [] = []
tokens (c:cs) | isSpace c = tokens cs
tokens (c:cs) | isDigit c = digits cs (ord c - ord '0')
tokens (c:cs) | isAlpha c = word cs [c]
tokens ('/':cs) = TSlash : tokens cs
tokens ('-':cs) = TDash  : tokens cs
tokens (',':cs) = TComma : tokens cs
-- anything else will crash

-- accumulate the digits of a decimal integer
digits :: [Char] -> Int -> [Token]
digits (c:cs) n | isDigit c = digits cs (ord c - ord '0' + n*10)
digits cs     n             = TInt n : tokens cs

-- accumulate the letters of a word, case-folded
word :: [Char] -> [Char] -> [Token]
word (c:cs) w | isAlpha c = word cs (toLower c : w)
word cs     w             = TWord (reverse w) : tokens cs
Converting the tokeniser to Erlang is a trivial exercise for
the reader.
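For concreteness, here is one way that conversion might come out
(my own sketch; the tuple-and-atom token representation is a
choice of mine, not part of the original):

tokens([]) -> [];
tokens([C|Cs]) when C =:= $\s; C =:= $\t; C =:= $\n ->
    tokens(Cs);
tokens([C|Cs]) when C >= $0, C =< $9 ->
    digits(Cs, C - $0);
tokens([C|Cs]) when C >= $a, C =< $z; C >= $A, C =< $Z ->
    word(Cs, [C]);
tokens([$/|Cs]) -> [slash | tokens(Cs)];
tokens([$-|Cs]) -> [dash  | tokens(Cs)];
tokens([$,|Cs]) -> [comma | tokens(Cs)].
%% anything else will crash, as in the Haskell version

digits([C|Cs], N) when C >= $0, C =< $9 ->
    digits(Cs, N*10 + (C - $0));
digits(Cs, N) ->
    [{int,N} | tokens(Cs)].

word([C|Cs], W) when C >= $a, C =< $z; C >= $A, C =< $Z ->
    word(Cs, [lowercase(C)|W]);
word(Cs, W) ->
    [{word,lists:reverse(W)} | tokens(Cs)].

lowercase(C) when C >= $A, C =< $Z -> C - $A + $a;
lowercase(C) -> C.

Back to the Haskell: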
valid_month :: String -> Int
valid_month "jan" = 1
valid_month "january" = 1
...
valid_month "december" = 12
-- anything else will crash
string_to_date :: [Char] -> (Int,Int,Int)
string_to_date cs =
    case tokens cs of
        [TInt d, TSlash, TInt m, TSlash, TInt y] -> check y m d
        [TInt y, TDash,  TInt m, TDash,  TInt d] -> check y m d
        [TInt d, TWord m, TInt y]                -> check y (valid_month m) d
        [TWord m, TInt d, TComma, TInt y]        -> check y (valid_month m) d
        [TWord m, TInt d, TInt y]                -> check y (valid_month m) d
        -- anything else will crash
check :: Int -> Int -> Int -> (Int,Int,Int)
-- left as a boring exercise for the reader.
Converting this to Erlang is also a trivial exercise for the reader.
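Again for concreteness, an Erlang rendering (mine), assuming the
token representation from the sketch above; check/3 stays an
exercise:

string_to_date(Cs) ->
    case tokens(Cs) of
        [{int,D},slash,{int,M},slash,{int,Y}] -> check(Y, M, D);
        [{int,Y},dash, {int,M},dash, {int,D}] -> check(Y, M, D);
        [{int,D},{word,M},{int,Y}]            -> check(Y, valid_month(M), D);
        [{word,M},{int,D},comma,{int,Y}]      -> check(Y, valid_month(M), D);
        [{word,M},{int,D},{int,Y}]            -> check(Y, valid_month(M), D)
    end.
%% anything else will crash, just as intended

valid_month("jan")      -> 1;
valid_month("january")  -> 1;
%% ... and so on ...
valid_month("december") -> 12.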
You will notice that there are multiple scanning functions and
no 'what state am I in?' parameter. Your scanner should KNOW
what state it is in because it knows what function is running.
Yecc is a great tool, but for something like this there's no
real point in it, and even for something like JSON I would
rather not use it.
One thing that Leex and Yecc can do for you
is to help you track source position for reporting
errors. For a configuration file, it may be sufficient to
just say "Can't parse configuration file X as JSON."
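If you do want positions in a hand-written scanner, the same
factoring handles it: thread a line count through every function.
A sketch of the idea (mine, not that page's), showing only the
whitespace and digit clauses; the others follow the same pattern:

tokens([], _Line) -> [];
tokens([$\n|Cs], Line) ->
    tokens(Cs, Line+1);                 % the only interesting change
tokens([C|Cs], Line) when C =:= $\s; C =:= $\t ->
    tokens(Cs, Line);
tokens([C|Cs], Line) when C >= $0, C =< $9 ->
    digits(Cs, C - $0, Line);
tokens([C|_Cs], Line) ->
    erlang:error({unexpected_character, C, {line,Line}}).

digits([C|Cs], N, Line) when C >= $0, C =< $9 ->
    digits(Cs, N*10 + (C - $0), Line);
digits(Cs, N, Line) ->
    [{int,N,Line} | tokens(Cs, Line)].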
OK, the technique I used above is "recursive descent",
which works brilliantly for LL(k) languages with small k.
But you knew that.
Oh yes, this does mean that writing a parser is just like
writing a lexical analyser, except that you get to use
general recursion. Again, you typically have (at least)
one function per non-terminal symbol, plus (if your
original specification used extended BNF) one function
per repetition.
Heck.
s expression
    = word
    | "(", [s expression+, [".", s expression]], ")".

data SExpr
   = Word String
   | Cons SExpr SExpr
   | Nil
-- assumes Token is extended with TLp, TRp, and TDot
s_expression :: [Token] -> (SExpr, [Token])
s_expression (TWord w : ts)   = (Word w, ts)
s_expression (TLp : TRp : ts) = (Nil, ts)
s_expression (TLp : ts)       = s_expr_body ts

s_expr_body :: [Token] -> (SExpr, [Token])
s_expr_body (TRp : ts) = (Nil, ts)
s_expr_body (TDot : ts) =
    let (e, TRp : ts') = s_expression ts
    in (e, ts')
s_expr_body ts =
    let (f, ts')  = s_expression ts
        (r, ts'') = s_expr_body ts'
    in (Cons f r, ts'')
This is so close to JSON that handling JSON without
"objects" should now be straightforward. And it makes
a good development step.
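To make that development step concrete, here is a sketch of the
scalars-and-arrays fragment in Erlang. The token names ({int,N},
{string,S}, lbracket, rbracket, comma, and the atoms true, false,
null) are my own; a tokeniser in the style shown earlier would
produce them. Note one function for the non-terminal and one for
the repetition, as promised:

value([{int,N} | Ts])    -> {N, Ts};
value([{string,S} | Ts]) -> {S, Ts};
value([true  | Ts])      -> {true, Ts};
value([false | Ts])      -> {false, Ts};
value([null  | Ts])      -> {null, Ts};
value([lbracket, rbracket | Ts]) -> {[], Ts};
value([lbracket | Ts]) ->
    {V, Ts1} = value(Ts),
    elements(Ts1, [V]).
%% anything else will crash

%% the repetition: ", value" until the closing bracket
elements([rbracket | Ts], Acc) ->
    {lists:reverse(Acc), Ts};
elements([comma | Ts], Acc) ->
    {V, Ts1} = value(Ts),
    elements(Ts1, [V | Acc]).

As with s_expression, each function takes a list of tokens and
returns the value it parsed together with the tokens left over.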