[erlang-questions] json to map
Richard A. O'Keefe
ok@REDACTED
Thu Aug 27 04:41:04 CEST 2015
On 26/08/2015, at 5:56 pm, Roelof Wobben <r.wobben@REDACTED> wrote:
> the exact text of the challenge is here :
>
> Configuration files can be conveniently represented as JSON terms.
Yuck. This has "representation" backwards.
Here's what we have, using [thing] for "things"
and (action) for "processes".
In our heads                        In our processes              In our file system

[abstract                           [stored
 configuration      -(implement)->   configuration
 value]                              data]
     |                                   |
(conversion)                        (programmed conversion)
     |                                   |
     v                                   v
[abstract
 JSON value]        -(implement)->  [stored JSON value]
     |                                   |
(unparse)                           (programmed unparsing)
     |                                   |
     v                                   v
[abstract token
 sequence]          -(implement)->  [stored token sequence]
     |                                   |
(layout + unlex)                    (programmed layout + unlex)
     |                                   |
     v                                   v
[abstract character
 sequence]          -(implement)->  [stored character sequence]
     |                                   |
(Unicode encoding)                  (programmed encoding)
     |                                   |
     v                                   v
[abstract byte
 sequence]          -(implement)->  [stored byte sequence]
     |                                   |
(compress,                          (programmed compression,
 encrypt,                            encryption, signing,
 sign, &c)                           and so on)
     |                                   |
     v                                   v
[another abstract                   [another stored
 byte sequence]                      byte sequence] ---(store)---> [FILE]

There is an ABSTRACT space of JSON terms.
Each of the arrows (down and right) is a "representation" arrow.
The thing at the tip of the arrow represents the thing at the
base of the arrow. The file we end up with is the thing that
does the representing (GIVEN this framework), and the
configuration data is what is represented.
Don't take "stored" too literally. A "stored" datum in the
middle column could be a data structure or a communication
pattern. I just mean that it's "inside the computer" in
the sense that it is directly accessible to code.
This diagram must commute, that is, whatever path you take
through the arrows, you must end up with *equivalent* things.
Not equal.
Converting configuration values to JSON values need not be
unique. For example, a set of n elements might be converted
to a JSON array without duplicates in n! ways. But we can
arrange to treat permuted arrays in certain contexts as
equivalent.
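
To make "equivalent, not equal" concrete, here is a minimal sketch,
assuming the JSON arrays have already been decoded to Erlang lists.
The module name set_equiv and the function same_set/2 are made up
for illustration; they are not part of the exercise.

```erlang
-module(set_equiv).
-export([same_set/2]).

%% Two lists that each encode a set are treated as equivalent when
%% they hold the same elements, whatever the order.
%% lists:usort/1 sorts and drops duplicates, which gives a canonical
%% form that can be compared with =:=.
same_set(A, B) when is_list(A), is_list(B) ->
    lists:usort(A) =:= lists:usort(B).
```

With this, same_set([<<"a">>, <<"b">>], [<<"b">>, <<"a">>]) is true
even though the two lists are not equal as lists.
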
Converting JSON values to token sequences is not unique.
For example, a JSON object doesn't *have* any order to it,
but for unparsing, you have to pick an order. Given an
object with n pairs, there are n! ways to order them.
We can arrange to treat those as equivalent.
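
Erlang maps happen to give this equivalence for free. A small sketch,
assuming an object has been read as a list of {Key, Value} pairs in
whatever order the text used (object_order and same_object/2 are
hypothetical names):

```erlang
-module(object_order).
-export([same_object/2]).

%% Maps have no member order, so two pair lists that differ only in
%% order build equal maps. Note that maps:from_list/1 keeps the last
%% value when a key is repeated; rejecting duplicates is a separate
%% decision this sketch does not make.
same_object(PairsA, PairsB) ->
    maps:from_list(PairsA) =:= maps:from_list(PairsB).
```
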
Unlexing, converting tokens to character sequences, is not
unique. 1, 1e0, 10e-1, 0.1e1, &c are the same, so even
without allowing leading zeros there are hundreds of
ways (but not infinitely many ways) to represent a number
token. Most Unicode characters can be represented in two
ways (/ can be represented in three), so a string of n
characters can be unlexed in at least 2**n ways. (It's
worse than that because \u002f and \u002F are equivalent,
so / has four alternatives.)
Layout can insert arbitrary amounts of white space between tokens,
and there are infinitely many ways to do that.
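
The number case matters in Erlang because list_to_float/1 insists on
a fraction part, so "1" and "1e0" cannot be handed to it directly.
Here is a rough sketch of one way to map several spellings to the
same float. It assumes the text has already been accepted as a JSON
number by a tokeniser and does no validation of its own; json_number
and to_float/1 are illustrative names.

```erlang
-module(json_number).
-export([to_float/1]).

%% Convert the text of a JSON number token to an Erlang float.
%% The text is rewritten into the shape list_to_float/1 accepts
%% (mantissa with a fraction part, then an exponent).
%% Assumption: Text is already a well-formed JSON number.
to_float(Text) ->
    {Mantissa, Exponent} =
        case string:tokens(Text, "eE") of
            [M]    -> {M, "0"};      % no exponent given
            [M, E] -> {M, E}
        end,
    Mantissa1 =
        case lists:member($., Mantissa) of
            true  -> Mantissa;
            false -> Mantissa ++ ".0"  % supply the missing fraction
        end,
    list_to_float(Mantissa1 ++ "e" ++ Exponent).
```

Each of to_float("1"), to_float("1e0"), to_float("10e-1") and
to_float("0.1e1") returns 1.0.
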
There are multiple definitions of JSON. ECMA 404 stops at
the level of Unicode character sequences, and has nothing
to say about encoding. There are LOTS of encodings.
There are also many compression, encryption, and digital
signature algorithms, which can be freely composed.
JSON qua JSON has nothing to say about how files are encoded
or whether they are compressed, encrypted, or signed. But
to put text into a file, you have to encode it somehow, and
you have to make some decision about other matters. (And
don't get me onto file systems with fixed length records,
where you have to figure out how to fit a 1 million character
string into 128 byte records...)
> Write
> some functions to read configuration files containing JSON terms and
> turn them into Erlang maps.
What if a configuration file represents this JSON term:
[["target","some program"],
["source","some other program"],
["date",[2015,8,27,14,5]],
["gibberish",[3,1,4,1,5,9,2,7]]]
How are you supposed to convert *that* to an Erlang map?
In any way that makes sense?
Oh, I know:
{"": <<"[[\"target...7]]]">>}
or whatever the syntax for maps is.
It technically satisfies the requirements!
The first thing to do with these exercises is CRITICISE them.
I do not mean to sneer at them and throw them away, but to
start from a presupposition that the language is muddled,
the contents confused, and the requirements either incomplete
or inconsistent. (Like practically *every* requirement we
start with including some published standards. I'm looking
at you, ECMA 404!)
I am not kidding. You have to start out by trying to
understand the requirements, EXPECTING to find problems,
RESOLVING them, and writing down REVISED requirements
that spell out everything you actually need to know.
For example, you might include the following:
- Only the UTF-8 encoding is to be supported.
- No compression, encryption, or signing are to be supported.
- You may assume that the file system treats a file as
an arbitrary sequence of bytes with no record boundaries.
- You are to convert null, false, true to the Erlang atoms
'null', 'false', 'true'.
- You are to convert JSON numbers to Erlang floats.
- You are to convert JSON strings to Erlang binaries.
- You are to convert JSON arrays to Erlang lists;
nothing else is to be converted to a list.
- You are to convert JSON objects to Erlang maps;
nothing else is to be converted to a map.
- You are not to worry about inverting the conversion
from configuration data to JSON terms; there is no
configuration data, that was just put in to make it
interesting.
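
Revised requirements like these are precise enough to write down as
code. The following is only a sketch of a checker for the target
shape, under the mapping listed above plus one extra assumption that
is not in the list: object keys, being JSON strings, become binaries.
The names json_shape and is_json_term/1 are hypothetical.

```erlang
-module(json_shape).
-export([is_json_term/1]).

%% True when Term has exactly the shape the revised requirements
%% allow: the atoms null/false/true, floats, binaries, lists of such
%% terms, and maps from binaries to such terms.
%% Integers and other atoms are rejected.
%% An improper list such as [1.0 | 2.0] crashes lists:all/2 rather
%% than returning false.
is_json_term(null)  -> true;
is_json_term(false) -> true;
is_json_term(true)  -> true;
is_json_term(X) when is_float(X)  -> true;
is_json_term(X) when is_binary(X) -> true;
is_json_term(X) when is_list(X) ->
    lists:all(fun is_json_term/1, X);
is_json_term(X) when is_map(X) ->
    lists:all(fun({K, V}) -> is_binary(K) andalso is_json_term(V) end,
              maps:to_list(X));
is_json_term(_) -> false.
```

Because strings must be binaries, an Erlang string like "abc" fails
the check: it is a list of integers, and integers are not allowed.
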
> Write some code to perform sanity checks
> on the data in the configuration files.
Here is another piece of confusion/incompleteness, or
possibly even questionable advice.
This presupposes some procedure where you FIRST convert
a JSON text stored in a file to some Erlang term and
THEN you check the sanity. Or at least, it seems to.
Another approach is to check as you go so that there is
never any insane Erlang data at all.
This is highly topical, because we've recently seen a
bunch of serious Android security bugs caused by
overly trusting object deserialisation which allowed
objects to be constructed violating their invariants.
In fact this has triggered a burst of work on my
Smalltalk system, because I had a great big OOPS:
oh dear, I have the same problem. So I'm now slogging
through nearly a thousand files turning comments
about invariants into executable code and writing
invariants for the *shameful* number of classes that
had none, so that the deserialisation code can call
each newly reconstructed object's #invariant method
before trusting it.
So I strongly recommend validating data as you parse
it, and if a sanity check is failed, crash immediately.
This leaves nothing for subsequent sanity checks to do.
UNLESS you have configuration data that's converted to
JSON terms in such a way that not all terms represent
valid configuration data. But from what you quote,
you haven't been given anything for sanity checks like
that to DO.
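
As an illustration of checking as you go, here is a sketch of the
step that builds the map for a JSON object. Two assumptions are
mine, not the exercise's: keys arrive as binaries, and a repeated
key counts as insane input. The names checked_object and
from_pairs/1 are invented for the example.

```erlang
-module(checked_object).
-export([from_pairs/1]).

%% Build the map for one JSON object from its {Key, Value} pairs,
%% checking each pair as it is added. A bad pair crashes at once,
%% so a half-valid map is never returned to the caller.
from_pairs(Pairs) ->
    from_pairs(Pairs, #{}).

from_pairs([], Acc) ->
    Acc;
from_pairs([{Key, Value} | Rest], Acc) when is_binary(Key) ->
    case maps:is_key(Key, Acc) of
        true  -> error({duplicate_key, Key});
        false -> from_pairs(Rest, maps:put(Key, Value, Acc))
    end.
%% A non-binary key matches no clause and crashes with function_clause.
```
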
All things considered, the exercise appears to be a
cryptic way of saying "WRITE A JSON PARSER".
For what it's worth, my JSON parser in Smalltalk is
117 lines for a tokeniser + 45 lines for a parser.
Being stricter about the input would let me shave
about 20 lines off the total.
Much of the trickiness is in handling strings,
where JSON requires that a \u-escaped character outside
the Basic Multilingual Plane be written as a
surrogate pair.
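
The arithmetic for putting a surrogate pair back together is the
standard UTF-16 rule. A minimal sketch, assuming the two \uXXXX
escapes have already been read as integers (surrogate and combine/2
are made-up names):

```erlang
-module(surrogate).
-export([combine/2]).

%% Combine a high surrogate (16#D800..16#DBFF) and a low surrogate
%% (16#DC00..16#DFFF) into the code point they encode.
%% Anything outside those ranges matches no clause and crashes.
combine(Hi, Lo) when Hi >= 16#D800, Hi =< 16#DBFF,
                     Lo >= 16#DC00, Lo =< 16#DFFF ->
    16#10000 + ((Hi - 16#D800) bsl 10) + (Lo - 16#DC00).
```

For example, combine(16#D834, 16#DD1E) gives 16#1D11E, the musical
G clef.
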
Processing a sequence of characters as an Erlang
string will probably make your life simpler; and
processing a sequence of tokens as an Erlang list
will also be likely to make your life simpler.
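
To show what that looks like, here is a deliberately incomplete
tokeniser sketch that takes an Erlang string and returns a list of
tokens. It handles only punctuation, white space, the three literal
names, and strings without escapes; numbers and escapes are left
out, and the token shapes (atoms and {string, Binary}) are my own
choice rather than anything the exercise specifies.

```erlang
-module(json_tokens).
-export([tokens/1]).

%% Turn a list of characters into a list of tokens.
%% Input that fits none of the clauses crashes with function_clause;
%% an unterminated string crashes with badmatch.
tokens([]) ->
    [];
tokens([C | Cs]) when C =:= $\s; C =:= $\t; C =:= $\n; C =:= $\r ->
    tokens(Cs);                                   % skip white space
tokens([C | Cs]) when C =:= $[; C =:= $]; C =:= ${; C =:= $};
                      C =:= $,; C =:= $: ->
    [list_to_atom([C]) | tokens(Cs)];             % '[', ']', and so on
tokens("true"  ++ Cs) -> [true  | tokens(Cs)];
tokens("false" ++ Cs) -> [false | tokens(Cs)];
tokens("null"  ++ Cs) -> [null  | tokens(Cs)];
tokens([$" | Cs]) ->
    %% Assumption: no backslash escapes inside the string.
    {Chars, [$" | Rest]} = lists:splitwith(fun(C) -> C =/= $" end, Cs),
    [{string, unicode:characters_to_binary(Chars)} | tokens(Rest)].
```

A parser can then pattern match on the head of the token list in
the same way this function matches on the head of the character
list.
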