[eeps] JSON

Richard O'Keefe ok@REDACTED
Mon Jul 4 07:41:25 CEST 2011


On 1/07/2011, at 6:05 PM, Paul Davis wrote:
> I think the issue here between our interpretations is that I've not
> yet fully captured Erlang zen.

It's not so much >Erlang< zen as >Software Engineering< zen.

The point here is that the specification is incomplete,
and that there is more than one *reasonable* way to complete it.

Here is the key line from my Smalltalk implementation,
where
    d is the dictionary I'm building,
    k is the current key,
    v is the value that's just been read for it.

	d at: k put: v.

That is a reasonable choice.  But

	d at: k ifAbsentPut: [v]

is *also* a reasonable choice.  In this case I'm building the
dictionary up from left to right.  Another reasonable choice
would be to store the keys and values on stacks, and then at
the closing } do

	d := Dictionary new: keys size.
	[keys isEmpty] whileFalse: [
	    d at: keys removeLast put: values removeLast].

which would once again have the effect of choosing the first
value for a key, not the last.

I met this particular issue in the mailing list for HTML Tidy,
where the question came up "what to do when an attribute is
repeated in an element?"  It turned out that some browsers
took the first value for a repeated attribute and some took
the last, so that there was no one right thing to do that would
be compatible with *every* browser.

Let me now give you a specific example of this effect.
There is a really fantastic program for statistics/data mining/
bioinformatics/finance/... called R.  It's free, just like Erlang.
It has a huge library.

m% R
> library("rjson")
> x <- fromJSON('{"a":1, "a":2}')
> x
$a
[1] 1

$a
[1] 2
> x$a
[1] 1

So it has returned all the key:value pairs in the input in the order
they appeared in the source (rather like my hypothetical proplist),
but when you go looking for a particular key,

	the value you get is the FIRST, not the last.

So now we have an existence proof for these claims:

	the JSON parsers listed on the www.json.org page
	DO NOT AGREE about whether to retain all values for
	a key or only one of them

	the JSON parsers listed on the www.json.org page
	DO NOT AGREE about which value to return for a repeated
	key when you ask for that key in the result of a parse.

I have neither the time nor the inclination to make an exhaustive
examination of all the parsers on that page; I suspect we'd uncover
some more distinct behaviours.

If you are in the unlucky situation that whatever you do you are
going to be incompatible with *someone*, it's not really kind to
be quiet about that.

> In terms of Python it makes solid sense
> that the last k/v pair in byte order of the JSON stream is the winner.
> In Erlang there's no magical dict type that dictates this. I say that
> specifically because there is a dict module *and* a proplist module.
> They do similar things but are not inherit to the language like Python
> dict's.

I cannot parse "are not inherit" unless "inherit" should be "inherent".

Let's just take 'dict'.  There are *two* ways to read a JSON object
using dict, and both of them are natural.
(a) Create an empty dict, and thread it through read key read value
    dict:store/3.
    Do it this way, and you will get the rightmost value for a key.
(b) Create a list of {key,value} pairs, pushing new pairs on in the
    easy place as you go, and at the end call dict:from_list/1.
    Do it this way and you will get the leftmost value for a key.
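
Concretely, here is a minimal sketch of both strategies, assuming the
pairs have already been scanned out of '{"a":1, "a":2}' in source
order as Pairs = [{<<"a">>,1},{<<"a">>,2}] (the function names are
mine, not from any EEP):

	%% (a) Thread a dict through the parse with dict:store/3.
	%% Later bindings overwrite earlier ones: the LAST value wins.
	store_wins(Pairs) ->
	    lists:foldl(fun({K,V}, D) -> dict:store(K, V, D) end,
	                dict:new(), Pairs).

	%% (b) Push each pair onto the head of a list as it is read
	%% (the easy place), then call dict:from_list/1 at the end.
	%% The accumulated list is in REVERSE source order, and the
	%% stdlib's dict:from_list/1 stores its pairs left to right,
	%% so the FIRST value in the source wins.
	from_list_wins(Pairs) ->
	    Stack = lists:foldl(fun(P, Acc) -> [P|Acc] end, [], Pairs),
	    dict:from_list(Stack).

With Pairs as above, dict:fetch(<<"a">>, store_wins(Pairs)) gives 2,
while dict:fetch(<<"a">>, from_list_wins(Pairs)) gives 1.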

Whether the data type is built into the syntax of the language or not
is really not terribly relevant.  The thing is that two reasonable people
writing JSON parsers in reasonable ways can reasonably end up disagreeing
about what to do with repeated keys, and no matter what choice we make,
we *WILL* disagree with other JSON parsers out in the wild.

By the way, I cannot agree that taking the last binding "makes solid
sense in Python" if that is meant in an exclusive sense.  Taking the
first binding *ALSO* makes solid sense in Python.  It happens not to
be what {a:1, a:2} does, but then, dict(a=1, a=2) does something different.
dict(a=1, b=2) makes the same thing that {"a":1, "b":2} does;
dict(a=1, a=2) is a syntax error.  So we can certainly argue that
making repeated keys a parse error is fully consistent with the spirit of
Python.

I'm not arguing that we *shouldn't* take the last binding;
I'm arguing that choosing the last binding is not an easy,
an automatic, or an everywhere compatible choice.


>>> The other important bit of this is that other
>>> languages don't complain on repeated keys. JavaScript, Ruby, and
>>> Python all just end up using the last defined value.

As noted above, this is NOT the case in Python if you use
dict(key=val, ...) syntax.  In the Pharo implementation of
Smalltalk,
	Dictionary newFrom: {#a -> 1. #a -> 2}
raises an exception.

>> 
>> In the case of JavaScript, this really does seem to be a property
>> of the *language*:
>> js> var y = {a: 1, a: 2, b: 3, b: 4};
>> js> y.a;
>> 2
>> js> y.b;
>> 4
> 
> I'm not entirely certain what you're disagreeing with here.

Absolutely nothing.

> My point
> was that the right most defined value is what ends up "winning" the
> race. If you consider writing a parser that ends up parsing out a key
> and value, and then just does something like "obj.store(k, v)" and
> doesn't check that it overwrote something, the behavior is
> unsurprising.

And my general point is that there are *other* equally natural ways
of parsing JSON that have *different* outcomes, *whatever* the host
hash tables do.  I'm not saying that taking the rightmost binding
is *surprising*, just that it isn't *universal*.

> 
>> In the case of Ruby and Python, it's a property of a library, not
>> of the language.  The 'json' module that comes with python does
>> this.  But it also does things that I regard as undesirable:
>>        json.loads('{a:1}')
>> dies horribly with a somewhat misleading error message.  Nothing
>> stops there being other JSON parsers for Python (and there are),
>> and nothing stops them making another choice.
>> 
> 
> I'm also not sure here. The example you provide is '{a:1}' which is
> invalid JSON because the key is not a proper string.

I know.  But it is perfectly legal Javascript, and it is perfectly
obvious what it means, and there are other JSON parsers that accept it.
More importantly, I've seen alleged "JSON" data that used unquoted
keys.

Why strain out the gnat of unquoted keys and swallow the camel of
duplicate keys?  If you want ambiguous illegal JSON quietly accepted,
why not want unambiguous illegal JSON quietly extended too?

Why is it important to accept JSON data with duplicate keys?
How common is such data?
On what grounds can you be sure that the originator intended the
rightmost binding to be used?  (If so, why did they put a duplicate
there in the first place?)

> Interestingly
> enough, the Ruby JSON parser seems to insist on a top level array or
> object that neither JavaScript or Python do.

The JSON specification says that a top level thing must be an array
or an object; presumably so that it will be self-delimiting, although
in that case I would have thought that strings would be OK.

R's rjson library also allows top level numbers, keywords, and strings.

> Something I should've mentioned in my earlier email is that most
> definitely the EEP should in under no circumstance "assume that data
> originated from Erlang".

I never said any such thing.  However, it is fair enough to assume
that the data *MIGHT* have originated from Erlang (or R, or C).
> 
>> I can see two ways around this.
>> 
>> (1) Follow the herd, and quietly take the last value for a repeated key.
>>    I have to admit that the JSON parser I wrote in Smalltalk does
>>    exactly this.

Now that I know that there ISN'T any unanimity, it is less clear to me than
ever that we should do this.

>> (2) Add another option {repeats,first|last|error} with default last.
>>    This would be my preference.
> 
> My only experience is CouchDB related. We pretty much ignore repeated
> keys.

But you cannot ignore them.  You have to do *something* with them.
Take the first, take the last, keep them all, none of these counts as
*ignoring* repeated keys.
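
To make option (2) concrete, here is a minimal sketch of what a
decoder might do with the pairs of a completed object, given a
Repeats option of first, last, or error (the function names are
hypothetical, not from the EEP):

	%% Fold each {Key,Value} pair, in source order, into a dict,
	%% resolving duplicates according to the Repeats option.
	build_object(Pairs, Repeats) ->
	    lists:foldl(fun(P, D) -> add_pair(P, D, Repeats) end,
	                dict:new(), Pairs).

	add_pair({K,V}, D, last) ->
	    dict:store(K, V, D);               %% overwrite: last wins
	add_pair({K,V}, D, first) ->
	    case dict:is_key(K, D) of
	        true  -> D;                    %% keep the first binding
	        false -> dict:store(K, V, D)
	    end;
	add_pair({K,V}, D, error) ->
	    case dict:is_key(K, D) of
	        true  -> erlang:error({duplicate_key, K});
	        false -> dict:store(K, V, D)
	    end.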

> If someone sends us data that has them, we (more or less) store
> it and will send it back. We're a bit of an awkward case because we
> reserve a key namespace to do this. I don't really have a solid
> argument on the repeats issue other than to ask "how do we define this
> given that no one else seems to care".

The requirement in RFC 4627 is just that "names within an object SHOULD
be unique", but "SHOULD" means (RFC 2119) that this rule may be broken
if someone thinks they have a really good reason and have thought about
it for a bit.  I've just sent e-mail to Douglas Crockford asking for
clarification.

You see, one thing strikes me about CouchDB or any other JSON database.
There are two draft proposals,
   http://tools.ietf.org/html/draft-pbryan-zyp-json-pointer-00
     JSON pointers,
   http://tools.ietf.org/html/draft-pbryan-json-patch-01
     JSON patches,
which seem like things that a JSON database might need to support:
"give me this part of this thing" and
"make these changes to these parts of this thing"
and I do not see how they can possibly be expected to work if an object
has duplicate keys.

> I kinda sorta agree. But I would argue that we must absolutely write
> any sort of spec on numbers in terms of contemplating complete
> ignorance of the source of the JSON. The proposition for saying "parse
> these numbers in such and such a way" promote the implicit assumption
> of "I know what this data is".

There are at least two different "styles" of using JSON data.
(1) JSON database: take what you are given and give it back
(2) JSON protocol: here is a message to be acted on, I know exactly
    what I'm supposed to do with it.  (E.g., json-rpc.org.)

If you are using a JSON protocol, you had *BETTER* know what the data are.

> As a core library I don't think that's
> a point of view that should be promoted.

I do not think that a core library should be forcing copying-and-conversion
when the client IS in a position to say what it is expecting.

I note, for example, that the Go library for JSON has as one of its
modes of operation "decode a JSON term into this Go object right here".

>> The *right* answer of course is to use frames (Joe Armstrong's
>> "proper structs"), so it *should* be <{}>.
>> 
> 
> I admit that I haven't worked with Erlang full time. I've probably
> been using it in a hobby capacity for about three years now. Perhaps
> I'm missing something huge but I can't even attempt to say what a
> "frame" or "proper struct" or "<{}>" means. If its the proposed syntax
> for 'Python dicts' (no better reference point) then cool.

Frames/structs have roots older than Python (the appropriate reference
might well be Ait-Kaci's LIFE and psi-terms).  But yes, except immutable
like all Erlang data.  I was taking the opportunity to plug this once
again: it really is time those were in the language.

> Bottom line is that as someone not super familiar with Erlang, I tend
> to think that V =:= term_to_binary(binary_to_term(V)) would evaluate
> to true.

Why would you think that?  It has never been so.

1> V = <<>>.
<<>>
2> V =:= term_to_binary(binary_to_term(V)).
** exception error: bad argument
     in function  binary_to_term/1
        called as binary_to_term(<<>>)
3> 

>  I'm sure there are cases that can be constructed that violate
> that, but in general I expect things going through the external term
> format to come back identical.

That's a DIFFERENT expression:

T =:= binary_to_term(term_to_binary(T)).
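
and that one does evaluate to true for ordinary data terms:

1> T = {foo, [1,2,3], <<"bar">>}.
{foo,[1,2,3],<<"bar">>}
2> T =:= binary_to_term(term_to_binary(T)).
true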

> That could surely be a misinterpretation, but in my experience the
> "to_term" connotation denotes something more than an "encoding".

What experience are you talking about here?
The only "to_term" function I can find in stdlib or kernel is
binary_to_term/1, which is not total, and is nothing other than
a decoding.
> 
> An alternative way to put it would be that not everyone knows the
> Prolog history. I see what term_to_binary does, and I wouldn't blame
> people for being misguided by a "term_to_json".

term_to_binary/1 is total, and is nothing more than an encoding.

atom_to_binary/2 is not in principle total, and is nothing more than
an encoding.

atom_to_list/1 is total, and is nothing more than an encoding.

binary_to_atom/2 is not total, and is nothing more than a decoding.

binary_to_list/1 is total.
list_to_binary/1 is not.

binary_to_term/2 is not total.

float_to_list/1 is total (on floats), but
list_to_float/1 is not.

integer_to_list/[1,2] are total (on integers), but
list_to_integer/[1,2] are not.

erlang:ref_to_list/1 doesn't have a documented
erlang:list_to_ref/1, and trying it, there doesn't
seem to be an undocumented one either. 

and so it goes.  "_to_" does not, *IN ERLANG*, connote isomorphism
or totality or anything other than some sort of partial one-way
conversion.
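
A quick shell check makes the asymmetry plain for the float case:

1> float_to_list(1.0).
"1.00000000000000000000e+00"
2> list_to_float("1").
** exception error: bad argument
     in function  list_to_float/1
        called as list_to_float("1")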


>>> For instance, consider pattern matching a term returned from
>>> json:decode. One of the proposals is to convert keys to atoms when
>>> they can be converted. This means your return value might have some
>>> keys as binaries and some as atoms.
>> 
>> And if you have an atom in a pattern, that key will be that atom.
> 
> Sure, but how do you write your pattern matching and guards?

In the usual way.  

> Including
> a "is_atom" guard could cause functions to fail unexpectedly if
> someone sent a unicode string. How do you explain to someone not
> familiar with the situation why a key isn't an atom in that case?

If someone isn't familiar with the situation, why do they want to
know, and would they understand the explanation if offered one?

What I am saying is that IF you write a pattern containing
explicit atoms AND you ask for keys to be converted using
list_to_existing_atom/1 THEN you won't get any nasty surprises.
Keys that you DON'T mention might arrive as atoms or might
not, but if you knew enough about them to care, you would
mention them somewhere in your program, and there would be no
problem.
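
For example (a sketch only: I'm assuming, for the sake of the example,
that an object decodes to {obj, [{Key,Value}]} and that keys are
converted with list_to_existing_atom/1 when possible):

	%% 'age' occurs in this pattern, so the atom exists, and a
	%% decoder using list_to_existing_atom/1 will deliver the key
	%% "age" as the atom 'age'.  Keys this module never mentions
	%% may arrive as binaries instead, and that's fine: we never
	%% match on them.
	age_of({obj, Fields}) ->
	    case lists:keyfind(age, 1, Fields) of
	        {age, Age} when is_integer(Age) -> Age;
	        false                           -> undefined
	    end.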

Again, if you are thinking of JSON in terms of *protocols*, this is
a good approach.  If you are thinking of JSON in terms of data for
a data base, not so.  But neither of these views should be privileged
over the other, which is why the EEP offers more than one approach.



