[eeps] JSON

Paul Davis paul.joseph.davis@REDACTED
Mon Jul 4 19:39:28 CEST 2011


On Mon, Jul 4, 2011 at 1:41 AM, Richard O'Keefe <ok@REDACTED> wrote:
>
> On 1/07/2011, at 6:05 PM, Paul Davis wrote:
>> I think the issue here between our interpretations is that I've not
>> yet fully captured Erlang zen.
>
> It's not so much >Erlang< zen as >Software Engineering< zen.
>

Heh.

> The point here is that the specification is incomplete,
> and that there is more than one *reasonable* way to complete it.
>
> Here is the key line from my Smalltalk implementation,
> where
>    d is the dictionary I'm building,
>    k is the current key,
>    v is the value that's just been read for it.
>
>        d at: k put: v.
>
> That is a reasonable choice.  But
>
>        d at: k ifAbsentPut: [v]
>
> is *also* a reasonable choice.  In this case I'm building the
> dictionary up from left to right.  Another reasonable choice
> would be to store the keys and values on stacks, and then at
> the closing } do
>
>        d := Dictionary new: keys size.
>        [keys isEmpty] whileFalse: [
>            d at: keys removeLast put: values removeLast].
>
> which would once again have the effect of choosing the first
> value for a key, not the last.
>
> I met this particular issue in the mailing list for HTML Tidy,
> where the question came up "what to do when an attribute is
> repeated in an element?"  It turned out that some browsers
> took the first value for a repeated attribute and some took
> the last, so that there was no one right thing to do that would
> be compatible with *every* browser.
>
> Let me now give you a specific example of this effect.
> There is a really fantastic program for statistics/data mining/
> bioinformatics/finance/... called R.  It's free, just like Erlang.
> It has a huge library.
>
> m% R
>> library("rjson")
>> x <- fromJSON('{"a":1, "a":2}')
>> x
> $a
> [1] 1
>
> $a
> [1] 2
>> x$a
> [1] 1
>
> So it has returned all the key:value pairs in the input in the order
> they appeared in the source (rather like my hypothetical proplist),
> but when you go looking for a particular key,
>
>        the value you get is the FIRST, not the last.
>
> So now we have an existence proof for these claims:
>
>        the JSON parsers listed on the www.json.org page
>        DO NOT AGREE about whether to retain all values for
>        a key or only one of them
>
>        the JSON parsers listed on the www.json.org page
>        DO NOT AGREE about which value to return for a repeated
>        key when you ask for that key in the result of a parse.
>
> I have neither the time nor the inclination to make an exhaustive
> examination of all the parsers on that page; I suspect we'd uncover
> some more distinct behaviours.
>
> If you are in the unlucky situation that whatever you do you are
> going to be incompatible with *someone*, it's not really kind to
> be quiet about that.
>

So, R is weird and other languages may or may not use conditional puts
to their hash/dict/object structure of choice.

My point about *Erlang* zen is that I *rarely* use anything that has
this behavior other than ets. I know the dict module exists, but I
treat it like the annoying cousin at Christmas. The proplists module,
which seems to be much more ubiquitous in Erlang code, is the data
structure that all Erlang JSON decoders have adopted. Seeing as a
proplist does nothing to forbid repeated keys, there is no motivation
to deal with them explicitly, in my experience.
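
For instance, nothing stops a proplist from holding repeated keys,
and the stdlib lookups simply take the first match:

1> P = [{a, 1}, {a, 2}].
[{a,1},{a,2}]
2> proplists:get_value(a, P).
1
3> proplists:get_all_values(a, P).
[1,2]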

>> In terms of Python it makes solid sense
>> that the last k/v pair in byte order of the JSON stream is the winner.
>> In Erlang there's no magical dict type that dictates this. I say that
>> specifically because there is a dict module *and* a proplist module.
>> They do similar things but are not inherit to the language like Python
>> dict's.
>
> I cannot parse "are not inherit" unless "inherit" should be "inherent".

Yep, meant inherent.

>
> Let's just take 'dict'.  There are *two* ways to read a JSON object
> using dict, and both of them are natural.
> (a) Create an empty dict, and thread it through read key read value
>    dict:store/3.
>    Do it this way, and you will get the rightmost value for a key.
> (b) Create a list of {key,value} pairs, pushing new pairs on in the
>    easy place as you go, and at the end call dict:from_list/1.
>    Do it this way and you will get the leftmost value for a key.
>
> Whether the data type is built into the syntax of the language or not
> is really not terribly relevant.  The thing is that two reasonable people
> writing JSON parsers in reasonable ways can reasonably end up disagreeing
> about what to do with repeated keys, and no matter what choice we make,
> we *WILL* disagree with other JSON parsers out in the wild.
>

I would argue wholeheartedly that it most definitely is relevant. For
instance, take the Python dict that you seem to be familiar with.
Python code uses it *extensively*; I'm not even sure I could write
"Hello, World!" without using a dict in Python (j/k). On the other
hand, I haven't read *that* much Erlang, but the dict module is just
not something I see a lot; proplists seem to be generally preferred.
Seeing as we have proplists, and they place no restriction on key
uniqueness, why should we worry about it?
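
To make your two dict readings concrete, a minimal sketch (it leans
on the fact that dict:from_list/1 lets later list entries overwrite
earlier ones, which is what the stdlib implementation does by folding
dict:store/3 over the list):

    Pairs = [{a, 1}, {a, 2}],                     % parse order

    %% (a) thread dict:store/3 through the parse: rightmost wins.
    D1 = lists:foldl(fun({K, V}, D) -> dict:store(K, V, D) end,
                     dict:new(), Pairs),
    {ok, 2} = dict:find(a, D1),

    %% (b) push each pair onto a list head while parsing (so the list
    %% ends up in reverse parse order), then dict:from_list/1:
    %% leftmost wins.
    D2 = dict:from_list(lists:reverse(Pairs)),
    {ok, 1} = dict:find(a, D2).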

> By the way, I cannot agree that taking the last binding "makes solid
> sense in Python" if that is meant in an exclusive sense.  Taking the
> first binding *ALSO* makes solid sense in Python.  It happens not to
> be what {a:1, a:2} does, but then, dict(a=1, a=2) does something different.
> dict(a=1, b=2) makes the same thing that {"a":1, "b":2} does;
> dict(a=1, a=2) is a syntax error.  So we can certainly argue that
> making repeated keys a parse error is fully consistent with the spirit of
> Python.
>
> I'm not arguing that we *shouldn't* take the last binding;
> I'm arguing that choosing the last binding is not an easy,
> an automatic, or an everywhere compatible choice.
>
>
>>>> The other important bit of this is that other
>>>> languages don't complain on repeated keys. JavaScript, Ruby, and
>>>> Python all just end up using the last defined value.
>
> As noted above, this is NOT the case in Python if you use
> dict(key=val, ...) syntax.  In the Pharo implementation of
> Smalltalk,
>        Dictionary newFrom: {#a -> 1. #a -> 2}
> raises an exception.
>

I'm sorry. I meant to say "taking the last key is the most obvious
choice for that particular parser". Having written JSON parsers, much
Python, and some Python extension modules, I find this the most
obvious behavior because there is no "ifAbsentPut" equivalent in the
C API. Choosing a different behavior would require extra effort.

I would also reiterate what you pointed out: the Python *syntax*
rejects repeated keyword arguments. But this still works fine in
Python:

>>> print {"a": 1, "a": 2}
{'a': 2}
>>> print {"a": 1, "a": 3}
{'a': 3}
>>> print {"a": 1, "a": 0}
{'a': 0}
>>> print {"a": 1, "a": -1}
{'a': -1}

As does this:

>>> dict(("a", 1) for i in range(3))
{'a': 1}
>>> dict(("a", i) for i in range(3))
{'a': 2}


My point about Python is that the most natural way to express a JSON
object is with a dict, which is part of the language. This type does
not allow repeated keys. While it's quite possible that someone might
argue about which of a repeated set to take, from the implementor's
point of view it really doesn't make sense to specify anything other
than the last occurrence.

On the other hand, Erlang (IMHO) prefers the proplists module which
does not make a statement about repeated keys. Thus the most natural
thing to do (IMHO) is to just keep the repeated keys.


>>>
>>> In the case of JavaScript, this really does seem to be a property
>>> of the *language*:
>>> js> var y = {a: 1, a: 2, b: 3, b: 4};
>>> js> y.a;
>>> 2
>>> js> y.b;
>>> 4
>>
>> I'm not entirely certain what you're disagreeing with here.
>
> Absolutely nothing.
>
>> My point
>> was that the right most defined value is what ends up "winning" the
>> race. If you consider writing a parser that ends up parsing out a key
>> and value, and then just does something like "obj.store(k, v)" and
>> doesn't check that it overwrote something, the behavior is
>> unsurprising.
>
> And my general point is that there are *other* equally natural ways
> of parsing JSON that have *different* outcomes, *whatever* the host
> hash tables do.  I'm not saying that taking the rightmost binding
> is *surprising*, just that it isn't *universal*.
>

Certainly.

>>
>>> In the case of Ruby and Python, it's a property of a library, not
>>> of the language.  The 'json' module that comes with python does
>>> this.  But it also does things that I regard as undesirable:
>>>        json.loads('{a:1}')
>>> dies horribly with a somewhat misleading error message.  Nothing
>>> stops there being other JSON parsers for Python (and there are),
>>> and nothing stops them making another choice.
>>>
>>
>> I'm also not sure here. The example you provide is '{a:1}' which is
>> invalid JSON because the key is not a proper string.
>
> I know.  But it is perfectly legal Javascript, and it is perfectly
> obvious what it means, and there are other JSON parsers that accept it.
> More importantly, I've seen alleged "JSON" data that used unquoted
> keys.
>

JavaScript object literals are not JSON. No one has actually seen
this type of JSON; it's like saying "I've seen a black and white
striped horse".

> Why strain out the gnat of unquoted keys and swallow the camel of
> duplicate keys?  If you want ambiguous illegal JSON quietly accepted,
> why not want unambiguous illegal JSON quietly extended too?
>
> Why is it important to accept JSON data with duplicate keys?
> How common is such data?
> On what grounds can you be sure that the originator intended the
> rightmost binding to be used?  (If so, why did they put a duplicate
> there in the first place?)
>

No. You're twisting the spec here. The spec specifically relaxes the
constraint on whether keys are unique, so we are well within our
rights not to enforce uniqueness. On the other hand, the spec
specifically says that keys are quoted.

More importantly, Python, Ruby, and even JavaScript's native JSON
parser reject unquoted keys.

>> Interestingly
>> enough, the Ruby JSON parser seems to insist on a top level array or
>> object that neither JavaScript nor Python does.
>
> The JSON specification says that a top level thing must be an array
> or an object; presumably so that it will be self-delimiting, although
> in that case I would have thought that strings would be OK.
>
> R's rjson library also allows top level numbers, keywords, and strings.
>

As do Python and (once again) even JavaScript's native JSON parser.

>> Something I should've mentioned in my earlier email is that most
>> definitely the EEP should under no circumstances "assume that data
>> originated from Erlang".
>
> I never said any such thing.  However, it is fair enough to assume
> that the data *MIGHT* have originated from Erlang (or R, or C).
>>
>>> I can see two ways around this.
>>>
>>> (1) Follow the herd, and quietly take the last value for a repeated key.
>>>    I have to admit that the JSON parser I wrote in Smalltalk does
>>>    exactly this.
>
> Now that I know that there ISN'T any unanimity, it is less clear to me than
> ever that we should do this.

In the end it really doesn't matter. Precisely because everyone else
does things slightly differently, we are not bound to any expected
behavior here. If no one can depend on which key is chosen, then I
don't see how we can be considered wrong for choosing any approach.

>
>>> (2) Add another option {repeats,first|last|error} with default last.
>>>    This would be my preference.
>>
>> My only experience is CouchDB related. We pretty much ignore repeated
>> keys.
>
> But you cannot ignore them.  You have to do *something* with them.
> Take the first, take the last, keep them all, none of these counts as
> *ignoring* repeated keys.
>

Sure you can. They round-trip just fine. The proplists module chooses
the first occurrence if it's something we care about. Beyond that, if
someone came to me and said "choosing the first is weird", I would
say "stop using repeated keys".
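
As a sketch of what I mean, assuming a hypothetical json module whose
decoder maps objects to proplists of binary keys and whose encoder
emits every pair it is given (the names and the representation are
mine, not the EEP's):

    Json = <<"{\"a\":1,\"a\":2}">>,
    Decoded = json:decode(Json),
    %% assumed: Decoded =:= [{<<"a">>, 1}, {<<"a">>, 2}]
    1 = proplists:get_value(<<"a">>, Decoded),
    %% both pairs come back out (assuming the encoder adds no whitespace)
    Json = json:encode(Decoded).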

>> If someone sends us data that has them, we (more or less) store
>> it and will send it back. We're a bit of an awkward case because we
>> reserve a key namespace to do this. I don't really have a solid
>> argument on the repeats issue other than to ask "how do we define this
>> given that no one else seems to care".
>
> The requirement in RFC 4627 is just that "names within an object SHOULD
> be unique", but "SHOULD" means (RFC 2119) that this rule may be broken
> if someone thinks they have a really good reason and have thought about
> it for a bit.  I've just sent e-mail to Douglas Crockford asking for
> clarification.
>

The last I heard, his response to this was along the lines of "I
never imagined someone would *actually* have repeated keys," which to
me is a bit nutty because it sounds like he misspelled "MUST".
Unfortunately that misspelling came out as "SHOULD", and we're left
with the fact that his intention has no relevance.

> You see, one thing strikes me about CouchDB or any other JSON database.
> There are two draft proposals,
>   http://tools.ietf.org/html/draft-pbryan-zyp-json-pointer-00
>     JSON pointers.
>   http://tools.ietf.org/html/draft-pbryan-json-patch-01
>     JSON patches:
> which seem like things that a JSON database might need to support:
> "give me this part of this thing" and
> "make these changes to these parts of this thing"
> and I do not see how they can possibly be expected to work if an object
> has duplicate keys.

The fact that those are drafts and haven't been widely adopted should
be an indicator of their level of maturity. Also, how would they not
work with duplicate keys? For instance, the pointer proposal (which
the patch proposal uses) might as well use the proplists module when
recursing into the JSON object. It'll pick the first occurrence. If
someone said "That's a bug!" I would say "Stop repeating keys!".

>> I kinda sorta agree. But I would argue that we must absolutely write
>> any sort of spec on numbers in terms of contemplating complete
>> ignorance of the source of the JSON. The proposition for saying "parse
>> these numbers in such and such a way" promotes the implicit assumption
>> of "I know what this data is".
>
> There are at least two different "styles" of using JSON data.
> (1) JSON database: take what you are given and give it back
> (2) JSON protocol: here is a message to be acted on, I know exactly
>    what I'm supposed to do with it.  (E.g., json-rpc.org.)
>
> If you are using a JSON protocol, you had *BETTER* know what the data are.
>

I would rephrase this as "you had *BETTER* know what to *EXPECT* from
the data". Protocols still need to test that their inputs match their
expectations just like anyone else.
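
For instance (a sketch; handle/1 is hypothetical and assumes objects
decode to proplists with binary keys):

    %% Check the decoded input against the protocol's expectations
    %% before acting on it; reject everything else.
    handle(Decoded) when is_list(Decoded) ->
        case proplists:get_value(<<"method">>, Decoded) of
            Method when is_binary(Method) -> {ok, Method};
            _                             -> {error, invalid_request}
        end;
    handle(_) ->
        {error, invalid_request}.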

>> As a core library I don't think that's
>> a point of view that should be promoted.
>
> I do not think that a core library should be forcing copying-and-conversion
> when the client IS in a position to say what it is expecting.
>
> I note, for example, that the Go library for JSON has as one of its
> modes of operation "decode a JSON term into this Go object right here".
>
>>> The *right* answer of course is to use frames (Joe Armstrong's
>>> "proper structs"), so it *should* be <{}>.
>>>
>>
>> I admit that I haven't worked with Erlang full time. I've probably
>> been using it in a hobby capacity for about three years now. Perhaps
>> I'm missing something huge but I can't even attempt to say what a
>> "frame" or "proper struct" or "<{}>" means. If its the proposed syntax
>> for 'Python dicts' (no better reference point) then cool.
>
> Frames/structs have roots older than Python (the appropriate reference
> might well be Ait-Kaci's LIFE and psi-terms).  But yes, except immutable
> like all Erlang data.  I was taking the opportunity to plug this once
> again: it really is time those were in the language.
>
>> Bottom line is that as someone not super familiar with Erlang, I tend
>> to think that V =:= term_to_binary(binary_to_term(V)) would evaluate
>> to true.
>
> Why would you think that?  It has never been so.
>
> 1> V = <<>>.
> <<>>
> 2> V =:= term_to_binary(binary_to_term(V)).
> ** exception error: bad argument
>     in function  binary_to_term/1
>        called as binary_to_term(<<>>)
> 3>
>
>>  I'm sure there are cases that can be constructed that violate
>> that, but in general I expect things going through the external term
>> format to come back identical.
>
> That's a DIFFERENT expression:
>
> T =:= binary_to_term(term_to_binary(T)).
>

Right, I swapped that by accident. Are there examples that violate this version?

>> That could surely be a misinterpretation, but in my experience the
>> "to_term" connotation denotes something more than an "encoding".
>
> What experience are you talking about here?
> The only "to_term" function I can find in stdlib or kernel is
> binary_to_term/1, which is not total, and is nothing other than
> a decoding.
>>
>> An alternative way to put it would be that not everyone knows the
>> Prolog history. I see what term_to_binary does, and I wouldn't blame
>> people for being misguided by a "term_to_json".
>
> term_to_binary/1 is total, and is nothing more than an encoding.
>
> atom_to_binary/2 is not in principle total, and is nothing more than
> an encoding.
>
> atom_to_list/1 is total, and is nothing more than an encoding.
>
> binary_to_atom/2 is not total, and is nothing more than a decoding.
>
> binary_to_list/1 is total.
> list_to_binary/1 is not.
>
> binary_to_term/2 is not total.
>
> float_to_list/1 is total (on floats), but
> list_to_float/1 is not.
>
> integer_to_list/[1,2] are total (on integers), but
> list_to_integer/[1,2] are not.
>
> erlang:ref_to_list/1 doesn't have a documented
> erlang:list_to_ref/1, and trying it, there doesn't
> seem to be an undocumented one either.
>
> and so it goes.  "_to_" does not, *IN ERLANG*, connote isomorphism
> or totality or anything other than some sort of partial one-way
> conversion.
>

Perhaps my personal experience is a bit tilted towards term_to_binary,
but I tend to wonder how many people are going to ask questions like
"Why can't this tuple be converted to JSON?" or something similar. Or,
in other words, is "term_to_json" a bit of a lie? Because what we
really mean is "subset_of_erlang_term_arrangements_that_can_be_converted_to_json_to_json",
which is slightly more wordy.
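
For instance (hypothetical calls; whichever representation is chosen,
some terms simply have no JSON analogue):

    json:encode({1, 2, 3}).    %% what JSON value is a tuple?
    json:encode(self()).       %% or a pid, a port, a fun, ...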

>
>>>> For instance, consider pattern matching a term returned from
>>>> json:decode. One of the proposals is to convert keys to atoms when
>>>> they can be converted. This means your return value might have some
>>>> keys as binaries and some as atoms.
>>>
>>> And if you have an atom in a pattern, that key will be that atom.
>>
>> Sure, but how do you write your pattern matching and guards?
>
> In the usual way.

For example?

>
>> Including
>> a "is_atom" guard could cause functions to fail unexpectedly if
>> someone sent a unicode string. How do you explain to someone not
>> familiar with the situation why a key isn't an atom in that case?
>
> If someone isn't familiar with the situation, why do they want to
> know, and would they understand the explanation if offered one?
>
> What I am saying is that IF you write a pattern containing
> explicit atoms AND you ask for keys to be converted using
> list_to_existing_atom/1 THEN you won't get any nasty surprises.
> Keys that you DON'T mention might arrive as atoms or might
> not, but if you knew enough about them to care, you would
> mention them somewhere in your program, and there would be no
> problem.
>

I've had this discussion elsewhere but never found a solution: what
happens if the JSON was decoded before your module was loaded? Say
"mymodule:do_something(json:decode(Stuff))" is the first call into
mymodule, so none of the atoms it mentions exist when the decode runs.

What about code reloading? Someone adds clauses to a function,
reloads a gen_server, and messages already sitting in its mailbox
contain JSON that was decoded before the new atoms existed?

These are the types of corner cases that seem like they would bite
even seasoned veterans once or twice before they just made sure to
always use binaries. Given that, why not make keys binaries, and let
people who want atoms convert (easily, at that)?
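
A quick shell session shows the hazard (using the binary flavour of
the BIF; the atom name is arbitrary):

1> binary_to_existing_atom(<<"frobnicate">>, utf8).
** exception error: bad argument
     in function  binary_to_existing_atom/2
        called as binary_to_existing_atom(<<"frobnicate">>,utf8)
2> frobnicate.     %% merely mentioning the atom creates it
frobnicate
3> binary_to_existing_atom(<<"frobnicate">>, utf8).
frobnicate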

> Again, if you are thinking of JSON in terms of *protocols*, this is
> a good approach.  If you are thinking of JSON in terms of data for
> a data base, not so.  But neither of these views should be privileged
> over the other, which is why the EEP offers more than one approach.
>
>

I understand what you're saying, but I disagree with it. There's no
reason that atoms are any better for protocols. The only thing you
gain is not having to wrap keys in <<"...">> in your pattern
matching, as long as you hope to never hit a weird corner case with
the atom table.
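
For comparison, matching binary keys directly is hardly heavy (a
sketch over the proplist representation; get_name/1 is my name):

    %% Walks the proplist and never touches the atom table.
    get_name([{<<"name">>, Name} | _]) -> Name;
    get_name([_ | Rest])               -> get_name(Rest);
    get_name([])                       -> undefined.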

Thanks,
Paul Davis


