[eeps] JSON

Paul Davis paul.joseph.davis@REDACTED
Tue Jul 5 04:09:05 CEST 2011


On Mon, Jul 4, 2011 at 9:23 PM, Richard O'Keefe <ok@REDACTED> wrote:
>
> On 5/07/2011, at 5:39 AM, Paul Davis wrote:
>>> Why strain out the gnat of unquoted keys and swallow the camel of
>>> duplicate keys?  If you want ambiguous illegal JSON quietly accepted,
>>> why not want unambiguous illegal JSON quietly extended too?
>>>
>>> Why is it important to accept JSON data with duplicate keys?
>>> How common is such data?
>>> On what grounds can you be sure that the originator intended the
>>> rightmost binding to be used?  (If so, why did they put a duplicate
>>> there in the first place?)
>>>
>>
>> No. You're twisting the spec here.
>
> What spec?
>
>> The spec specifically relaxes the
>> constraint on whether keys are unique.
>
> What spec and where?  RFC 4627 says that keys SHOULD be unique.
> The author of the RFC says that it SHOULD read that they MUST be unique.
>

I really can't say that I care what he was thinking when he wrote it.
He could've been thinking about squirrels playing in the tree.

>> Thus we are well within our
>> means to not enforce uniqueness. On the other hand, the spec
>> specifically says that keys are quoted.
>
> The JSON spec (RFC 4627) says explicitly that
>        A JSON parser MAY accept non-JSON forms or extensions.
> That means we MAY accept unquoted keys; we MAY accept comments;
> we MAY accept Lisp S-expressions as well.  What we must not do is
> GENERATE any of those things.
>
>>
>> More importantly, Python, Ruby, and even JavaScript's native JSON
>> parser reject unquoted keys.
>
> So what?  There is no disagreement that we must not GENERATE them.
>
> But by the same token, we should not GENERATE duplicate keys either.
> The give-back-what-you-get argument falls to the ground there,
> because whatever we get in the input, we should never be giving back
> anything with duplicate keys.

I'm not sure how you got here. The wording is quite clear that we are
within the limits of RFC 4627 to accept and generate duplicate keys.
Arguing that Crockford wasn't thinking about squirrels and was
thinking about unique keys doesn't provide a solid argument for adopting
this particular constraint. There may be other arguments to include
it, but right now I don't see a clear winner amongst any of the
possible behaviors that have been discussed other than supporting all
of them and making it configurable.
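
Just to make "configurable" concrete, here's a rough sketch of what a
repeated-key policy could look like once a decoder has the raw
key/value pairs for one object in hand. The function name and the
plain {Key, Value} pair list are placeholders for illustration, not a
proposed API:

    %% Sketch only: apply a {repeats, first | last | error} style
    %% policy to the pairs of a single decoded object, in input order.
    apply_repeats(Pairs, first) ->
        %% keep the first occurrence of each key
        lists:foldl(fun({K, V}, Acc) ->
                            case lists:keymember(K, 1, Acc) of
                                true  -> Acc;
                                false -> Acc ++ [{K, V}]
                            end
                    end, [], Pairs);
    apply_repeats(Pairs, last) ->
        %% later occurrences replace earlier ones
        lists:foldl(fun({K, V}, Acc) ->
                            lists:keystore(K, 1, Acc, {K, V})
                    end, [], Pairs);
    apply_repeats(Pairs, error) ->
        Keys = [K || {K, _} <- Pairs],
        case length(Keys) =:= length(lists:usort(Keys)) of
            true  -> Pairs;
            false -> erlang:error(duplicate_keys)
        end.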

>
> I note that Javascript and Python accept several things that are
> not legal according to the JSON RFC.  That's fine.  They are allowed
> to support extensions.
>
> I note also that Python's json module is perfectly happy to GENERATE
> illegal JSON:
>        >>> json.dumps(12)
>        '12'
> not legal JSON
>        >>> json.dumps([1.2e500])
>        '[Infinity]'
> not legal JSON
>

You're missing a key qualifier.

json.dumps(12) is not a legal JSON text, but it is a valid JSON
value. Do you want a parser that can't parse all JSON values?

I note that Python and JavaScript's JSON.parse both accept JSON
values, whereas Ruby only accepts JSON texts.

The [Infinity] output is just plain wrong; you should file a bug report.
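
For anyone following along, RFC 4627 defines a "JSON text" as an
object or array at the top level; numbers, strings, booleans and null
on their own are only JSON values. A tiny sketch of the distinction,
assuming a representation with objects as {struct, Pairs}, arrays as
lists, and strings as binaries:

    %% Sketch: is a decoded term an RFC 4627 "JSON text" (object or
    %% array at the top level), or merely a JSON value?
    is_json_text({struct, _Pairs})        -> true;    % object
    is_json_text(List) when is_list(List) -> true;    % array
    is_json_text(_Value)                  -> false.   % value only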

> It is OK to ACCEPT such extensions; it is not OK to GENERATE them.
>>
>> In the end it really doesn't matter. Simply because everyone else does
>> things slightly differently we are not bound to an expected behavior
>> here. If no one can depend on the behavior of which key is chosen then
>> I don't see how we can be considered wrong by choosing any approach.
>
> And that includes raising an exception.
>
>>
>>>
>>>>> (2) Add another option {repeats,first|last|error} with default last.
>>>>>    This would be my preference.
>>>>
>>>> My only experience is CouchDB related. We pretty much ignore repeated
>>>> keys.
>>>
>>> But you cannot ignore them.  You have to do *something* with them.
>>> Take the first, take the last, keep them all, none of these counts as
>>> *ignoring* repeated keys.
>>>
>>
>> Sure you can. They roundtrip through just fine.
>
> What you are describing is precisely one of the options that I described
> as *NOT* counting as *ignoring* repeated keys.

I'm sorry; when you said "ignore" I thought, "Do I do anything
special with such keys anywhere in the code path from input bytes to
output bytes?" and came up with "Nope," which I equated with ignoring.

>
> Thing is, if you let repeated keys in, and then give them back,
> *YOU* are then generating (what was meant to be) illegal JSON.
>
> Tell you what.  Why not fish around in some CouchDB stores and see just
> how many repeated keys there are?
>

Probably very few. But seeing as the spec allows this behavior, who cares?

> I repeat what I wrote earlier:  people are going to start to expect
> JSON data bases to support JSON pointers and JSON patches.  (Instead
> of retrieving a possibly large object, making a change, and sending
> it back, just sending a patch has to involve fewer packets, right?)
> And that just plain cannot work sensibly if keys are non-unique.
>
> Yes, these are drafts.  But XPath was a draft once.  RFC 5621 was a
> draft once.  RFC 5789 was a draft once.  Given the HTTP PATCH
> framework (RFC 5789), it's pretty clear that people putting and
> getting JSON data will see the usefulness of being able to patch
> it, and while the details of the operations may be subject to
> change, just as it was always obvious that XML PATCH (RFC 5621)
> would be based on XPath, it's clear that JSON PATCH will be based
> on something close enough to JSON pointers to call it brother.
>

I've seen people try to write these or similar RFCs for three years
without anything gaining traction. If and when one does gain traction,
I will consider its implications for JSON support. Also, since RFC
4627 technically does allow duplicate keys, these specs should
probably address that.

>> The fact that those are drafts and haven't been widely adopted should
>> be an indicator of their level of maturity. Also, how would they not
>> work with duplicate keys?
>
> Because we can only be sure that the sender's idea and the receiver's
> idea of which subterm is identified by a particular key coincide
> when there is only one occurrence of that key.  If the sender thinks
> that /foo identifies the first "foo": and the receiver thinks that
> it identifies the last one, it's not going to work.
>
>> For instance, the pointer proposal (that
>> patch uses) might as well use the proplists modules when recursing
>> into the JSON object. It'll pick the first occurrence. If someone said
>> "That's a bug!" I would say "Stop repeating keys!".
>
> And they will say "Then why in the name of common human decency didn't
> you tell me LAST year when I entered the data with repeated keys?  How
> am I supposed to fix it NOW?"
>

"Do it the same way you did it last year," would be the obvious
solution I suppose.
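
And to be concrete about the proplists point above: recursing with
proplists:get_value/2 resolves a repeated key to its first occurrence
with no extra code at all. A minimal sketch of a pointer-style lookup,
again assuming objects decode to {struct, Pairs}:

    %% Sketch: walk a list of path segments into a decoded object,
    %% taking the first occurrence of any repeated key.
    lookup([], Value) ->
        Value;
    lookup([Key | Rest], {struct, Pairs}) ->
        lookup(Rest, proplists:get_value(Key, Pairs)).

    %% lookup([<<"foo">>], {struct, [{<<"foo">>, 1}, {<<"foo">>, 2}]})
    %% returns 1: proplists:get_value/2 stops at the first match.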

>>
>> Perhaps my personal experience is a bit tilted towards term_to_binary,
>> but I tend to wonder how many people are going to ask questions like
>> "Why can't this tuple be converted to JSON?" or something similar.
>
> And they will immediately get the very simple answer:
>
>        "Because it does not correspond to any legal JSON datum."
>
> I just gave a host of examples to point out that "X_to_Y" does *NOT*
> in Erlang imply a total function and never has.  The failure of
>
>        term_to_json(fun self/0)
>
> should be no more surprising than the failure of
>
>        list_to_atom([fun self/0])
>
> [fun self/0] _is_ after all a perfectly good list.  It should certainly
> be no more surprising than the failure of
>
>        binary_to_term(<<>>)
>
>> Or in other words, is "term_to_json" a bit of a lie?
>
> No.
> You sound just like my younger daughter.
> When I said to her yesterday, "I promise that I'll look up the price
> of <<something>> tomorrow", she said "you might be run over, so you
> can't promise".
>
> The general rule for X_to_Y(T) is something like
>        if T is an X such that there is a defined representation
>        of T as a Y, answer that representation, otherwise raise
>        an exception.
> The name term_to_json/1 fits *perfectly* into that scheme.
>
> If you want to reject term_to_json/1, you will have to reject
> binary_to_term/1.  Good luck with _that_!
>

And you sound like my crotchety grandpa yelling about how they don't
make function names like they used to.

I prefaced this very specifically with "in my experience" because I
was trying to say, "I thought this; it is not something I would find
terribly surprising for other people to think; perhaps we should
change it in an attempt to avoid such confusion."
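
For reference, the convention being described looks roughly like the
following; a deliberately tiny sketch, not a complete encoder:

    %% Sketch: term_to_json/1 in the partial "X_to_Y" style.  Terms
    %% with no defined JSON representation raise badarg, in the same
    %% way that binary_to_term(<<>>) fails.
    term_to_json(null)  -> <<"null">>;
    term_to_json(true)  -> <<"true">>;
    term_to_json(false) -> <<"false">>;
    term_to_json(N) when is_integer(N) ->
        list_to_binary(integer_to_list(N));
    term_to_json(_Other) ->
        %% funs, pids, ports, refs, ...: no JSON representation
        erlang:error(badarg).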

>>>>> And if you have an atom in a pattern, that key will be that atom.
>>>>
>>>> Sure, but how do you write your pattern matching and guards?
>>>
>>> In the usual way.
>>
>> For example?
>
> I really don't get you here.  The whole point is that the
> keys->existing atoms and the keys->atoms options would have
> you write the *SAME* code for matching specific keys explicitly.
> (Provided, as Jeff Schultz has pointed out privately, that you
> load the relevant modules before decoding the data in question.)
>
>> I've had this discussion elsewhere but never found a solution, but,
>> what happens if the JSON was decoded before your module was loaded?
>> Something like "mymodule:do_something(json:decode(Stuff))" is the
>> first call to mymodule?
>
> It's not a question of it being the first _call_ to a module,
> the Stuff has to be decoded before the module is _loaded_.
>
> If you are using JSON for protocols, you don't do that.  You load
> your applications, and then you start shunting messages around.
>
> In short, the keys->existing atoms option is *NOT* appropriate
> for use cases that involve storing decoded JSON.  I would not
> expect CouchDB to have any use for it ever.
>
>>
>> What about code reloading? Someone adds clauses to a function spec,
>> reloads a gen server, and messages in its mailbox had already been
>> decoded JSON from some source?
>
> Again, I never suggested keys->existing atoms as the ONLY option.
> It is the programmer's responsibility to make appropriate choices.
> Reloading is not a problem unless
> (1) the new module mentions certain atoms and expects them to be
>    the result of key conversion and
> (2) the old module did not mention those atoms and
> (3) there are values decoded in the old state which the new module
>    needs to look at.
>
> In that situation, yes, it is a problem, but if you are using JSON
> for protocols, you receive a JSON datum and immediately transform
> it to something else, and it is the something else that you might
> store.  *Transport* formats very seldom make good *processing*
> formats.  For that kind of use of JSON data, you would never dream
> of storing any of it, and point (3) never happens.
>
> keys->existing atoms is appropriate for *THAT* kind of use.
>
> It's an option.  There are other kinds of uses, for which other
> options are more appropriate.  Big deal.  I repeat that the
> "data base" use of JSON must not be privileged over the "protocol"
> use, and the "protocol" use must not be privileged over the
> "data base" use.
>
>>
>> I understand what you're saying but I disagree with it. There's no
>> reason that atoms are any better for protocols or not. The only thing
>> that happens is you don't have to add <<"">> to your pattern matching
>
> No, that's not the only thing.
> Although if it were, it would not be a small issue.
> Code readability *matters*.
> Matching for atoms is faster.
> We can do type checking with atoms that we cannot do with binaries.
>
> We wouldn't even be having this particular debate if Erlang's atom
> handling were fixed.  Sigh.  I'd really like to see that done first,
> because it is a serious vulnerability.
>
>

I definitely agree that fixing the atom table would be even better. If
that were to come to pass I would be wholeheartedly in favor of keys
to atoms, but until then I'm still in favor of binaries because of the
opportunity for errors, especially the ones that implicitly depend on
system state.
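
To be concrete about the system-state point: a keys->existing atoms
conversion only succeeds for atoms that some already-loaded module
happens to mention, as in this rough sketch (falling back to the
binary is just one possible choice, not a proposal):

    %% Sketch: convert an object key only if the atom already exists,
    %% i.e. only if some loaded code mentions it.  Whether this
    %% succeeds therefore depends on what is loaded at decode time.
    key_to_atom(Bin) when is_binary(Bin) ->
        try
            binary_to_existing_atom(Bin, utf8)
        catch
            error:badarg ->
                Bin   % unknown atom: keep the key as a binary
        end.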


