[eeps] JSON

Richard O'Keefe ok@REDACTED
Tue Jul 5 03:23:00 CEST 2011


On 5/07/2011, at 5:39 AM, Paul Davis wrote:
>> Why strain out the gnat of unquoted keys and swallow the camel of
>> duplicate keys?  If you want ambiguous illegal JSON quietly accepted,
>> why not want unambiguous illegal JSON quietly extended too?
>> 
>> Why is it important to accept JSON data with duplicate keys?
>> How common is such data?
>> On what grounds can you be sure that the originator intended the
>> rightmost binding to be used?  (If so, why did they put a duplicate
>> there in the first place?)
>> 
> 
> No. You're twisting the spec here.

What spec?

> The spec specifically relaxes the
> constraint on whether keys are unique.

What spec and where?  RFC 4627 says that keys SHOULD be unique.
The author of the RFC says that it SHOULD read that they MUST be unique.

> Thus we are well within our
> means to not enforce uniqueness. On the other hand, the spec
> specifically says that keys are quoted.

The JSON spec (RFC 4627) says explicitly that
	A JSON parser MAY accept non-JSON forms or extensions.
That means we MAY accept unquoted keys; we MAY accept comments;
we MAY accept Lisp S-expressions as well.  What we must not do is
GENERATE any of those things.

> 
> More importantly, Python, Ruby, and even JavaScript's native JSON
> parser reject unquoted keys.

So what?  There is no disagreement that we must not GENERATE them.

But by the same token, we should not GENERATE duplicate keys either.
The give-back-what-you-get argument falls to the ground there,
because whatever we get in the input, we should never be giving back
anything with duplicate keys.

I note that JavaScript and Python accept several things that are
not legal according to the JSON RFC.  That's fine.  They are allowed
to support extensions.

I note also that Python's json module is perfectly happy to GENERATE
illegal JSON:
	>>> json.dumps(12)
	'12'
not legal JSON
	>>> json.dumps([1.2e500])
	'[Infinity]'
not legal JSON

It is OK to ACCEPT such extensions; it is not OK to GENERATE them.
> 
> In the end it really doesn't matter. Simply because everyone else does
> things slightly differently we are not bound to an expected behavior
> here. If no one can depend on the behavior of which key is chosen then
> I don't see how we can be considered wrong by choosing any approach.

And that includes raising an exception.

> 
>> 
>>>> (2) Add another option {repeats,first|last|error} with default last.
>>>>    This would be my preference.
>>> 
>>> My only experience is CouchDB related. We pretty much ignore repeated
>>> keys.
>> 
>> But you cannot ignore them.  You have to do *something* with them.
>> Take the first, take the last, keep them all, none of these counts as
>> *ignoring* repeated keys.
>> 
> 
> Sure you can. They roundtrip through just fine.

What you are describing is precisely one of the options that I described
as *NOT* counting as *ignoring* repeated keys.

Thing is, if you let repeated keys in, and then give them back,
*YOU* are then generating (what was meant to be) illegal JSON.

Tell you what.  Why not fish around in some CouchDB stores and see just
how many repeated keys there are?

I repeat what I wrote earlier:  people are going to start to expect
JSON data bases to support JSON pointers and JSON patches.  (Instead
of retrieving a possibly large object, making a change, and sending
it back, just sending a patch has to involve fewer packets, right?)
And that just plain cannot work sensibly if keys are non-unique.

Yes, these are drafts.  But XPath was a draft once.  RFC 5261 was a
draft once.  RFC 5789 was a draft once.  Given the HTTP PATCH
framework (RFC 5789), it's pretty clear that people putting and
getting JSON data will see the usefulness of being able to patch
it.  The details of the operations may change, but just as it was
always obvious that XML PATCH (RFC 5261) would be based on XPath,
it's clear that JSON PATCH will be based on something close enough
to JSON pointers to call it brother.

> The fact that those are drafts and haven't been widely adopted should
> be an indicator of their level of maturity. Also, how would they not
> work with duplicate keys?

Because we can only be sure that the sender's idea and the receiver's
idea of which subterm is identified by a particular key coincide
when there is only one occurrence of that key.  If the sender thinks
that /foo identifies the first "foo": and the receiver thinks that
it identifies the last one, it's not going to work.
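
To make that ambiguity concrete (the pointer syntax here is only
illustrative), consider

	{"foo": 1, "foo": 2}

and a pointer such as /foo.  Does it identify the 1 or the 2?  The
sender and the receiver can silently disagree, and nothing in the
datum records which reading was intended.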

> For instance, the pointer proposal (that
> patch uses) might as well use the proplists modules when recursing
> into the JSON object. It'll pick the first occurrence. If someone said
> "That's a bug!" I would say "Stop repeating keys!".

And they will say "Then why in the name of common human decency didn't
you tell me LAST year when I entered the data with repeated keys?  How
am I supposed to fix it NOW?"

> 
> Perhaps my personal experience is a bit tilted towards term_to_binary,
> but I tend to wonder how many people are going to ask questions like
> "Why can't this tuple be converted to JSON?" or something similar.

And they will immediately get the very simple answer:

	"Because it does not correspond to any legal JSON datum."

I just gave a host of examples to point out that in Erlang "X_to_Y"
does *NOT* imply a total function, and never has.  The failure of

	term_to_json(fun self/0)

should be no more surprising than the failure of

	list_to_atom([fun self/0])

[fun self/0] _is_ after all a perfectly good list.  It should certainly
be no more surprising than the failure of

	binary_to_term(<<>>)

> Or in other words is "term_to_json" is a bit of a lie?

No.
You sound just like my younger daughter.
When I said to her yesterday, "I promise that I'll look up the price
of <<something>> tomorrow", she said "you might be run over, so you
can't promise".

The general rule for X_to_Y(T) is something like
	if T is an X such that there is a defined representation
	of T as a Y, answer that representation, otherwise raise
	an exception.
The name term_to_json/1 fits *perfectly* into that scheme.

If you want to reject term_to_json/1, you will have to reject
binary_to_term/1.  Good luck with _that_!
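
To make the general rule concrete, here is a minimal sketch, *not*
the proposed encoder, assuming an EEP-18-style representation in
which an object decodes to {[{Key,Value},...]}; string escaping,
floats and much else are deliberately elided.  The point is only the
shape of the contract: terms with a defined representation are
converted, everything else raises badarg, exactly as list_to_atom/1
and binary_to_term/1 already behave.

	term_to_json({Members}) when is_list(Members) ->      %% object
	    iolist_to_binary([${, join([field(F) || F <- Members]), $}]);
	term_to_json(L) when is_list(L) ->                    %% array
	    iolist_to_binary([$[, join([value(V) || V <- L]), $]]);
	term_to_json(_NoRepresentation) ->
	    erlang:error(badarg).

	field({K, V}) when is_atom(K) ->
	    [value(atom_to_binary(K, utf8)), $:, value(V)];
	field({K, V}) when is_binary(K) ->
	    [value(K), $:, value(V)].

	value(null)                 -> <<"null">>;
	value(true)                 -> <<"true">>;
	value(false)                -> <<"false">>;
	value(N) when is_integer(N) -> integer_to_list(N);
	value(B) when is_binary(B)  -> [$", B, $"];           %% escaping elided
	value(V)                    -> term_to_json(V).       %% nested, or badarg

	join([])       -> [];
	join([X])      -> [X];
	join([X | Xs]) -> [X, $, | join(Xs)].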

>>>> And if you have an atom in a pattern, that key will be that atom.
>>> 
>>> Sure, but how do you write your pattern matching and guards?
>> 
>> In the usual way.
> 
> For example?

I really don't get you here.  The whole point is that the
keys->existing atoms and the keys->atoms options would have
you write the *SAME* code for matching specific keys explicitly.
(Provided, as Jeff Schultz has pointed out privately, that you
load the relevant modules before decoding the data in question.)
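
For what it is worth, here is the kind of clause I have in mind,
again assuming an EEP-18-style decoding in which an object comes
back as {[{Key,Value},...]}.  The code is the same whichever of the
two atom options produced the atoms:

	status({Members}) when is_list(Members) ->
	    case lists:keyfind(status, 1, Members) of
	        {status, S} when is_binary(S) -> S;
	        false                         -> undefined
	    end.

With the binary-keys option the only change is writing <<"status">>
in place of status.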

> I've had this discussion elsewhere but never found a solution, but,
> what happens if the JSON was decoded before your module was loaded?
> Something like "mymodule:do_something(json:decode(Stuff))" is the
> first call to mymodule?

It's not a question of it being the first _call_ to a module;
the Stuff has to be decoded before the module is _loaded_.

If you are using JSON for protocols, you don't do that.  You load
your applications, and then you start shunting messages around.

In short, the keys->existing atoms option is *NOT* appropriate
for use cases that involve storing decoded JSON.  I would not
expect CouchDB to have any use for it ever.

> 
> What about code reloading? Someone adds clauses to a function spec,
> reloads a gen server, and messages in its mailbox had already been
> decoded JSON from some source?

Again, I never suggested keys->existing atoms as the ONLY option.
It is the programmer's responsibility to make appropriate choices.
Reloading is not a problem unless
(1) the new module mentions certain atoms and expects them to be
    the result of key conversion and
(2) the old module did not mention those atoms and
(3) there are values decoded in the old state which the new module
    needs to look at.

In that situation, yes, it is a problem, but if you are using JSON
for protocols, you receive a JSON datum and immediately transform
it to something else, and it is the something else that you might
store.  *Transport* formats very seldom make good *processing*
formats.  For that kind of use of JSON data, you would never dream
of storing any of it, and point (3) never happens.

keys->existing atoms is appropriate for *THAT* kind of use.
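
Concretely, and still assuming the same illustrative
{[{Key,Value},...]} shape, a protocol handler can turn each decoded
message into a record straight away and only ever store the record:

	-record(request, {id, method, params}).

	to_request({Members}) when is_list(Members) ->
	    #request{id     = lookup(id,     Members),
	             method = lookup(method, Members),
	             params = lookup(params, Members)}.

	lookup(Key, Members) ->
	    case lists:keyfind(Key, 1, Members) of
	        {Key, V} -> V;
	        false    -> undefined
	    end.

Once that is done, the rest of the program sees only the record, and
point (3) above never arises.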

It's an option.  There are other kinds of uses, for which other
options are more appropriate.  Big deal.  I repeat that the
"data base" use of JSON must not be privileged over the "protocol"
use, and the "protocol" use must not be privileged over the
"data base" use.

> 
> I understand what you're saying but I disagree with it. There's no
> reason that atoms are any better for protocols or not. The only thing
> that happens is you don't have to add <<"">> to your pattern matching

No, that's not the only thing.
Although if it were, it would not be a small issue.
Code readability *matters*.
Matching for atoms is faster.
We can do type checking with atoms that we cannot do with binaries.
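
For instance (purely illustrative), with atom keys a type can name
the keys themselves, something that simply cannot be written for
particular binaries:

	-type rpc_key() :: id | method | params.

and Dialyzer will then complain about a misspelt key wherever that
type is used.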

We wouldn't even be having this particular debate if Erlang's atom
handling were fixed.  Sigh.  I'd really like to see that done first,
because it is a serious vulnerability.



