[eeps] JSON

Paul Davis paul.joseph.davis@REDACTED
Fri Jul 1 08:05:32 CEST 2011


Richard,

Good points. I'll respond inline.

On Fri, Jul 1, 2011 at 1:12 AM, Richard O'Keefe <ok@REDACTED> wrote:
> As the author of EEP 18 I'd like to respond to this.
>

As a preface I very much appreciate your insight here. I think we more
or less agree on the major points. It's just that when it gets down to
the level of "what bits go where" that things get more interesting.

> On 1/07/2011, at 1:28 PM, Paul Davis wrote:
>
>>
>> Firstly, the statement in 4627, "The names within an object SHOULD be
>> unique." is referenced and the EEP falls on the side of rejecting JSON
>> that has repeated names. Personally I would prefer to accept repeated
>> names because that allows us to capture the lowest common denominator
>> JSON implementation.
>
> The problem here is that there are lots of possibilities:
>  - take the first binding for a name
>  - take the last binding for a name
>  - combine the bindings using a user-specified function
>  - do something weirder
>  - return a list of {Name,Value} pairs in left to right order
>   [not an option if the user wants a dictionary]
>  - return a list of {Name,Value} pairs in right to left order
>  - return a list of {Name,Value} pairs using some other order
>  ...

I think the issue here between our interpretations is that I've not
yet fully captured Erlang zen. In terms of Python it makes solid sense
that the last k/v pair in byte order of the JSON stream is the winner.
In Erlang there's no magical dict type that dictates this. I say that
specifically because there is a dict module *and* a proplist module.
They do similar things but are not inherent to the language the way
Python dicts are.

The one thing I should've pointed out more so here is that I was
focusing on JSON.parse not erroring on this. If there is a reference
implementation of JSON parsing it should probably be the JavaScript
native version. Seeing as it doesn't fail on repeated keys, an Erlang
version probably shouldn't either. Granted, that doesn't dictate how
we deal with such things.
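For what it's worth, JavaScript's JSON.parse is not alone here: Python's
stdlib json module also parses repeated keys without complaint and keeps
the last binding. A quick check (and a peek at what the parser actually
sees before the default dict construction overwrites):

```python
import json

# CPython's stdlib json parser: repeated keys don't raise;
# the last binding in byte order wins.
doc = '{"a": 1, "a": 2, "b": 3}'
print(json.loads(doc))  # {'a': 2, 'b': 3}

# object_pairs_hook shows the parser does see every pair;
# the default dict construction simply overwrites.
print(json.loads(doc, object_pairs_hook=lambda pairs: pairs))
# [('a', 1), ('a', 2), ('b', 3)]
```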

>
>> The other important bit of this is that other
>> languages don't complain on repeated keys. JavaScript, Ruby, and
>> Python all just end up using the last defined value.
>
> In the case of JavaScript, this really does seem to be a property
> of the *language*:
> js> var y = {a: 1, a: 2, b: 3, b: 4};
> js> y.a;
> 2
> js> y.b;
> 4

I'm not entirely certain what you're disagreeing with here. My point
was that the rightmost defined value is what ends up "winning" the
race. If you consider writing a parser that ends up parsing out a key
and value, and then just does something like "obj.store(k, v)" and
doesn't check that it overwrote something, the behavior is
unsurprising.

> In the case of Ruby and Python, it's a property of a library, not
> of the language.  The 'json' module that comes with python does
> this.  But it also does things that I regard as undesirable:
>        json.loads('{a:1}')
> dies horribly with a somewhat misleading error message.  Nothing
> stops there being other JSON parsers for Python (and there are),
> and nothing stops them making another choice.
>

I'm also not sure here. The example you provide is '{a:1}' which is
invalid JSON because the key is not a proper string. Interestingly
enough, the Ruby JSON parser seems to insist on a top-level array or
object, which neither JavaScript nor Python does. JavaScript is the
interesting case because the RFC forbids parsing "1.0" as a JSON text,
yet JSON.parse accepts it.
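Python's parser shows the same divergence from the RFC: it happily
accepts a bare number, string, or boolean at the top level, even though
RFC 4627 defines a JSON text as an object or array:

```python
import json

# RFC 4627 says a JSON text is an object or array, but Python's
# parser (like JSON.parse in current JavaScript engines) accepts
# any top-level value.
print(json.loads('1.0'))    # 1.0
print(json.loads('"hi"'))   # hi
print(json.loads('true'))   # True
```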

> People using JSON to ship around data that originated as Erlang
> property lists would surely be expecting the _first_ value for a
> repeated key to be taken.
>

Something I should've mentioned in my earlier email is that the EEP
should most definitely, under no circumstances, "assume that data
originated from Erlang". IMO that's just conflating many other much
deeper issues. I would propose that the EEP only consider JSON that
someone wrote when they were dehydrated and trying to cuddle a cactus.

> I can see two ways around this.
>
> (1) Follow the herd, and quietly take the last value for a repeated key.
>    I have to admit that the JSON parser I wrote in Smalltalk does
>    exactly this.
> (2) Add another option {repeats,first|last|error} with default last.
>    This would be my preference.

My only experience is CouchDB related. We pretty much ignore repeated
keys. If someone sends us data that has them, we (more or less) store
it and will send it back. We're a bit of an awkward case because we
reserve a key namespace to do this. I don't really have a solid
argument on the repeats issue other than to ask "how do we define this
given that no one else seems to care?"
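To make Richard's option (2) concrete, here is a sketch of what a
{repeats, first|last|error} policy could look like, expressed in Python
via the json module's object_pairs_hook (repeats_hook is a hypothetical
helper, not anything in the EEP; in an Erlang decoder the option would
live at the same point, where key/value pairs are folded into the
result):

```python
import json

def repeats_hook(policy):
    """Hypothetical sketch of an EEP-style {repeats, first|last|error} option."""
    def hook(pairs):
        result = {}
        for key, value in pairs:
            if key in result:
                if policy == "error":
                    raise ValueError("repeated key: %r" % key)
                if policy == "first":
                    continue  # keep the first binding
            result[key] = value  # "last" policy, and first-time keys
        return result
    return hook

doc = '{"a": 1, "a": 2}'
print(json.loads(doc, object_pairs_hook=repeats_hook("last")))   # {'a': 2}
print(json.loads(doc, object_pairs_hook=repeats_hook("first")))  # {'a': 1}
```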

>>
>> The EEP discusses the difference between an integer and a float in
>> multiple places. The language seemed to revolve around being able to
>> detect the difference between a float and an integer. I know that it's
>> technically true that Erlang can detect the difference and JSON can't,
>> but when working with JSON in Erlang it's never been an issue in my
>> experience.
>
> JSON can't detect anything.  The basic problem is that >Javascript<
> has only one type for all numbers.

Sorta kinda. Ignore JavaScript's idiocy on this for a bit. JSON has a
syntax for numbers. Some of them can be interpreted as integers, some
not. I'm only suggesting: "val = can_be_int(Stuff) ? make_int(Stuff) :
make_double(Stuff)".
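That split is exactly what Python's parser does already: a number token
with no fraction or exponent becomes an int, anything else a float:

```python
import json

# Tokens without '.' or an exponent parse as int; others as float.
nums = json.loads('[1, 1.0, 1e2, -3]')
print(nums)                              # [1, 1.0, 100.0, -3]
print([type(n).__name__ for n in nums])  # ['int', 'float', 'float', 'int']
```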

>
> The question here is that if you see [1,1.0] in JSON you do not know
> whether the sender intended them to be the same or different.  Indeed,
> if you see [123456712345671234567, 123456712345671234568] you do not
> know whether the sender intended the numbers to be the same or different.
> It's even the case that if you see [1.0] you do not know whether the
> sender intended that number to be usable as a subscript.
>
> If you are just holding onto the data and then giving it back, no real
> problem.  The problem arises if you do calculations.
>

Exactly. Though my point would be, "if someone gives us data we can
represent as an integer, do it. If someone gives us doubles, give them
no less than they might expect elsewhere."

The bottom line here is that anyone that is actually doing serious
numerical computation knows that doubles are going to have issues.
Unless you are specific down to the level of hardware architectures,
you absolutely must find some other way to transfer data. Erlang JSON
shouldn't feel obliged to fix this.

> Some possibilities include
>  - always read JSON numbers as IEEE doubles; that way you certainly
>   get the same kind and value of number as Javascript would get
>  - read things that look like integers as integers, things that look
>   like floats as floats
>   -- this is what the json module that comes with Python does;
>      it is also what my Smalltalk library does.
>  - read things that have integral values (whether written with an
>   exponent or not, whether with a ".0" or not) as integers, things
>   with a fractional part as floats
>
> I agree completely that "read things that look like integers as integers
> and things that look like floats as floats" seems to be the best
> default.  It's just not the _only_ _sensible_ thing to offer.
>

I kinda sorta agree. But I would argue that any spec on numbers must
be written assuming complete ignorance of the source of the JSON.
Saying "parse these numbers in such and such a way" promotes the
implicit assumption of "I know what this data is". As a core library I
don't think that's a point of view that should be promoted. Also, the
amount of client code needed to do something like "convert all numbers
to floats" is on the order of 7-8 lines. If someone wants to go out of
their way to do something crazy like that, I would prefer not to
promote it.
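In Python, at least, "convert all numbers to floats" is even cheaper
than 7-8 lines; the json module's parse_int hook does it in one, which
supports the point that this belongs in client code rather than in the
core decoder:

```python
import json

# Route every integer literal through float; floats already are floats.
print(json.loads('[1, 2, 3.5]', parse_int=float))  # [1.0, 2.0, 3.5]
```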

>>
>> The most controversial aspect of the EEP in terms of something not
>> caused by RFC 4627 is the representation of an empty object.
>
> The *right* answer of course is to use frames (Joe Armstrong's
> "proper structs"), so it *should* be <{}>.
>

I admit that I haven't worked with Erlang full time. I've probably
been using it in a hobby capacity for about three years now. Perhaps
I'm missing something huge but I can't even attempt to say what a
"frame" or "proper struct" or "<{}>" means. If it's the proposed syntax
for 'Python dicts' (no better reference point) then cool. But I can't
comment directly without a base of reference.

>> Everything in the EEP about encodings should be removed and replaced
>> with "we only support well formed UTF-8."
>
> There are two possible situations.
> (1) You are dealing with *text* data.  In that case, encoding or decoding
>    is somebody else's problem.  A JSON writer needs to know what characters
>    have to be escaped in the output, but that's it.
> (2) You are dealing with *binary* data.  In that case, encoding or
>    decoding is the JSON library's problem.
>
> When the EEP was written, things were a little bit fuzzy.  It is now clear
> that iolists are *binary* data, not text data.  There is now explicit
> Unicode support elsewhere in Erlang, so that conversion to/from UTF-8 can
> be done in a separate step, if needed.  At any rate, UTF-8 is *required* as
> the default.  Other options could be added later if anyone really cares.

I would say that I should've prefaced my earlier bit with "I know
Erlang supports conversion to and from UTF-8" and it'd be trivial to
support a simple guard or 5 to make sure we pass UTF-8 to the actual
implementation. But I get the rage eyes when people talk about Unicode
and JSON so I went off on a tangent earlier.

Also, my earlier point about Unicode is that the JSON RFC punches a
huge hole in the separation of concerns. People tend to think the
encoding issue is just in the detecting phase and we can make that OOB
and transcode to a Rosetta Stone version, but my bigger point is that
the \uHHHH escapes punch a hole through that and try to impose some
really bad requirements on internal string representation.
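Concretely, a \uHHHH escape can be one half of a UTF-16 surrogate pair,
so a decoder has to combine two escapes into a single code point no
matter what its internal string representation is. Python 3's parser
does the pairing:

```python
import json

# U+1D11E (musical G clef) escaped as a UTF-16 surrogate pair.
s = json.loads('"\\ud834\\udd1e"')
print(hex(ord(s)))  # 0x1d11e
print(len(s))       # 1: one code point, despite two escapes
```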

>>
>> Second, the EEP uses the function names term_to_json and json_to_term.
>> I would comment that these infer that the conversion is lossless.
>
> Apart from the fact that it should be "imply", not "infer",
> I deny this.  They come from the Prolog naming tradition and
> simply imply what the input is and what the output is and make
> *no* claim about losslessness.  (If the conversions were lossless,
> the "to_" part would not be present in the name.)
>
> Note, for example, that Erlang already has binary_to_term/1,
> which does NOT convert every possible binary to a term.
>
> However, there is nothing wrong with json:encode and json:decode
> as names either.
>

Apologies for vocab on that one. I rewrote that a few times thinking I
was screwing it up and finally just moved on in my thoughts.

Bottom line is that as someone not super familiar with Erlang, I tend
to think that V =:= binary_to_term(term_to_binary(V)) would evaluate
to true. I'm sure there are cases that can be constructed that violate
that, but in general I expect things going through the external term
format to come back identical.

That could surely be a misinterpretation, but in my experience the
"to_term" name connotes something more than an "encoding".

An alternative way to put it would be that not everyone knows the
Prolog history. I see what term_to_binary does, and I wouldn't blame
people for being misled by a "term_to_json".

>> For instance, consider pattern matching a term returned from
>> json:decode. One of the proposals is to convert keys to atoms when
>> they can be converted. This means your return value might have some
>> keys as binaries and some as atoms.
>
> And if you have an atom in a pattern, that key will be that atom.

Sure, but how do you write your pattern matching and guards? Including
an "is_atom" guard could cause functions to fail unexpectedly if
someone sent a Unicode string. How do you explain to someone not
familiar with the situation why a key isn't an atom in that case? Yes,
"it's obvious" is the correct answer here, but it's important to think
of how many times you have to say "it's obvious."

>
>> If you're writing a function to do
>> some mutation to the returned term that touches the key, it is quite
>> possible that you have to special case both term and atom return
>> types. The other obvious argument (which is detailed in the EEP) is
>> that it's an attack vector by malicious clients. It's possible to send
>> many JSON texts with many different keys that eventually kills the
>> Erlang VM. I'm all for "let it fail" but "here is how to kill me" is
>> probably not good.
>
> The fundamental problem here is the fixed atom table.
> SWI Prolog hasn't had that problem for a long time.
> The Logix implementation of Flat Concurrent Prolog faced it and fixed it.
> It's a serious weakness in Erlang that can and should be fixed.
>
>
>

Well sure. And if I had a mulligan I'd donate it to the cause.

Thanks,
Paul Davis
