[eeps] JSON

Paul Davis paul.joseph.davis@REDACTED
Mon Jul 4 00:32:21 CEST 2011


Alisdair's last response doesn't appear to have been addressed to the
list so I have not snipped any of it in my reply.

On Sun, Jul 3, 2011 at 5:31 PM, Alisdair Sullivan
<alisdairsullivan@REDACTED> wrote:
>
> On 2011-07-03, at 8:39 AM, Paul Davis wrote:
>
>> On Sun, Jul 3, 2011 at 3:04 AM, Alisdair Sullivan
>> <alisdairsullivan@REDACTED> wrote:
>>>
>>> On 2011-07-02, at 9:43 PM, Paul Davis wrote:
>>>
>>>>> The second point on numbers in the EEP is the part about numeric
>>>>> ranges. Obviously we'd like to support as much as we possibly can. The
>>>>> EEP rightly points out that ranges are not specified (explicitly
>>>>> mentioned as unspecified) and that an implementation is free to do as
>>>>> it wants. This is a second vote in favor of "if it looks like an
>>>>> integer, treat it like one" because of the case for bignums. If a
>>>>> client needs to support some mutation of the output of the parser
>>>>> they're more than welcome to write a simple function to do the
>>>>> conversion.
>>>
>>> I am strongly in favour of numerical fidelity. As much accuracy as possible should be retained. UTF-8 encoded binaries serve this end best, but place additional burden on client code as they are required to convert them to a useable form. Most sensible is to convert, where possible, to integers (regardless of whether they contain a decimal point or exponent, or are negative zero) or, failing that, to native erlang floats. Numbers that cannot be converted to either should be returned as UTF-8 encoded binaries.
>>>
>>
>> The only thing I would wonder about in converting numbers to integers
>> when they have an exponent is that some of the straightforward
>> conversion functions would need to be rewritten in C instead of using
>> stdlib calls. It's not a deal breaker by any means, just a possible
>> area for slight differences in behavior to sneak in.
>>
>> As to the "cannot be converted" there's going to be a bit of an issue
>> there as well for changes in behaviour. Specifically, the C stdlib
>> will truncate some numbers without any signal that it did. So either
>> anything in C will have to rewrite these number routines or break this
>> condition.
>>
>> Bottom line, I would like to support as much numeric fidelity as
>> possible, but at a certain point these requirements extend quite far
>> beyond what other data exchange formats would expect.
>> Even scenarios as pervasive as money math require special
>> consideration by end users. In other words, trying too hard might be
>> more of a surprise to users than if we just behave like most other
>> data exchange formats.
>
> Specialized handling is already required to correctly convert some legal JSON numbers to erlang floats. I am not sure why the C stdlib matters, unless you mean the built-ins written in C that currently are used to convert JSON encoded numbers to erlang numbers. With the vagueness of the JSON spec with regard to the legality of various numbers and the limitations on what numbers are representable on various platforms, there will always be incompatibilities and inconsistencies when dealing with heterogeneous environments. Any eep0018 implementation should first and foremost be concerned with what makes most sense for erlang users. That means integers and bignums whenever possible, floats when necessary and a still useable if inconvenient format when both are impossible.
>

The two examples I was considering were these:

Firstly, anything with a decimal point or exponent needs to go through
strtod, as strtol rejects it. Special casing inputs that end with
(.0+)?(E\d+)? or similar means writing extra conversion handling
ourselves. AFAIK even Erlang would require special case checks to
reformat these into something it can handle.
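
To make that concrete, here is a rough sketch (mine, not taken from any
existing library) of the Erlang-side special casing: list_to_integer/1
rejects anything containing a '.' or an exponent, and list_to_float/1
rejects forms like "1e2" because Erlang float syntax requires a
fractional part, so both cases need massaging before conversion.

    %% Rough sketch only: convert a JSON number string to an integer when
    %% the value is integral, otherwise to a float. normalize/1 is a
    %% made-up helper for the reformatting step mentioned above.
    to_number(Str) ->
        case string:to_integer(Str) of
            {Int, []} ->
                Int;
            _ ->
                F = list_to_float(normalize(Str)),
                case trunc(F) == F of
                    true  -> trunc(F);   % integral value, return an integer
                    false -> F
                end
        end.

    %% "1e2" is not a legal Erlang float literal; rewrite it as "1.0e2".
    normalize(Str) ->
        case lists:member($., Str) of
            true  -> Str;
            false -> re:replace(Str, "[eE]", ".0e", [{return, list}])
        end.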

More importantly though, for doubles, we would have to write our own
implementation of strtod (whether at the Erlang or C level) because
it silently truncates significant digits.

For instance, Python and JavaScript both truncate:

     "1.1234567890123456789012345678901234567890"

to:

    1.123456789012345691247674039914

Which is what strtod gives, and strtod gives it without signaling an
error. Thus, if the EEP includes "Can't be represented as a float or
double, then return as binary" we'll have to implement our own strtod
equivalent.
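
For what it's worth, the same silent rounding happens if we lean on
Erlang's own conversion, so this isn't C-specific (the exact digits
printed below depend on the runtime's float formatting):

    1> list_to_float("1.1234567890123456789012345678901234567890").
    1.1234567890123457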

>>
>>>
>>>>> The most controversial aspect of the EEP in terms of something not
>>>>> caused by RFC 4627 is the representation of an empty object. Alisdair
>>>>> Sullivan (the author of jsx) and I have spent many hours arguing over
>>>>> this exact detail. If you find us bored on IRC at the same time, we
>>>>> generally agree to start this argument out of good nature just to
>>>>> amuse ourselves as a continuing joke. But the bottom line is that
>>>>> we're both familiar with the issue, the ups and downs of what we
>>>>> prefer and still cannot come to agreement on it. Mandating a specific
>>>>> one is going to upset some people and please others. Making it
>>>>> optional is a fine compromise, but then we'll argue over the default.
>>>>> I obviously can't offer much in the way of advice on this issue except
>>>>> to say that I am right and the one true object representation is {[]}.
>>>>> XD
>>>
>>> [{}] is obviously the superior representation. It has the advantage that it is itself a valid proplist (with no useable members), distinguishable from the empty array but still useable without first checking it is non-empty. {[]} requires an extra check every time you might want to operate on a json object with the proplists module.
>>>
>>
>> I'm not sure what you mean on these points. There are two general
>> patterns used extensively throughout the CouchDB code base. The first
>> is a simple function that operates on JSON values. Say, multiply all
>> numbers by 2.
>>
>> mult2(N) when is_number(N) ->
>>    N * 2;
>> mult2(Vals) when is_list(Vals) ->
>>    lists:map(fun mult2/1, Vals);
>> mult2({Vals}) when is_list(Vals) ->
>>    {lists:map(fun({K, V}) -> {K, mult2(V)} end, Vals)};
>> mult2(Else) ->
>>    Else.
>>
>> The other is when you have a JSON object and you'd like to get a sub-object:
>>
>>    {Props} = proplists:get_value(views, DesignDoc),
>>
>> To me this is more elegant and natural than trying to pattern match in
>> function clauses on [{_, _} | _] or [{}] and similar.
>>
>> Also, "used as a proplist without checking" scares me because the
>> proplists module also doesn't throw errors if it's applied to a decoded
>> array. So I would argue that users *should* check that they're getting
>> what they expect instead of blindly applying functions to decoded
>> objects. The 1-tuple makes these checks easier than the alternate
>> version IMO.
>
> Given that you know a particular bit of JSON is an object, it is much more convenient to be able to do:
>
> proplists:get_value(foo, JSON)
>
> rather than first having to match out the containing tuple. If you do have an empty object, you'll get the (expected) 'undefined'.
>

When would you know something is an object without a chance to pattern
match away the containing tuple? I would argue that forcing the
pattern match is going to help more because it'll catch when someone
sends you something funny.
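
As a small illustration (a sketch with a made-up key; decoded keys here
are assumed to be binaries), matching out the wrapper up front fails
fast with a badmatch if the decoded term turns out not to be an object:

    views(DecodedDoc) ->
        {Props} = DecodedDoc,    % badmatch if this isn't an object
        proplists:get_value(<<"views">>, Props).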

Also, mochijson2 has a callback for handling the creation of objects
from lists of k/v pairs. Perhaps we can just support that to
short-circuit this debate? (Of course, defaulting to {[]} :D)
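
For illustration only (none of these names are mochijson2's actual
options), the shape I have in mind is a constructor fun that the
decoder applies to each finished list of key/value pairs, so both camps
can get the representation they want:

    %% Hypothetical constructors a decoder option could accept. The first
    %% is what I'd ship as the default, the second is what the [{}] camp
    %% would pass in.
    default_object(Pairs)  -> {Pairs}.          % {[]} for the empty object

    proplist_object([])    -> [{}];
    proplist_object(Pairs) -> Pairs.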

>>
>>>
>>>>> Now we get to the crazy.
>>>>>
>>>>> Everything in the EEP about encodings should be removed and replaced
>>>>> with "we only support well formed UTF-8." I'm aware that's a bit
>>>>> controversial so let me try and explain without too much of a rant
>>>>> against 4627.
>>>
>>>>> The bottom line is that JSON has no reliable inband method for
>>>>> determining character encoding. The only way to make it work sanely is
>>>>> to declare one and stick to it. Fortunately, most implementations seem
>>>>> to follow along only supporting UTF-8. jsx is the only library I know
>>>>> that attempts to support the spec point by point on this (disregarding
>>>>> languages that have a notion of Unicode strings as separate from
>>>>> normal strings).
>>>
>>> All legal JSON is unambiguously identifiable when restricted to only handling utf-8, utf-16le/be and utf-32le/be. It's possible UTF-EBCDIC is identifiable also, but I have no experience with it and have made no attempt to handle it in jsx. I see no reason not to support all five major UTF variants.
>>>
>>
>> The detail we disagree on here is that the spec says "All legal JSON
>> texts" which is different from "all legal JSON values". A JSON text
>> being defined as an array or object. If we restrict the parser to all
>> "JSON texts" as the RFC species, then yes, we can do icky icky things
>> to "guess" the encoding.
>>
>> Although, I would point out that Python and JavaScript will parse JSON
>> values, though Ruby apparently won't (at least with my limited
>> knowledge of that library). And once you allow values the "encoding
>> guess" hack becomes invalidated.
>
> I agree that all JSON texts should be either objects or arrays, but in general use almost all parsers accept naked JSON values as valid input. Most users expect their parsers to handle this. Allowing values only introduces one problem with guessing encoding that I am aware of, the single digits '0', '1', ..., '9' may be ambiguous when a streaming parser (such as jsx) receives only a single byte representing one of those digits. This could be a complete UTF-8 value or the leading byte of a UTF-16/32 little endian value. As naked numbers are ambiguous anyways in a streaming context (you can always add another digit to a number and still have a legal number) jsx requires users to signal end of stream when this potential ambiguity exists, sidestepping the issue. In a value context, where all input is known at runtime, there is no ambiguity and the only possible problem is mistakenly interpreting partial UTF-16/32 little endian values as complete UTF-8 values. Note that this problem is present even in the absence of detecting encodings.
>
> All other naked values have known, unambiguous byte orders that can be used for detection without any issues.
>

You misunderstood my argument I think. My point was that we *should*
be parsing values and this breaks the guessing game. I haven't studied
the exact bit patterns, but I would think that bare strings whose first
character was not ASCII could possibly break the guessing
machinery. I can certainly come up with trivial examples that break
the truth table listed in the RFC, but perhaps it's just
underspecified. I also don't see how this would be a problem if we
just dictated that all data passed to the parser be in a specific
encoding and threw an error if it's not.
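
For reference, the RFC's truth table is easy enough to write down (my
transcription below); it only works if the first two characters are
ASCII, which JSON texts guarantee but naked values do not. For example,
the naked string "中" in UTF-16BE begins <<0, 16#22, 16#4E, 16#2D, ...>>,
matches none of the multi-byte patterns, and gets misdetected as UTF-8:

    guess_encoding(<<0, 0, 0, _/binary>>)    -> utf32be;
    guess_encoding(<<0, _, 0, _, _/binary>>) -> utf16be;
    guess_encoding(<<_, 0, 0, 0, _/binary>>) -> utf32le;
    guess_encoding(<<_, 0, _, 0, _/binary>>) -> utf16le;
    guess_encoding(_)                        -> utf8.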

>
>>
>>>
>>>>> Now if that's not crazy enough we get to the \uHHHH escape sequence
>>>>> definition. RFC 4627 says that strings may contain a \u escape
>>>>> followed by four hex characters. The RFC has some language about the
>>>>> Basic Multilingual Plane and an example of surrogate pairs but fails to
>>>>> cover the various ways this might break. For instance what happens if
>>>>> one of a surrogate pair is present without an appropriate mate?
>>>
>>>>> I was told specifically by Douglas Crockford in a thread on
>>>>> es5-discuss that implementations are expected to treat these escapes
>>>>> as bytes. There's no easy way to say it, this is just nuts. This
>>>>> requirement means that a conforming JSON parser must allow string data
>>>>> into Erlang terms that would raise exceptions when passed through the
>>>>> functions in the unicode module (this has bitten us in CouchDB
>>>>> multiple times).
>>>
>>> I believe Douglas Crockford is wrong on this point. JSON containing escape sequences that would result in illegal codepoint sequences should be invalid. In the interests of pragmatism and in an effort to maintain compatibility with JavaScript, invalid escape sequences should be ignored and treated as plain sequences of characters. That is, the JSON string "\uD800blahblah" should be converted to the erlang form <<"\\uD800blahblah">>. Properly formed surrogate pairs can of course be converted to the appropriate unicode codepoint.
>>>
>>> This is probably a violation of the JSON spec, strictly speaking, but seemed the most sane of the possible compromises. Possibly an option could be added to reject JSON forms that contain invalid escapes.
>>>
>>
>> I would be willing to implement both of these options. In Jiffy I went
>> with rejecting via badarg, but having an option to choose the
>> behaviour seems sane to me.
>
> Another possible approach I failed to mention is to replace invalid or problematic escape sequences with the Unicode replacement codepoint, U+FFFD. This follows the JSON and Unicode spec recommendations, but loses potentially significant user data.
>

Yeah, my point of view is from a database author, so my stance is
towards "give back what was given" as much as possible so I never
considered the replacement scheme there. Adding it as an option seems
sane, though I could see there being a couple of corner cases with
combining characters (i.e., two high byte combining characters in a row:
is that one or two U+FFFD insertions?).
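
If we did add the option, I'd imagine something along these lines
(option name and atoms invented for the sake of the example) for an
unpaired surrogate escape:

    %% Sketch only: what to emit for an unpaired \uHHHH surrogate escape.
    unpaired_surrogate(Hex, Opts) ->
        case proplists:get_value(bad_escape, Opts, error) of
            error   -> erlang:error(badarg);
            keep    -> <<"\\u", Hex/binary>>;   % pass the escape through verbatim
            replace -> <<16#FFFD/utf8>>         % U+FFFD REPLACEMENT CHARACTER
        end.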

>>
>>>
>>>>> The discussions on when to convert atoms to JSON strings and JSON
>>>>> strings to atoms should probably be removed. In my experience, it is
>>>>> best if atoms can be converted to JSON strings because it allows
>>>>> people to write json:encode({[{foo, bar}]}). On the other hand, the
>>>>> ability to convert keys to atoms may look fine at first glance but in
>>>>> reality can cause lots of funny bugs.
>>>
>>> The options {label, atom} and especially {label, existing_atom} should be removed. The former provides an attack vector for any system that may be called upon to parse user provided json that is not easily worked around. The latter breaks referential transparency. The same function call with the same arguments may return different results at different times. Erlang is not strictly speaking a pure functional language, but it should strive to be whenever convenient. Labels are easily converted to whatever form the client wants after parsing.
>>>
>>>
>>> Unquoted keys and comments are often seen in JSON in common use. Both should be handled by the eep0018 implementation. However, more research is needed to determine the common consensus on how to handle these. Javascript identifiers and '/* .. */' style comments seem a decent starting point for debate.
>>
>> The only places I've seen such abominations are in parsers trying to
>> be clever. Not once in three years have I seen a complaint that
>> CouchDB doesn't allow comments or unquoted keys in JSON. The unquoted
>> keys part is slightly less of an abomination because it's valid
>> JavaScript object syntax so the examples could get confused in places.
>>
>> I would vote very much against the comments because that's just
>> promoting the embrace and extend philosophy. I could be reasoned with
>> to expose an option for unquoted keys.
>
> I am mostly concerned with accepting any potentially valid JSON input. Both unquoted keys and comments are frequently encountered. Options should exist to reject JSON containing both for those interested in only allowing strict JSON text.
>
>

Well first, unquoted keys and comments are definitely *not* valid
JSON. I'd be interested to hear where you're encountering either of
these in real life though. CouchDB has client libraries in about
twenty languages at present and I've never heard of anyone running
into either of these two issues.


