[eeps] JSON

Alisdair Sullivan alisdairsullivan@REDACTED
Mon Jul 4 08:18:20 CEST 2011


On 2011-07-03, at 3:32 PM, Paul Davis wrote:

> Firstly, anything with a decimal point or exponent needs to go through
> strtod as strtol rejects them. Special casing things that end with
> (.0+)?(E\d+)? or similar means that there would need to be special
> handling of the conversion. AFAIK even Erlang would require special
> case checks to reformat these to something it could handle.

This is true, but the reformatting is relatively straightforward. The advantages of bignum math outweigh the increased implementation burden.
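
To make the special casing concrete, here is a minimal sketch (module and function names are mine, not part of any EEP) of the reformatting Erlang needs: list_to_float/1 requires a digit on both sides of the decimal point, so valid JSON numbers like "1e5" must be rewritten before conversion.

    -module(json_num).
    -export([to_float/1]).

    %% list_to_float/1 rejects "1e5"; JSON permits an exponent with no
    %% fraction part, so insert ".0" before converting.
    to_float(Str) ->
        list_to_float(normalize(Str)).

    normalize(Str) ->
        case lists:member($., Str) of
            true  -> Str;                          %% "1.5e3" is fine as-is
            false ->
                {Int, Exp} = lists:splitwith(
                               fun(C) -> C =/= $e andalso C =/= $E end, Str),
                Int ++ ".0" ++ Exp                 %% "1e5" -> "1.0e5"
        end.

Integers without a fraction or exponent would go through list_to_integer/1 instead, which handles bignums directly.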


> More importantly though, for doubles, we would have to write our own
> implementation of strtod (whether or not at the Erlang or C level)
> because of the significant digit truncation which is silent.
> 
> For instance, Python and JavaScript both truncate:
> 
>     "1.1234567890123456789012345678901234567890"
> 
> to:
> 
>    1.123456789012345691247674039914
> 
> Which is what strtod gives, and strtod gives it without signaling an
> error. Thus, if the EEP includes "Can't be represented as a float or
> double, then return as binary" we'll have to implement our own strtod
> equivalent.

This is not what I meant. Your example is within the range of IEEE 754 doubles; rounding should be expected for values within roughly +/- 5.0e-324 to 1.7976...e308. Values outside this range are currently not representable as Erlang floats, but they are representable as strings. In the interests of fidelity, such JSON numbers should not automatically result in errors.
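
As a sketch of the policy I have in mind (the fallback clause is my assumption, not settled behaviour): attempt the float conversion, and hand back the original binary when the value overflows a double, rather than raising.

    %% Assumes the normalize/1 helper sketched earlier. Erlang has no
    %% infinity, so list_to_float/1 raises badarg on overflow such as
    %% "1.0e999"; we keep the textual form instead of erroring.
    decode_number(Bin) when is_binary(Bin) ->
        Str = binary_to_list(Bin),
        try
            list_to_float(normalize(Str))
        catch
            error:badarg -> Bin    %% not representable: return as binary
        end.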


> You misunderstood my argument I think. My point was that we *should*
> be parsing values and this breaks the guessing game. I haven't studied
> the exact bit patterns but I would think that bare strings whose first
> character was not in ascii could possibly break the guessing
> machinery. I can certainly come up with trivial examples that break
> the truth table listed in the RFC, but perhaps it's just
> underspecified. I also don't see how this would be a problem if we
> just dictated that all data passed to the parser is in a specific
> encoding and throw an error if its not.

A sequence of bytes either is valid JSON for a given encoding or it is not; the intended meaning of the byte sequence is irrelevant. It therefore doesn't really matter whether you can 'trick' the parser into accepting garbage data that is incidentally also JSON. It only matters that a given byte sequence cannot have two different valid interpretations when decoded under two distinct encodings. Since the codepoint U+0000 is never valid (unescaped) in a JSON text, and every possible JSON text starts with a codepoint in the ASCII range (0-127), none of the byte sequences that can begin a JSON text overlap across UTF-8, UTF-16 (little- or big-endian) and UTF-32 (little- or big-endian).
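
That disjointness can be turned directly into a detection function. This is a sketch rather than proposed EEP code; it relies only on the two facts above, so unlike the four-byte table in RFC 4627 it does not assume the second codepoint is also ASCII.

    %% The first codepoint is ASCII (nonzero) and U+0000 never appears,
    %% so the zero-byte pattern at the front is unambiguous. Clause
    %% order matters: each UTF-32 pattern must be tried before the
    %% UTF-16 pattern it would otherwise match.
    detect(<<0,0,0,_/binary>>)   -> {utf32, big};
    detect(<<0,_/binary>>)       -> {utf16, big};
    detect(<<_,0,0,0,_/binary>>) -> {utf32, little};
    detect(<<_,0,_/binary>>)     -> {utf16, little};
    detect(_)                    -> utf8.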

The only problem with auto-detection of encodings for JSON texts occurs when dealing with partial texts, for example when parsing incrementally. In a value context, where the entire byte sequence of the JSON text is known at runtime, this is not an issue.
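
To illustrate the incremental problem: a one-byte prefix such as <<0>> is consistent with both UTF-16BE and UTF-32BE, so a streaming detector must be able to suspend. A deliberately conservative sketch, assuming the detect/1 above:

    %% Until four bytes have arrived the prefix may still be ambiguous
    %% (<<0>> could be UTF-16BE or UTF-32BE; <<$[>> could be UTF-8,
    %% UTF-16LE or UTF-32LE), so ask the caller for more input.
    detect_stream(Bin) when byte_size(Bin) < 4 -> more;
    detect_stream(Bin)                         -> detect(Bin).

Note that a complete UTF-8 text shorter than four bytes (e.g. <<"1">>) would never be resolved by this loop, which is exactly why the whole-value case is the easy one.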


>> Another possible approach I failed to mention is to replace invalid or problematic escape sequences with the Unicode replacement codepoint, U+FFFD. This follows the JSON and Unicode spec recommendations, but loses potentially significant user data.
>> 
> 
> Yeah, my point of view is from a database author, so my stance is
> towards "give back what was given" as much as possible so I never
> considered the replacement scheme there. Adding it as an option seems
> sane, though I could see there being a couple corner cases with
> combining characters (ie, two high byte combining characters in a row,
> is that one or two U+FFFD insertions?).

It's two according to the Unicode spec. I'm still undecided on what the appropriate response to invalid codepoints is. I'd prefer to ignore invalid escape sequences and treat them as unescaped data, but the Unicode spec implies this is bad behaviour. However, I prefer to retain data whenever possible.
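
For reference, here is how the replacement option might look. This is a sketch, not the EEP's API; the pairing logic follows RFC 4627's \uXXXX escapes, and each ill-formed sequence maps to its own U+FFFD per the Unicode recommendation.

    %% Decodes \uXXXX escapes in a JSON string body (other escapes are
    %% omitted from this sketch). A valid surrogate pair yields one
    %% codepoint; each lone surrogate yields its own U+FFFD, so two bad
    %% escapes in a row produce two replacements.
    unescape(Bin) -> unescape(Bin, <<>>).

    unescape(<<>>, Acc) ->
        Acc;
    unescape(<<"\\u", Hi:4/binary, "\\u", Lo:4/binary, Rest/binary>>, Acc) ->
        H = hex(Hi),
        L = hex(Lo),
        if
            H >= 16#D800, H =< 16#DBFF, L >= 16#DC00, L =< 16#DFFF ->
                Cp = 16#10000 + ((H - 16#D800) bsl 10) + (L - 16#DC00),
                unescape(Rest, <<Acc/binary, Cp/utf8>>);
            true ->
                %% first escape is bad on its own; re-examine the second
                unescape(<<"\\u", Lo/binary, Rest/binary>>, emit(H, Acc))
        end;
    unescape(<<"\\u", Hex:4/binary, Rest/binary>>, Acc) ->
        unescape(Rest, emit(hex(Hex), Acc));
    unescape(<<C, Rest/binary>>, Acc) ->
        unescape(Rest, <<Acc/binary, C>>).

    emit(Cp, Acc) when Cp >= 16#D800, Cp =< 16#DFFF ->
        <<Acc/binary, 16#FFFD/utf8>>;        %% lone surrogate -> U+FFFD
    emit(Cp, Acc) ->
        <<Acc/binary, Cp/utf8>>.

    hex(Bin) -> list_to_integer(binary_to_list(Bin), 16).

So unescape(<<"\\uDC00\\uDC00">>) yields two replacement characters, while unescape(<<"\\uD834\\uDD1E">>) yields the single codepoint U+1D11E.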


> Well first, unquoted keys and comments are definitely *not* valid
> JSON. I'd be interested to hear where you're encountering either of
> these in real life though. CouchDB has client libraries in about
> twenty languages at present and I've never heard of anyone having
> issues with either of these two issues.

I've changed my mind on this after further discussion. Unquoted keys and comments should be rejected as errors.



