[eeps] JSON

Paul Davis paul.joseph.davis@REDACTED
Fri Jul 1 03:28:42 CEST 2011


Thanks for the thread continuation email, Joe. I wasn't subscribed
when this thread started.

Having implemented a couple of versions of functions similar to those
defined in EEP 0018, yet having diverged from it considerably, I
thought I might share some thoughts on the EEP as it is defined and as
it might be used.

Firstly, I would like to say that the EEP does a pretty good job of
identifying areas where RFC 4627 has some issues. The edge cases are
fairly well covered. I'll make a few notes in order of increasing
importance.

First, the statement in 4627, "The names within an object SHOULD be
unique.", is referenced, and the EEP falls on the side of rejecting
JSON that has repeated names. Personally, I would prefer to accept
repeated names, because that captures the lowest-common-denominator
JSON implementation. The other important bit is that other languages
don't complain about repeated keys: JavaScript, Ruby, and Python all
just end up using the last defined value.
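
For what it's worth, accepting repeated names costs the consumer
almost nothing. A minimal sketch, assuming a decoder that keeps all
pairs in order (the function name is mine, not from any library):

    %% "{\"a\":1,\"a\":2}" decoded as {[{<<"a">>,1},{<<"a">>,2}]};
    %% a last-wins lookup over the proplist matches what JavaScript,
    %% Ruby, and Python do.
    last_value(Key, {Pairs}) ->
        lists:foldl(fun({K, V}, _Acc) when K =:= Key -> V;
                       ({_K, _V}, Acc) -> Acc
                    end, undefined, Pairs).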

The EEP discusses the difference between an integer and a float in
multiple places. The language seems to revolve around being able to
detect the difference between a float and an integer. I know that it's
technically true that Erlang can detect the difference and JSON can't,
but when working with JSON in Erlang this has never been an issue in
my experience. For instance, consider a quick snippet:

1> case 1.0 of 1 -> yay; 1.0 -> nay end.
nay
2> case 1.0 of A when A == 1 -> yay; A when A == 1.0 -> nay end.
yay

You can't pattern match on the term, but the guard is fairly trivial.
I would argue that "if it looks like an integer, it's probably an
integer". From the point of view of a function that expects input from
a JSON decoder, I would consider requiring the pattern-matched value
to be a specific numeric type a buggy implementation (barring highly
specific constraints to the contrary). In the end, I would say to
relax the spec here: if someone needs to guarantee the type of a
number, they should do it in client code.
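
To illustrate, here is a minimal sketch of the sort of client-side
normalization I mean (my own helper, not anything from the EEP):

    %% Collapse a decoded JSON number to an integer when it "looks
    %% like" one; genuine floats pass through untouched.
    to_int_if_integral(N) when is_integer(N) -> N;
    to_int_if_integral(N) when is_float(N) ->
        T = trunc(N),
        case T == N of
            true  -> T;
            false -> N
        end.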

The second point on numbers in the EEP is the part about numeric
ranges. Obviously we'd like to support as much as we possibly can. The
EEP rightly points out that ranges are explicitly left unspecified by
the RFC and that an implementation is free to do as it wants. This is
a second vote in favor of "if it looks like an integer, treat it like
one", because of bignums. If a client needs to mutate the output of
the parser, they're more than welcome to write a simple conversion
function.
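
Erlang's bignums make this painless; a parser that maps
integer-looking text onto integers gets arbitrary precision for free:

    1> list_to_integer("123456789012345678901234567890").
    123456789012345678901234567890
    2> 123456789012345678901234567890 + 1.
    123456789012345678901234567891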

The most controversial aspect of the EEP in terms of something not
caused by RFC 4627 is the representation of an empty object. Alisdair
Sullivan (the author of jsx) and I have spent many hours arguing over
this exact detail. If you find us both bored on IRC at the same time,
we generally start this argument good-naturedly, just to amuse
ourselves as a running joke. But the bottom line is that we're both
familiar with the issue and with the ups and downs of each choice, and
we still cannot come to an agreement. Mandating a specific
representation is going to upset some people and please others. Making
it optional is a fine compromise, but then we'll argue over the default.
I obviously can't offer much in the way of advice on this issue except
to say that I am right and the one true object representation is {[]}.
XD
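
For reference, the usual contenders, as I understand the libraries at
the time of writing:

    %% EEP 18 (and me):       {[]}
    %% mochijson2:            {struct, []}
    %% jsx's term format:     [{}]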

Now we get to the crazy.

Everything in the EEP about encodings should be removed and replaced
with "we only support well formed UTF-8." I'm aware that's a bit
controversial so let me try and explain without too much of a rant
against 4627.

First, exhibit A, quoting RFC 4627:

    "JSON text SHALL be encoded in Unicode."

This is plainly a misunderstanding of what Unicode is. You can't
"encode in Unicode". Unicode is not an encoding. Unicode is a large
set of intertwined specifications on how to represent written language
in a computer. It's about as sane as saying "JSON text SHALL be
encoded in Salted Sardines".

Now, it's possible to consider the greater context and parse this
along the lines of "JSON text SHALL be represented as an encoding of
Unicode characters", and then we can start to make a bit more headway.
The RFC goes on to explain how you can detect the encoding based on
the first four bytes of the JSON text. Although, as pointed out in the
EEP, this is kinda nutty because it is unintentionally saying "JSON
text SHALL be encoded in UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, or
UTF-32LE", which is a considerably different proposition.
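
The sniff itself is mechanical enough; a sketch (my function, not the
EEP's), relying on the RFC's observation that the first two characters
of a JSON text are always ASCII:

    %% Clause order matters: a 32-bit input would also match the
    %% corresponding 16-bit pattern, so check the wider forms first.
    detect_encoding(<<0, 0, 0, _/binary>>)    -> {utf32, big};
    detect_encoding(<<0, _, 0, _/binary>>)    -> {utf16, big};
    detect_encoding(<<_, 0, 0, 0, _/binary>>) -> {utf32, little};
    detect_encoding(<<_, 0, _, 0, _/binary>>) -> {utf16, little};
    detect_encoding(_)                        -> utf8.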

The bottom line is that JSON has no reliable in-band method for
determining character encoding. The only way to make it work sanely is
to declare one encoding and stick to it. Fortunately, most
implementations seem to follow along, supporting only UTF-8. jsx is
the only library I know of that attempts to support the spec point by
point on this (disregarding languages that have a notion of Unicode
strings separate from normal strings).

Now, if that's not crazy enough, we get to the \uHHHH escape sequence
definition. RFC 4627 says that strings may contain a \u escape
followed by four hex characters. The RFC has some language about the
Basic Multilingual Plane and an example of surrogate pairs, but fails
to cover the various ways this might break. For instance, what happens
if one half of a surrogate pair is present without an appropriate mate?
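
The happy path is simple arithmetic; a sketch (mine) that folds
\uD834\uDD1E back into U+1D11E:

    %% Anything that falls outside these guard ranges is an unpaired
    %% surrogate, and the RFC never says what to do with it.
    combine_surrogates(Hi, Lo) when Hi >= 16#D800, Hi =< 16#DBFF,
                                    Lo >= 16#DC00, Lo =< 16#DFFF ->
        16#10000 + ((Hi - 16#D800) bsl 10) + (Lo - 16#DC00).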

I was told specifically by Douglas Crockford in a thread on
es5-discuss that implementations are expected to treat these escapes
as bytes. There's no easy way to say it: this is just nuts. This
requirement means that a conforming JSON parser must allow string data
into Erlang terms that would raise exceptions when passed through the
functions in the unicode module (this has bitten us in CouchDB
multiple times).
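
A lone surrogate is not a valid Unicode scalar value, so the unicode
module refuses it; illustrative shell session (the exact error shape
may vary by release):

    1> unicode:characters_to_binary([16#D800]).
    {error,<<>>,[55296]}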

</rant>

Anyway, that's a long way of saying that JSON is weirder than it looks
at first glance. A lot of these issues have been the cause of patches
applied to mochijson2.


Back to more concrete EEP related things:

First, the current EEP frames itself as a specification of how data is
converted as it passes through functions to and from JSON. I'm a firm
believer in this approach. As it mentions later on, there are two
fundamentally different APIs for the conversion: value-based and
event-based. My primary concern is value-based, but there are very
obvious scenarios where an event-based parser is preferable (CouchDB
uses both). Out of pragmatism, I would say that any BIFs should be
value-based, because the event-based approach opens up a lot more
surface area to spec out, with little prior art that I'm aware of; but
that's not a major issue.
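
To make the distinction concrete, here are rough shapes for the two
styles (a sketch; json_term() and event() are placeholder types of
mine, not the EEP's):

    %% Value-based: one call, whole term in or out.
    -spec decode(binary()) -> json_term().

    %% Event-based: the parser drives a caller-supplied fun over a
    %% stream of events (start_object, {key, K}, {value, V},
    %% end_object, ...), roughly SAX-style.
    -spec decode(binary(), fun((event(), Acc) -> Acc), Acc) -> Acc.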

Second, the EEP uses the function names term_to_json and json_to_term.
I would point out that these names imply that the conversion is
lossless. There is some discussion in the EEP admitting that it isn't
and that there's no way to make it so. I would suggest changing them
to either json:encode and json:decode, or perhaps erlang:encode_json
and erlang:decode_json, as the community sees fit. It's minor, but
names that don't suggest a guaranteed identity conversion would remove
the need for a lot of text on why the conversion isn't lossless.

Other places in the spec talk about JSON strings compared to binaries
and lists. I'm pretty sure the EEP rules out converting from JSON *to*
an Erlang string. This is good because other languages do not conflate
[102, 111, 111] with "foo" and allowing a conversion there would lend
itself to very confusing conversations with non-Erlangers.

The discussions on when to convert atoms to JSON strings and JSON
strings to atoms should probably be removed. In my experience, it is
best if atoms can be converted to JSON strings, because it allows
people to write json:encode({[{foo, bar}]}). On the other hand, the
ability to convert keys to atoms may look fine at first glance, but in
reality it can cause lots of funny bugs.
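
To illustrate with a hypothetical json module: encoding can be liberal
about atoms while decoding stays strict, at the cost of an asymmetric
round trip:

    1> json:encode({[{foo, bar}]}).
    <<"{\"foo\":\"bar\"}">>
    2> json:decode(<<"{\"foo\":\"bar\"}">>).
    {[{<<"foo">>,<<"bar">>}]}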

For instance, consider pattern matching a term returned from
json:decode. One of the proposals is to convert keys to atoms when
they can be converted. This means your return value might have some
keys as binaries and some as atoms. If you're writing a function that
mutates the returned term in a way that touches keys, it is quite
possible that you will have to special-case both binary and atom keys.
The other obvious argument (which is detailed in the EEP) is that this
is an attack vector for malicious clients: atoms are never garbage
collected, so sending many JSON texts with many different keys
eventually exhausts the atom table and kills the Erlang VM. I'm all
for "let it fail", but "here is how to kill me" is probably not good.

Another fun one that I learned from a not-JSON example is the
to_existing_atom option. This can lead to different results depending
on when the JSON is parsed and what code is loaded. Basically, the
output of the json:decode function would depend on the current atom
table. When you get into code reloading and other areas, this gets a
bit wonky. While it may prevent the attack vector, it introduces the
possibility of very hard-to-track-down bugs, where people iterating
over a proplist discover that a key is a binary on one VM and an atom
on another.
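
The conversion pass in question is typically something like this
sketch, and its result depends entirely on which atoms happen to exist
when it runs:

    %% Keys come back as atoms only if the atom already exists, i.e.
    %% only if some currently loaded module happens to mention it.
    maybe_atom(Bin) when is_binary(Bin) ->
        try
            binary_to_existing_atom(Bin, utf8)
        catch
            error:badarg -> Bin
        end.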

If you say "check for both" we get to the conclusion "always use
binaries" which is most sane. If users need to extract keys to be
later used as atoms, they can do that quite easily.


With all that said, I would vote that Erlang not adopt an "official"
implementation, at least for the time being. JSON is an enchantingly
simple specification, but when push comes to shove it's terribly
complex to nail down into any sort of consensus. I like the libraries
I've written. Alisdair's jsx is extremely well written. Yet we have
wildly different ideas of "how things should be" in relation to JSON.

In closing, the original motivation for this thread was Robert Virding
musing, "The important thing is that it is *there* and that it is a
good representation, otherwise we might end up with something bad just
because that is all there is." I would argue that there are a number
of quality JSON implementations and choosing one now would be to end
innovation on the matter.

Thanks,
Paul J Davis




On Thu, Jun 30, 2011 at 6:05 PM, Joe Williams <joe@REDACTED> wrote:
> This is for Paul Davis.
>
> --
> Name: Joseph A. Williams
> Email: joe@REDACTED
> Blog: http://www.joeandmotorboat.com/
> Twitter: http://twitter.com/williamsjoe
>
> On Thursday, June 30, 2011 at 2:51 PM, Loïc Hoguin wrote:
>
> On 06/30/2011 11:39 PM, Robert Virding wrote:
>
> At the Erlang Factory in London after the EEPs run-through we had a
> small very informal discussion. As a result of that and after a
> discussion on erlang-questions I think it is very important that we
> decide something about eep-18 and JSON. I think we should propose a
> standard representation and write an OTP module which implements
> encoding/decoding this. The first version doesn't have to be that
> fast, mochijson2 which is being used apparently isn't fast, and it can
> be improved later both with better erlang and NIFs. The important
> thing is that it is *there* and that it is a good representation,
> otherwise we might end up with something bad just because that is all
> there is.
>
> jsx is already very good. It implements the EEP and is faster and more
> convenient to use than mochijson2 IMHO.
>
> See https://github.com/talentdeficit/jsx
>
> --
> Loïc Hoguin
> Dev:Extend


