[erlang-questions] [eeps] Joe Armstrong's suggestion for JSON<->Erlang BIFs

Fri Jul 25 10:15:27 CEST 2008

Richard,

Thanks for the great work on the JSON EEPS proposal!

Robert Chu and I wrote the first public Erlang library for JSON
(released thanks to A2Z Development, "an Amazon.com Company"), later
the basis for the parser in Yaws, and inspiration for dozens of
other independently-developed libraries whose authors must have
thought "I can do better than _this_!"

With that dubious introduction, let me offer some feedback.

In message <87B91711-F378-4A2B-BE43-75D388A354E6@REDACTED> you write:
>Specification
>
>    Four new functions are added to the erlang: module.
>
>    erlang:json_to_term(Io_List) -> term()
>    erlang:json_to_term(Io_List, Option_List) -> term()

This assumes some sort of framing of the Io_List that limits it
to a single JSON term.  Since JSON is mostly self-framing (every
value except a number declares its own end), and since at least one
historical JSON-based transport took advantage of this (JSON-RPC 1.0
over raw TCP), is there any interest in a flavor of the parser that
allows continuations of unparsed input?  For example, our parsing
interface was:

    decode(Continuation, CharList) =>
        {done, Term, LeftoverChars} or {more, Continuation}
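
To make the shape of such an interface concrete, here is a minimal
sketch that does only the framing half: it splits one complete
top-level object or array off a character list and returns its raw
text instead of a parsed term.  The module name, start/0 and the
state record are hypothetical, not the A2Z library's actual code,
and the scanner does not check that brackets match in kind.

```erlang
-module(json_frame_sketch).
-export([start/0, decode/2]).

%% Hypothetical scanner state: nesting depth, string mode, and the
%% characters consumed so far (in reverse order).
-record(st, {depth = 0, mode = normal, acc = []}).

%% Initial continuation.
start() -> #st{}.

%% decode(Continuation, CharList) ->
%%     {done, ValueChars, LeftoverChars} | {more, Continuation}
decode(St, Chars) -> scan(Chars, St).

%% Out of input before the value closed: hand back a continuation.
scan([], St) ->
    {more, St};
%% Inside a string only a backslash and the closing quote are special.
scan([$\\ | T], St = #st{mode = string, acc = Acc}) ->
    scan(T, St#st{mode = escape, acc = [$\\ | Acc]});
scan([$" | T], St = #st{mode = string, acc = Acc}) ->
    scan(T, St#st{mode = normal, acc = [$" | Acc]});
scan([C | T], St = #st{mode = string, acc = Acc}) ->
    scan(T, St#st{acc = [C | Acc]});
%% The character after a backslash is taken literally.
scan([C | T], St = #st{mode = escape, acc = Acc}) ->
    scan(T, St#st{mode = string, acc = [C | Acc]});
%% From here on the scanner is outside any string (mode = normal).
scan([$" | T], St = #st{depth = D, acc = Acc}) when D > 0 ->
    scan(T, St#st{mode = string, acc = [$" | Acc]});
scan([C | T], St = #st{depth = D, acc = Acc}) when C =:= ${; C =:= $[ ->
    scan(T, St#st{depth = D + 1, acc = [C | Acc]});
%% The bracket that closes the outermost value ends the frame.
scan([C | T], #st{depth = 1, acc = Acc}) when C =:= $}; C =:= $] ->
    {done, lists:reverse([C | Acc]), T};
scan([C | T], St = #st{depth = D, acc = Acc})
  when D > 1, (C =:= $} orelse C =:= $]) ->
    scan(T, St#st{depth = D - 1, acc = [C | Acc]});
%% Whitespace between top-level values is skipped.
scan([C | T], St = #st{depth = 0})
  when C =:= $\s; C =:= $\n; C =:= $\r; C =:= $\t ->
    scan(T, St);
%% Anything else inside the value is kept verbatim.
scan([C | T], St = #st{depth = D, acc = Acc}) when D > 0 ->
    scan(T, St#st{acc = [C | Acc]});
%% Bare numbers, strings and literals at top level are not handled.
scan([C | _], #st{depth = 0}) ->
    erlang:error({unexpected_char, C}).
```

A caller feeds each chunk read from the socket to decode/2, and on
{done, Value, Leftover} starts over with start() and Leftover.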

>        The {binary,true} option says to convert all JSON strings
>        to Erlang binaries, even if they are keys in key:value pairs.
>        The {binary,false} option says to convert keys to atoms if
>        possible; it is the default.

I assume that the JSON strings will be turned into binaries according
to their wire format, i.e. according to the 'encoding' setting
above.

>    erlang:term_to_json(Term) -> binary()
>    erlang:term_to_json(Term, Option_List) -> Binary()

Unless the self-framing property of JSON is used in the transport,
the JSON term will have to be embedded in some sort of framing
protocol.  Might it give useful freedom to the implementation to
have term_to_json return an iodata() (which subsumes the option of
returning a single binary) to avoid unneeded data copies in the
output pipeline?

(Also more symmetric with json_to_term.)
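
As an illustration of the copy-avoidance argument, a framing layer
can wrap iodata() without flattening it.  This is a sketch under
assumptions: the 4-byte big-endian length prefix and the module and
function names are made up for the example, not part of the
proposal.

```erlang
-module(json_frame_out_sketch).
-export([frame/1]).

%% Prefix an encoded JSON value with a 4-byte big-endian length.
%% The payload is nested into the result as it is; only its size is
%% computed, and no new binary is built from it.
frame(JsonIoData) ->
    Len = iolist_size(JsonIoData),
    [<<Len:32/big>>, JsonIoData].
```

The result can be handed directly to gen_tcp:send/2, which accepts
iodata().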

>    Converting JSON to Erlang.
>
>        [...]
>
>        A string is converted to a UTF-8-encoded binary,
>        except where it occurs as a label in an "object".

Just to make it explicit, you should probably spell out that the
empty string becomes a zero-length binary, and vice-versa for the
opposite conversion.

>    Converting Erlang to JSON.
>
>        The atoms null, false, and true are converted to the
>        corresponding JSON keywords.  No other Erlang atoms are
>        allowed.

...except for object list keys.

>        An Erlang integer is converted to a JSON integer.
>        An Erlang float is converted to a JSON float, as precisely
>        as practical.

If the Erlang float has no fractional part, should it get a token
".0" padding?  Or maybe just a bare decimal point?

>Rationale
>
>    [...]
>
>    Clearly, Erlang->JSON->Erlang is going to be tricky.  To take
>    just one minor point, neither www.json.org nor RFC 4627 makes
>    an promises whatever about the range of numbers that can be
>    passed through JSON.  There isn't even any minimum range.  It
>    seems as though a JSON implementation could reject all numbers
>    other than 0 as too large and still conform!  This is stupid.
>    We can PROBABLY rely on IEEE doubles; we almost certainly cannot
>    expect to get large integers through JSON.

Well, this seems to be an issue for those in the business of
designing data formats based on JSON.  I don't think it's an
issue for writing a parser in Erlang, and I see no need for
Erlang to hold itself back from exploiting its built-in bignums.

There are potential round-tripping problems for the path:

    Erlang->JSON->RandomOtherJsonProcessingChain->JSON->Erlang

not only from big integers, but also from the size of strings,
number of elements in a sequence, nesting depth, total size of
input, etc.  I don't think there's much to be gained from doing
any more than sweeping these under a "quality of implementation"
rug and moving on...

>    No, the point of JSON support in Erlang is to let Erlang programs
>    deal with the JSON data that other people are sending around the
>    net, and to send JSON data to other programs (like scripts in Web
>    browsers) that are expecting plain old JSON.  The round trip
>    conversion we need to care about is JSON -> Erlang -> JSON.

The E-J-E round-trip shouldn't be ignored in the design and
implementation, though.  When I wrote the unit tests for the A2Z
library, it was easiest to test the E-J-E round-tripping - with a
special equivalence test that ignored reordering of members within
an object list.  It's hard even to specify what semantic
equivalence of JSON strings means without translating to a
high-level representation - such as Erlang terms.
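
An order-insensitive equivalence test of that kind might look like
the sketch below.  It assumes the term layout shown in the proposal
(an object is a list of {Key, Value} pairs, an array is any other
list, strings are binaries); the module and function names are
hypothetical and this is not the A2Z test code.

```erlang
-module(json_equiv_sketch).
-export([equivalent/2, canon/1]).

%% Two terms are equivalent if they are identical once every
%% object's members have been put into a canonical order.
equivalent(A, B) ->
    canon(A) =:= canon(B).

%% A list that starts with a 2-tuple is taken to be an object:
%% canonicalise each value, then sort the members by key.
canon([{_, _} | _] = Members) ->
    lists:keysort(1, lists:map(fun({K, V}) -> {K, canon(V)} end, Members));
%% Any other list is an array: element order is significant.
canon(List) when is_list(List) ->
    [canon(E) || E <- List];
%% Numbers, binaries and the atoms true, false, null pass through.
canon(Other) ->
    Other.
```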

You could also define semantic round-tripping via string equality
of J2 and J3 in

    J1 -> E1 -> J2 -> E2 -> J3

where J1 is an arbitrary JSON string.
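
That check is easy to state as a function over whatever decoder and
encoder are under test.  In this sketch Decode and Encode are funs
supplied by the caller (standing in for json_to_term/1 and
term_to_json/1, which do not exist yet), and the module name is
hypothetical.

```erlang
-module(json_roundtrip_sketch).
-export([stable/3]).

%% True if re-encoding reaches a fixed point after one round:
%% J1 -> E1 -> J2 -> E2 -> J3 with J2 and J3 byte-for-byte equal.
%% Encoder output is flattened so that iodata() results compare
%% by content rather than by shape.
stable(J1, Decode, Encode) ->
    E1 = Decode(J1),
    J2 = iolist_to_binary(Encode(E1)),
    E2 = Decode(J2),
    J3 = iolist_to_binary(Encode(E2)),
    J2 =:= J3.
```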

>    The main thing I have not accounted for is the {binary,true}
>    option of json_to_term/2.  For normal Erlang purposes, it is
>    much nicer (and somewhat more efficient) to deal with
>
>        [{name,<<"fred">>},{female,false},{age,65}]
>
>    than with
>
>        [{<<"name">>,<<"fred">>},{<<"female">>,false},{<<"age">>,65}]
>
>    If you are communicating with a trusted source that deals with
>    a known small number of labels, fine.  There are limits on the
>    number of atoms Erlang can deal with.  A small test program
>    that looped creating atoms and putting them into a list ticked
>    over happily until shortly after its millionth atom, and then
>    hung there burning cycles apparently getting nowhere.  Also,
>    the atom table is shared by all processes on an Erlang node,
>    so garbage collecting it is not as cheap as it might be.  As
>    a system integrity measure, therefore, it is useful to have a
>    mode of operation in which json_to_term never creates atoms.
>    Whether the default behaviour should be "safe" or "readable"
>    really depends on whether you intend to accept JSON from untrusted    
>    sources or not.  I've chosen the default to be what I would want
>    to use most of the time, but this is after all only a proposal.

There's another option.  The list_to_existing_atom(String) builtin
returns the atom whose text representation is String, but only if
such an atom already exists (i.e. it's already interned), raising
badarg otherwise.  So with existing mechanisms we could have options

    {object_label, binary}
    {object_label, atom}
    {object_label, existing_atom} % otherwise binary

Since loading an application interns the atoms in its
source code, applications wouldn't have to do anything special
(except perhaps during code upgrades) to expect their JSON input
to be in the nicer term format.

Labels that are expected in the data, but not explicitly handled
by the application, could get an explicit nod in the source code
to enjoy the more efficient translation.

    ballast_labels() -> [rare_flag, unused_option, ignored_field].

Unexpected (or un-internable) labels would remain in binaries.
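
A sketch of how a decoder could apply the three settings to a single
label follows.  The module and function names are hypothetical, and
it uses binary_to_existing_atom/2 (the binary counterpart of the
builtin mentioned above) on the assumption that labels arrive as
UTF-8 binaries.

```erlang
-module(json_label_sketch).
-export([label/2]).

%% Convert one object label according to the object_label option.
label(Bin, binary) ->
    Bin;
label(Bin, atom) ->
    %% Unsafe for untrusted input: may create a new atom.
    binary_to_atom(Bin, utf8);
label(Bin, existing_atom) ->
    %% Never creates an atom; unknown or over-long labels stay binary.
    try
        binary_to_existing_atom(Bin, utf8)
    catch
        error:badarg -> Bin;
        error:system_limit -> Bin
    end.
```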

Hope the feedback is useful.

Jim