EEP: XXX
Title: JSON bifs
Version: $Revision: 14 $
Last-Modified: $Date: 2007-06-29 16:24:01 +0200 (Fri, 29 Jun 2007) $
Author: Richard A. O'Keefe <ok@cs.otago.ac.nz>
Status: Draft
Type: Standards Track
Erlang-Version: R12B-4
Content-Type: text/plain
Created: 25-Jul-2008
Post-History:


Abstract

    "JSON (JavaScript Object Notation) is a lightweight
    data-interchange format. It is easy for humans to read and write.
    It is easy for machines to parse and generate."
    (Opening text of the http://www.json.org web site.)

    JSON is specified by RFC 4627, which defines a Media Type
    application/json.

    There are JSON libraries for a wide range of languages, so it is
    a useful format.  There are already JSON bindings for Erlang,
    but on the 24th of July 2008, Joe Armstrong suggested that it
    would be worth having built in functions to convert Erlang terms
    to and from the JSON format.

    term_to_json        -- convert a term to JSON form
    json_to_term        -- convert a JSON form to Erlang


Specification

    Four new functions are added to the erlang: module.

    erlang:json_to_term(Io_List) -> term()
    erlang:json_to_term(Io_List, Option_List) -> term()

        Types:
            Io_List = iolist()
            Option_List = [Option]
            Option = {encoding,atom()}
                   | {float,bool()}
                   | {binary,bool()}

        json_to_term(X) is equivalent to json_to_term(X, []).

        The Io_List implies a sequence of bytes.

        The encoding option says what character encoding to use for
        converting those bytes to characters.  The default encoding
        is UTF-8.  All encodings supported elsewhere in Erlang should
        be supported here.

        The {float,true} option says to convert all JSON numbers to
        Erlang floats, even if they look like integers.
        The {float,false} option says to convert integers to integers;
        it is the default.

        The {binary,true} option says to convert all JSON strings
        to Erlang binaries, even if they are keys in key:value pairs.
        The {binary,false} option says to convert keys to atoms if
        possible; it is the default.

        Other options may be added in the future.

        The mapping from JSON to Erlang is described below in this
        section.  An argument that is not a well formed Io_List,
        or that cannot be decoded, or that when decoded does not
        follow the rules of JSON syntax, results in a badarg
        exception.  [It would be nice if there were Erlang-wide
        conventions for distinguishing these cases.]

    erlang:term_to_json(Term) -> binary()
    erlang:term_to_json(Term, Option_List) -> binary()

        Types:
            Term = term()
            Option_List = [Option]
            Option = {encoding,atom()}
                   | {space,int()}
                   | space
                   | {indent,int()}
                   | indent

        This is a function for producing portable JSON.
        It is not intended as a means for encoding arbitrary
        Erlang terms.  Terms that do not fit into the mapping
        scheme described below in this section result in a
        badarg exception.

        term_to_json(X) is equivalent to term_to_json(X, []).

        Converting Erlang terms to JSON results in a (logical)
        character sequence, which is encoded as a sequence of
        bytes, which is returned as a binary.  The default encoding
        is UTF-8; this may be overridden by the encoding option.
        Any encoding supported elsewhere in Erlang should be
        supported here.

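        The JSON RFC says that "The names within an object SHOULD be
        unique."  term_to_json enforces this:  a term that would
        produce an "object" with two equivalent keys results in a
        badarg exception.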

        There are two options for controlling white space.
        By default, none is generated.

        {space,N}, where N is a non-negative integer, says to
        add N spaces after each colon and comma.
        'space' is equivalent to {space,1}.
        No other space is ever inserted.

        {indent,N}, where N is a non-negative integer, says
        to add a line break and some indentation after each
        comma.  The indentation is N spaces for each enclosing
        [] or {}.  Note that this still does not result in any
        other spaces being added; in particular ] and } will
        not appear at the beginning of lines.
        'indent' is equivalent to {indent,1}.
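
        The two white space options can be read as rules for what
        follows each colon and comma.  The following is a minimal
        sketch of that reading, not part of the proposal.  The module
        and function names are hypothetical, and letting indent take
        precedence over space after a comma is an assumption:  the
        text above does not say how the two options combine.

```erlang
-module(json_ws_sketch).
-export([after_colon/1, after_comma/2]).

%% White space emitted after a ':'.  Only the space option applies.
after_colon(Opts) ->
    spaces(width(space, Opts, 0)).

%% White space emitted after a ','.  Depth is the number of
%% enclosing [] or {}.  ASSUMPTION: if indent is given it wins
%% over space; the proposal does not settle this.
after_comma(Depth, Opts) ->
    case width(indent, Opts, none) of
        none -> spaces(width(space, Opts, 0));
        N    -> [$\n | spaces(N * Depth)]
    end.

%% A bare atom such as 'space' appears in a property list as
%% {space,true}, which stands for a width of 1.
width(Name, Opts, Default) ->
    case proplists:get_value(Name, Opts, Default) of
        true  -> 1;
        Other -> Other
    end.

spaces(N) -> lists:duplicate(N, $\s).
```

        With no options both functions return the empty string, so
        the default output contains no white space at all.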

        Other options may be added in the future.

    Converting JSON to Erlang.

        The keywords null, false, and true are converted to the
        corresponding Erlang atoms.  No other JSON values are
        converted to atoms; labels in "objects" are a separate
        case, described below.

        A number is converted to an Erlang float if
        - it contains a decimal point, or
        - it contains an exponent, or
        - the option {float,true} was passed.
        A JSON number that looks like an integer will be converted to
        an Erlang integer unless {float,true} was provided.
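
        The number rule can be sketched as follows.  This is an
        illustration only:  the module name is hypothetical, and the
        literal is assumed to have been checked against the JSON
        grammar already.

```erlang
-module(json_number_sketch).
-export([convert/2]).

%% convert(Literal, Opts): Literal is a string holding one JSON
%% number; Opts is an option list as for json_to_term/2.
convert(Literal, Opts) ->
    ForceFloat  = proplists:get_bool(float, Opts),
    HasFraction = lists:member($., Literal),
    HasExponent = lists:member($e, Literal)
                  orelse lists:member($E, Literal),
    if
        HasFraction ->
            list_to_float(Literal);
        HasExponent ->
            %% list_to_float/1 insists on a fraction part, so
            %% "1e5" is rewritten as "1.0e5" first.
            IsDigitPart = fun(C) -> C =/= $e andalso C =/= $E end,
            {Mantissa, Exponent} = lists:splitwith(IsDigitPart, Literal),
            list_to_float(Mantissa ++ ".0" ++ Exponent);
        ForceFloat ->
            float(list_to_integer(Literal));
        true ->
            list_to_integer(Literal)
    end.
```

        An integer-looking literal too large for a double would
        make float/1 fail under {float,true}; a real implementation
        would have to decide how to report that.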

        A string is converted to a UTF-8-encoded binary,
        except where it occurs as a label in an "object".

        A sequence is converted to an Erlang list.  The elements have
        the same order in the list as in the original sequence.

        An empty "object" is converted to an empty tuple.
        A non-empty "object" is converted to a list of {Key,Value}
        pairs.  Keys in the JSON form are always strings.  Unless
        {binary,true} was passed, a Key is converted to an Erlang
        atom if and only if every character in the JSON string can
        be represented in an atom.  Currently this means Latin-1
        characters.  Any other key is converted to a UTF-8-encoded
        binary, as is every key when {binary,true} was passed.
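
        A minimal sketch of the key rule, assuming the label has
        already been unescaped into a valid UTF-8 binary.  The module
        name is hypothetical.  The Latin-1 test is written as
        "code point at most 255", which is what the rule above means
        today.

```erlang
-module(json_key_sketch).
-export([convert/2]).

%% convert(Key, Opts): Key is the label as a UTF-8 binary;
%% Opts is an option list as for json_to_term/2.
convert(Key, Opts) when is_binary(Key) ->
    case proplists:get_bool(binary, Opts) of
        true ->
            %% {binary,true}: never create atoms.
            Key;
        false ->
            Chars = unicode:characters_to_list(Key, utf8),
            case lists:all(fun(C) -> C =< 255 end, Chars) of
                true  -> list_to_atom(Chars);  % every character is Latin-1
                false -> Key                   % keep as a UTF-8 binary
            end
    end.
```

        Note that list_to_atom/1 also fails for names longer than
        255 characters; the sketch does not handle that case.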

        This means that if you read and convert a JSON term now,
        and save the binary somewhere, then read and convert it in
        a later fully-Unicode Erlang, you will find the
        representations different.  However, the order of the pairs
        in a JSON "object" has no significance, and an implementation
        of this specification is free to report them in any order it
        likes (as given, reversed, sorted, sorted by some hash, you
        name it).  Within any particular Erlang version, this
        conversion is a pure function, but different Erlang releases
        may change the order of pairs, so you cannot expect exactly
        the same term from release to release anyway.

        We could insist on a canonical form:  don't convert to atoms,
        and always sort.  However, it seems more useful to use atoms
        whenever we can, and as yet there seems to be no compelling
        reason to insist on any particular order.

    Converting Erlang to JSON.

        The atoms null, false, and true are converted to the
        corresponding JSON keywords.  No other Erlang atoms are
        allowed.

        An Erlang integer is converted to a JSON integer.
        An Erlang float is converted to a JSON float, as precisely
        as practical.

        An Erlang binary that is the UTF-8 representation of some
        Unicode string is converted to a string.  No other binaries
        are allowed.

        A non-empty Erlang list whose elements are all {Key,Value}
        pairs, where Key is either an atom or a binary that is the
        UTF-8 representation of some Unicode string, is converted
        to a JSON "object".  The order of the key:value pairs in
        the JSON form is the same as the order of the {Key,Value}
        pairs in the list.  A list with two equivalent keys is not
        allowed.

        Any Erlang list that contains an element that is a non-empty
        tuple is not allowed if it would not convert to an "object".

        An Erlang proper list whose elements are not tuples is
        converted to a JSON sequence by converting its elements in
        natural order.

        An improper list is not allowed.

        An empty tuple {} is converted to an empty "object".
        Other tuples are not allowed except as elements of lists
        of {Key,Value} pairs.

        Other Erlang terms are not allowed.  If you want to "tunnel"
        other Erlang terms through JSON, fine, but it is entirely up
        to you to do whatever conversion you want.
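
        The rules above amount to a test for whether a term may be
        passed to term_to_json at all.  The following sketch spells
        that test out.  It is an illustration, not a reference
        implementation:  the module and function names are
        hypothetical, and treating the atom a and the binary <<"a">>
        as "equivalent" keys is an assumption about what equivalence
        means here.

```erlang
-module(json_term_check).
-export([encodable/1]).

%% encodable(Term) -> true if Term fits the Erlang-to-JSON mapping.
encodable(null)  -> true;
encodable(false) -> true;
encodable(true)  -> true;
encodable(N) when is_number(N) -> true;
encodable(B) when is_binary(B) -> is_utf8(B);
encodable({}) -> true;
encodable(L) when is_list(L) ->
    proper(L) andalso
        case lists:any(fun(E) -> is_tuple(E) andalso E =/= {} end, L) of
            true  -> object(L);                    % must be all pairs
            false -> lists:all(fun encodable/1, L) % a plain sequence
        end;
encodable(_) -> false.   % other atoms, other tuples, pids, funs, ...

proper([])      -> true;
proper([_ | T]) -> proper(T);
proper(_)       -> false.

object(Pairs) ->
    lists:all(fun pair/1, Pairs) andalso unique_keys(Pairs).

pair({K, V}) when is_atom(K)   -> encodable(V);
pair({K, V}) when is_binary(K) -> is_utf8(K) andalso encodable(V);
pair(_) -> false.

%% ASSUMPTION: keys are equivalent when they spell the same string.
unique_keys(Pairs) ->
    Keys = [key_text(K) || {K, _} <- Pairs],
    length(lists:usort(Keys)) =:= length(Keys).

key_text(K) when is_atom(K)   -> atom_to_binary(K, utf8);
key_text(K) when is_binary(K) -> K.

is_utf8(B) -> is_binary(unicode:characters_to_binary(B, utf8, utf8)).
```

        A term_to_json implementation would raise badarg exactly
        where this function returns false.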


Motivation

    As Joe Armstrong put it in his message,
    "JSON seems to be ubiquitous".
    It should not only be supported, it should be supported
    simply, efficiently, and reliably.

    As noted above, http://www.ietf.org/rfc/rfc4627.txt
    defines an application/json Media Type that Erlang
    should be able to handle "out of the box".


Rationale

    First, let's consider conversion.  Round trip conversion fidelity
    (X -> Y -> X should be an identity function) is always nice.  Can
    we have it?

    JSON has
        - null
        - false
        - true
        - number (integers, floats, and ratios are not distinguished)
        - string
        - sequence (called array)
        - record (called object)
    Erlang has
        - atom
        - number (integers and floats are distinguished)
        - binary
        - list
        - tuple
        - pid
        - port
        - reference
        - fun

    Clearly, Erlang->JSON->Erlang is going to be tricky.  To take
    just one minor point, neither www.json.org nor RFC 4627 makes
    any promises whatever about the range of numbers that can be
    passed through JSON.  There isn't even any minimum range.  It
    seems as though a JSON implementation could reject all numbers
    other than 0 as too large and still conform!  This is stupid.
    We can PROBABLY rely on IEEE doubles; we almost certainly cannot
    expect to get large integers through JSON.

    Converting pids, ports, and references to textual form using
    pid_to_list/1, erlang:port_to_list/1, and erlang:ref_to_list/1
    is possible.  A built in function can certainly convert back
    from textual form if we want it to.  The problem is telling these
    strings from other strings:  when is "<0.43.0>" a pid and when is
    it a string?  As for funs, let's not go there.

    Basically, converting Erlang terms to JSON so that they can be
    reconstructed as the same (or very similar) Erlang terms would
    involve something like this:

        atom -> string
        number -> number
        binary -> {"type":"binary", "data":[<bytes>]}
        list   -> <list>, if it's a proper list
        list   -> {"type":"dotted", "data":<list>, "end":<last cdr>}
        tuple  -> {"type":"tuple",  "data":<tuple as list>}
        pid    -> {"type":"pid",    "data":<pid as string>}
        port   -> {"type":"port",   "data":<port as string>}
        ref    -> {"type":"ref",    "data":<ref as string>}
        fun    -> {"module":<m>, "name":<n>, "arity":<a>}
        fun    -> we're pushing things a bit for anything else.

    This is not part of the specification because I am not proposing
    JSON as a representation for arbitrary Erlang data.  I am making
    the point that we COULD represent (most) Erlang data in JSON if
    we really wanted to, but it is not an easy or natural fit.  For
    that we have Erlang binary format and we have UBF.  To repeat,
    we have no reason to believe that a JSON->JSON copier that works
    by decoding JSON to an internal form and recoding it for output
    will preserve Erlang terms, even encoded like this.

    No, the point of JSON support in Erlang is to let Erlang programs
    deal with the JSON data that other people are sending around the
    net, and to send JSON data to other programs (like scripts in Web
    browsers) that are expecting plain old JSON.  The round trip
    conversion we need to care about is JSON -> Erlang -> JSON.

    Here too we run into problems.  The obvious way to represent
    {"a":A, "b":B} in Erlang is [{'a',A},{'b',B}], and the obvious
    way to represent a string is as a list of characters.  But in
    JSON, an empty list, an empty "object", and an empty string are
    all clearly distinct, so must be translated to different Erlang
    terms.  Bearing this in mind, here's a first cut at mapping
    JSON to Erlang:

        - null => the atom 'null'
        - false => the atom 'false'
        - true => the atom 'true'
        - number => a float if there is a decimal point or exponent,
                 => an integer otherwise
        - string => a UTF-8-encoded binary
        - sequence => a list
        - record => a list of {Key,Value} pairs
                 => the empty tuple {} for an empty {} object

    Since Erlang does not currently allow the full range of
    Unicode characters in an atom, a Key should be an atom if
    each character of a label fits in Latin 1, or a binary if
    it does not.

    Suppose you are receiving JSON data from a source that does
    not distinguish between integers and floating point numbers?
    Perl, for example, or even more obviously, Javascript itself.
    In that case some floating point numbers may have been written
    in integer style more or less accidentally.  In such a case, you
    may want all the numbers in a JSON form converted to Erlang
    floats.  {float,true} was provided for that purpose.

    The corresponding mapping from Erlang to JSON is

        - atom => itself if it is null, false, or true
               => error otherwise
        - number => itself; use full precision for floats
        - binary => if the binary is a well formed UTF-8 encoding
                    of some string, that string
                 => error otherwise
        - list => an error if it is not proper (ends with non-[])
               => if non-empty and all elements are {Key,Value}
                  tuples, then {Key1:Value1, ..., Keyn:Valuen}
               => if any element is a tuple, an error
               => if it is proper, itself as a sequence
        - {} => an empty "object"
        - otherwise, an error

    There is an issue here with keys.  The RFC says that "The names
    within an object SHOULD be unique."  In the spirit of "be
    generous in what you accept, strict in what you generate", we
    really ought to check that.  The only time term_to_json/[1,2]
    terminate successfully should be when the output is absolutely
    perfect JSON.  I did toy with the idea of an option to allow
    duplicate labels, but if I want to send such non-standard data,
    who can I send it to?  Another Erlang program?  Then I would be
    better to use external binary format.  So the only options now
    allowed are ones that affect the encoding and white space.  One
    might add an option later to specify the order of key:value
    pairs somehow, but only options that do not affect the
    semantics are appropriate.

    How should the error be reported?  There are two ways to
    report such errors in Erlang:  raise 'badarg' exceptions,
    or return either {ok,Result} or {error,Reason} answers.
    I'm really not at all sure what to do here.  I ended up
    with 'raise badarg' because that's what things like
    binary_to_term/1 do.
   
    There are three "round trip" issues left:

    - all information about white space is lost

    - decimal->binary->decimal conversion of floating point numbers
      may introduce error unless techniques like those described in    
      the Scheme report are used to do these conversions with high
      accuracy.  This is a general problem for Erlang, and a general
      problem for JSON.

    - conversion of a string to a binary and then a binary to a
      string will not always yield the same representation, but
      what you get will represent the same string.  For example,
      "\u0041" will read as <<65>>, which will display as "A".

    There is one important issue for JSON generation, and that is
    what white space should be generated.  Since JSON is supposed to
    be "human readable", it would be nice if it could be indented,
    and if it could be kept to a reasonable line width.  However,
    appearances to the contrary, JSON has to be regarded as a binary
    format.  There is no way to insert line breaks inside strings.
    Javascript doesn't have any analogue of C's <backslash><newline>
    continuation; it can always join the pieces with '+'.  JSON has
    inherited the lack (no line continuation) but not the remedy
    (you may not use '+' in JSON).  So a JSON form containing a
    1000-character string cannot be fitted into 80-column lines;
    it just cannot be done.

    The main thing I have not accounted for is the {binary,true}
    option of json_to_term/2.  For normal Erlang purposes, it is
    much nicer (and somewhat more efficient) to deal with

        [{name,<<"fred">>},{female,false},{age,65}]

    than with

        [{<<"name">>,<<"fred">>},{<<"female">>,false},{<<"age">>,65}]

    If you are communicating with a trusted source that deals with
    a known small number of labels, fine.  There are limits on the
    number of atoms Erlang can deal with.  A small test program
    that looped creating atoms and putting them into a list ticked
    over happily until shortly after its millionth atom, and then
    hung there burning cycles apparently getting nowhere.  Also,
    the atom table is shared by all processes on an Erlang node,
    so garbage collecting it is not as cheap as it might be.  As
    a system integrity measure, therefore, it is useful to have a
    mode of operation in which json_to_term never creates atoms.
    Whether the default behaviour should be "safe" or "readable"
    really depends on whether you intend to accept JSON from untrusted    
    sources or not.  I've chosen the default to be what I would want
    to use most of the time, but this is after all only a proposal.


Backwards Compatibility

    There are no term_to_json/N or json_to_term/N functions in
    the erlang: module now, so adding them should not break
    anything.  These functions will NOT be automatically imported;
    it will be necessary to use an explicit erlang: prefix.  So
    any existing code that uses these function names won't notice
    any change.


Reference Implementation

    None.


References
    
    None.


Copyright

    This document has been placed in the public domain.


Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End: