EEP: XXX
Title: JSON bifs
Version: $Revision: 14 $
Last-Modified: $Date: 2007-06-29 16:24:01 +0200 (Fri, 29 Jun 2007) $
Author: Richard A. O'Keefe
Status: Draft
Type: Standards Track
Erlang-Version: R12B-4
Content-Type: text/plain
Created: 28-Jul-2008
Post-History:


Abstract

    "JSON (JavaScript Object Notation) is a lightweight
    data-interchange format. It is easy for humans to read and
    write. It is easy for machines to parse and generate."
    (Opening text of the http://www.json.org web site.)

    JSON is specified by RFC 4627, which defines a Media Type
    application/json.

    There are JSON libraries for a wide range of languages, so it is
    a useful format. There are already JSON bindings for Erlang, but
    on the 24th of July 2008, Joe Armstrong suggested that it would
    be worth having built in functions to convert Erlang terms to and
    from the JSON format.

        term_to_json -- convert a term to JSON form
        json_to_term -- convert a JSON form to Erlang


Specification

    Three new types are added to the vocabulary of well known types
    to be used in edoc.

        @type json_label() = atom() | binary().
        @type json(L, N) = null | false | true
                         | N
                         | binary()
                         | [json(L, N)]
                         | tuple({L, json(L, N)}).
        @type json() = json(json_label(), number()).

    Four new functions are added to the erlang: module.

        erlang:json_to_term(IO_Data) -> json()
        erlang:json_to_term(IO_Data, Option_List) -> json()

    Types:
        IO_Data = iodata()
        Option_List = [Option]
        Option = {encoding,atom()} | {float,bool()}
               | {label,existing_atom|atom|binary}

    json_to_term(X) is equivalent to json_to_term(X, []).

    The IO_Data implies a sequence of bytes.

    The encoding option says what character encoding to use for
    converting those bytes to characters. The default encoding
    is UTF-8. All encodings supported elsewhere in Erlang should
    be supported here.

    The {float,true} option says to convert all JSON numbers to
    Erlang floats, even if they look like integers.
    With this option, the result has type json(L, float()).
    The {float,false} option says to convert integers to integers;
    it is the default.
    With this option, the result has type json(L, number()).

    The {label,binary} option says to convert all JSON strings
    to Erlang binaries, even if they are keys in key:value pairs.
    With this option, the result has type json(binary(), N).

    The {label,atom} option says to convert keys to atoms if
    possible, leaving other strings as binaries.
    With this option, the result has type json(json_label(), N).

    The {label,existing_atom} option says to convert keys to
    atoms if the atoms already exist, leaving other keys as
    binaries. All other strings remain binaries too.
    With this option, the result has type json(json_label(), N).

    The default is {label,existing_atom}, combining safety and
    convenience.

    Other options may be added in the future.

    The mapping from JSON to Erlang is described below in this
    section. An argument that is not a well formed IO_Data, or
    that cannot be decoded, or that when decoded does not follow
    the rules of JSON syntax, results in a badarg exception.
    [It would be nice if there were Erlang-wide conventions for
    distinguishing these cases.]

        erlang:term_to_json(JSON) -> binary()
        erlang:term_to_json(JSON, Option_List) -> binary()

    Types:
        JSON = json()
        Option_List = [Option]
        Option = {encoding,atom()} | {space,int()} | space
               | {indent,int()} | indent

    This is a function for producing portable JSON. It is not
    intended as a means for encoding arbitrary Erlang terms.
    Terms that do not fit into the mapping scheme described below
    in this section result in a badarg exception. The JSON RFC
    says that "The names within an object SHOULD be unique."
    JSON terms that violate this should also result in a badarg
    exception.

    term_to_json(X) is equivalent to term_to_json(X, []).

    Converting Erlang terms to JSON results in a (logical)
    character sequence, which is encoded as a sequence of bytes,
    which is returned as a binary. The default encoding is UTF-8;
    this may be overridden by the encoding option.
    Any encoding supported elsewhere in Erlang should be supported
    here.

    There are two options for controlling white space. By default,
    none is generated.

    {space,N}, where N is a non-negative integer, says to add
    N spaces after each colon and comma. 'space' is equivalent
    to {space,1}. No other space is ever inserted.

    {indent,N}, where N is a non-negative integer, says to add
    a line break and some indentation after each comma.
    The indentation is N spaces for each enclosing [] or {}.
    Note that this still does not result in any other spaces
    being added; in particular ] and } will not appear at the
    beginning of lines. 'indent' is equivalent to {indent,1}.

    Other options may be added in the future.

    Converting JSON to Erlang.

    The keywords null, false, and true are converted to the
    corresponding Erlang atoms. No other complete JSON forms are
    converted to atoms.

    A number is converted to an Erlang float if
      - it contains a decimal point, or
      - it contains an exponent, or
      - the option {float,true} was passed.
    A JSON number that looks like an integer will be converted to
    an Erlang integer unless {float,true} was provided.

    A string is converted to a UTF-8-encoded binary, except where
    it occurs as a label in an "object".

    A sequence is converted to an Erlang list. The elements have
    the same order in the list as in the original sequence.

    An "object" is converted to a tuple of {Key,Value} pairs.
    Keys in the JSON form are always strings. A Key is converted
    to an Erlang atom if and only if
    (every character in the JSON string can be represented in an
    atom) and (the 'binary' option is not specified as true).
    Currently this means that names made of Latin-1 characters
    will be atoms. Any other key is converted to a UTF-8-encoded
    binary.

    This means that if you read and convert a JSON term now,
    and save the binary somewhere, then read and convert it in
    a later fully-Unicode Erlang, you will find the
    representations different.
    However, the order of the pairs in a JSON "object" has no
    significance, and an implementation of this specification is
    free to report them in any order it likes (as given, reversed,
    sorted, sorted by some hash, you name it). Within any
    particular Erlang version, this conversion is a pure function,
    but different Erlang releases may change the order of pairs,
    so you cannot expect exactly the same term from release to
    release anyway.

    We could insist on a canonical form: don't convert to atoms,
    and always sort. However, it seems more useful to use atoms
    whenever we can, and as yet there seems to be no compelling
    reason to insist on any particular order.

    In the spirit of "be generous in what you accept, strict in
    what you produce", it might be a good idea to accept unquoted
    labels in the input. You can't accept just any old junk, but
    allowing anything that looks like an unquoted Erlang atom or
    an Erlang variable would make sense. I am informed that there
    are JSON generators out there that do this, so allowing this
    will make the interface more useful.

    Converting Erlang to JSON.

    The atoms null, false, and true are converted to the
    corresponding JSON keywords. No other Erlang atoms are
    allowed.

    An Erlang integer is converted to a JSON integer.

    An Erlang float is converted to a JSON float, as precisely as
    practical. An Erlang float which has an integral value is
    written in such a way that it will read back as a float;
    suitable methods include suffixing ".0" or "e0".

    An Erlang binary that is the UTF-8 representation of some
    Unicode string is converted to a string. No other binaries
    are allowed.

    An Erlang tuple whose elements are all {Key,Value} pairs is
    converted to a JSON object. Each Key should be either an atom
    or a binary that is the UTF-8 representation of some Unicode
    string. The order of the key:value pairs in the JSON form is
    the same as the order of the {Key,Value} pairs in the tuple.
    A tuple with two equivalent keys is not allowed.
    Two binaries, or two atoms, are equivalent iff they are equal.
    An atom and a binary are equivalent if they would convert to
    the same JSON string. No other tuples are allowed.

    An Erlang proper list is converted to a JSON sequence by
    converting its elements in natural order. An improper list
    is not allowed.

    Other Erlang terms are not allowed. If you want to "tunnel"
    other Erlang terms through JSON, fine, but it is entirely up
    to you to do whatever conversion you want.


Motivation

    As Joe Armstrong put it in his message, "JSON seems to be
    ubiquitous". It should not only be supported, it should be
    supported simply, efficiently, and reliably.

    As noted above, http://www.ietf.org/rfc/rfc4627.txt defines
    an application/json Media Type that Erlang should be able to
    handle "out of the box".


Rationale

    First, let's consider conversion. Round trip conversion
    fidelity (X -> Y -> X should be an identity function) is
    always nice. Can we have it?

    JSON has
      - null
      - false
      - true
      - number (integers, floats, and ratios are not distinguished)
      - string
      - sequence (called array)
      - record (called object)

    Erlang has
      - atom
      - number (integers and floats are distinguished)
      - binary
      - list
      - tuple
      - pid
      - port
      - reference
      - fun

    Clearly, Erlang->JSON->Erlang is going to be tricky. To take
    just one minor point, neither www.json.org nor RFC 4627 makes
    any promises whatever about the range of numbers that can be
    passed through JSON. There isn't even any minimum range. It
    seems as though a JSON implementation could reject all numbers
    other than 0 as too large and still conform! This is stupid.
    We can PROBABLY rely on IEEE doubles; we almost certainly
    cannot expect to get large integers through JSON.

    Converting pids, ports, and references to textual form using
    pid_to_list/1, erlang:port_to_list/1, and
    erlang:ref_to_list/1 is possible. A built in function can
    certainly convert back from textual form if we want it to.
    The problem is telling these strings from other strings: when
    is "<0.43.0>" a pid and when is it a string? As for funs,
    let's not go there.

    Basically, converting Erlang terms to JSON so that they can be
    reconstructed as the same (or very similar) Erlang terms would
    involve something like this:

        atom   -> string
        number -> number
        binary -> {"type":"binary", "data":[]}
        list   -> , if it's a proper list
        list   -> {"type":"dotted", "data":, "end":}
        tuple  -> {"type":"tuple", "data":}
        pid    -> {"type":"pid", "data":}
        port   -> {"type":"port", "data":}
        ref    -> {"type":"ref", "data":}
        fun    -> {"module":, "name":, "arity":}
        fun    -> we're pushing things a bit for anything else.

    This is not part of the specification because I am not
    proposing JSON as a representation for arbitrary Erlang data.
    I am making the point that we COULD represent (most) Erlang
    data in JSON if we really wanted to, but it is not an easy or
    natural fit. For that we have Erlang binary format and we
    have UBF. To repeat, we have no reason to believe that a
    JSON->JSON copier that works by decoding JSON to an internal
    form and recoding it for output will preserve Erlang terms,
    even encoded like this.

    No, the point of JSON support in Erlang is to let Erlang
    programs deal with the JSON data that other people are sending
    around the net, and to send JSON data to other programs (like
    scripts in Web browsers) that are expecting plain old JSON.
    The round trip conversion we need to care about is
    JSON -> Erlang -> JSON.

    Here too we run into problems. The obvious way to represent
    {"a":A, "b":B} in Erlang is [{'a',A},{'b',B}], and the obvious
    way to represent a string is as a list of characters. But in
    JSON, an empty list, an empty "object", and an empty string
    are all clearly distinct, so must be translated to different
    Erlang terms.
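
    To make the clash concrete, here is a minimal sketch. The
    module and function names are hypothetical and are not part of
    the proposed interface. naive/1 follows the "obvious" mapping
    just described; chosen/1 uses a binary for strings, a list for
    sequences, and a tuple for "objects", as in the Specification
    section.

```erlang
-module(json_empty_sketch).
-export([naive/1, chosen/1]).

%% Hypothetical illustration only; neither function is part of the
%% proposed erlang: interface.

%% The "obvious" mapping: strings as character lists, sequences as
%% lists, "objects" as lists of pairs.  All three empty forms
%% collapse to [] (in Erlang, "" and [] are the same term).
naive(empty_string) -> "";
naive(empty_array)  -> [];
naive(empty_object) -> [].

%% A mapping that keeps the three empty forms apart:
%% binary for strings, list for sequences, tuple for "objects".
chosen(empty_string) -> <<>>;
chosen(empty_array)  -> [];
chosen(empty_object) -> {}.
```

    Under naive/1 a JSON -> Erlang -> JSON round trip cannot tell
    which empty form it started from; under chosen/1 it can.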

    Bearing this in mind, here's a first cut at mapping JSON to
    Erlang:

      - null     => the atom 'null'
      - false    => the atom 'false'
      - true     => the atom 'true'
      - number   => a float if there is a decimal point or exponent,
                 => an integer otherwise
      - string   => a UTF-8-encoded binary
      - sequence => a list
      - object   => a list of {Key,Value} pairs
                 => the empty tuple {} for an empty {} object

    Since Erlang does not currently allow the full range of
    Unicode characters in an atom, a Key should be an atom if each
    character of a label fits in Latin 1, or a binary if it does
    not.

    Let's examine "objects" a little more closely. Erlang
    programmers are used to working with lists of {Key,Value}
    pairs. The standard library even includes orddict, which works
    with just such lists (although they must be sorted). However,
    there is something distasteful about having empty objects
    convert to empty tuples, but non-empty objects to non-empty
    lists, and there is also something distasteful about lists
    converting to sequences or objects depending on what is inside
    them. What is distasteful here has something to do with TYPES.
    Erlang doesn't have static types, but that does not mean that
    types are not useful as a design tool, or that something
    resembling type consistency is not useful to people.

    The fact that Erlang tuples happen to use curly braces is just
    icing on the cake. The first draft of this EEP used lists;
    that was entirely R.A.O'K's own work. It was then brought to
    his attention that Joe Armstrong thought converting "objects"
    to tuples was the right thing to do. The deciding factor was
    the possibility of defining @json/2.

    Suppose you are receiving JSON data from a source that does
    not distinguish between integers and floating point numbers?
    Perl, for example, or even more obviously, Javascript itself.
    In that case some floating point numbers may have been written
    in integer style more or less accidentally. In such a case,
    you may want all the numbers in a JSON form converted to
    Erlang floats.
    {float,true} was provided for that purpose.

    The corresponding mapping from Erlang to JSON is

      - atom   => itself if it is null, false, or true
               => error otherwise
      - number => itself; use full precision for floats
      - binary => if the binary is a well formed UTF-8 encoding
                  of some string, that string
               => error otherwise
      - tuple  => if all elements are {Key,Value} pairs with
                  non-equivalent keys, then a JSON "object",
               => error otherwise
      - list   => if it is proper, itself as a sequence
               => error otherwise
      - otherwise, an error

    There is an issue here with keys. The RFC says that "The
    names within an object SHOULD be unique." In the spirit of
    "be generous in what you accept, strict in what you generate",
    we really ought to check that. The only time
    term_to_json/[1,2] terminates successfully should be when the
    output is absolutely perfect JSON. I did toy with the idea of
    an option to allow duplicate labels, but if I want to send
    such non-standard data, who can I send it to? Another Erlang
    program? Then I would do better to use external binary format.

    So the only options now allowed are ones to affect white
    space. One might add an option later to specify the order of
    key:value pairs somehow, but options that do not affect the
    semantics are appropriate.

    How should the error be reported? There are two ways to
    report such errors in Erlang: raise 'badarg' exceptions, or
    return either {ok,Result} or {error,Reason} answers. I'm
    really not at all sure what to do here. I ended up with
    'raise badarg' because that's what things like
    binary_to_term/1 do.

    There are four "round trip" issues left:

      - all information about white space is lost

      - decimal->binary->decimal conversion of floating point
        numbers may introduce error unless techniques like those
        described in the Scheme report are used to do these
        conversions with high accuracy. This is a general problem
        for Erlang, and a general problem for JSON.

      - there is another JSON library for Erlang that always
        converts integers outside the 32-bit range to floating
        point. This seems like a bad idea. There are languages
        (Scheme, Common Lisp, SWI Prolog, Smalltalk) with JSON
        libraries that have bignums. Why put an arbitrary
        restriction on our ability to communicate with them?
        Any JSON implementation that is unable to cope with large
        integers as integers is (or should be) perfectly able to
        convert such numbers to floating-point for itself. It
        seems especially silly to do this when you consider that
        the program on the other end might itself be in Erlang.
        So we expect that if T is of type
        json(binary(),integer()) then
        json_to_term(term_to_json(T), [{label,binary}])
        should be identical to T.

      - conversion of a string to a binary and then a binary to a
        string will not always yield the same representation, but
        what you get will represent the same string. For example,
        "\u0041" will read as <<65>>, which will display as "A".

    There is one important issue for JSON generation, and that is
    what white space should be generated. Since JSON is supposed
    to be "human readable", it would be nice if it could be
    indented, and if it could be kept to a reasonable line width.
    However, appearances to the contrary, JSON has to be regarded
    as a binary format. There is no way to insert line breaks
    inside strings. Javascript doesn't have any analogue of C's
    line continuation; it can always join the pieces with '+'.
    JSON has inherited the lack (no line continuation) but not the
    remedy (you may not use '+' in JSON). So a JSON form
    containing a 1000-character string cannot be fitted into
    80-column lines; it just cannot be done.

    The main thing I have not accounted for is the {label,_}
    option of json_to_term/2.
    For normal Erlang purposes, it is much nicer (and somewhat
    more efficient) to deal with

        {{name,<<"fred">>},{female,false},{age,65}}

    than with

        {{<<"name">>,<<"fred">>},{<<"female">>,false},{<<"age">>,65}}

    If you are communicating with a trusted source that deals
    with a known small number of labels, fine. There are limits
    on the number of atoms Erlang can deal with. A small test
    program that looped creating atoms and putting them into a
    list ticked over happily until shortly after its millionth
    atom, and then hung there burning cycles apparently getting
    nowhere. Also, the atom table is shared by all processes on
    an Erlang node, so garbage collecting it is not as cheap as it
    might be. As a system integrity measure, therefore, it is
    useful to have a mode of operation in which json_to_term never
    creates atoms.

    But Erlang offers a third possibility: there is a built-in
    list_to_existing_atom/1 function that returns an atom only if
    that atom already exists. Otherwise it raises an exception.

    So there are three cases:

    {label,binary}
        Always convert labels to binaries.
        This is always safe and always clumsy.
        Since <<"xxx">> syntax exists in Erlang, it isn't _that_
        clumsy.

    {label,atom}
        Always convert labels to atoms if all their characters are
        allowed in atoms, leave them as binaries otherwise.
        This is more convenient for Erlang programming. However,
        it is only really usable with a partner that you trust.

    {label,existing_atom}
        Convert labels that match the names of existing atoms to
        those atoms, leave all others as binaries.
        If a module mentions an atom, and goes looking for that
        atom as a key, it will find it. This is safe _and_
        convenient. The only real issue with it is that the same
        JSON term converted at different times (in the same Erlang
        node) may be converted differently. This usually won't
        matter.

    I have selected 'existing_atom' as the default, but appreciate
    that reasonable people may well prefer 'binary'.
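
    The three cases can be sketched in a few lines of Erlang. This
    is only an illustration under stated assumptions, not the
    proposed implementation: the module and the convert_label/2
    helper are hypothetical, the key is assumed to arrive as a
    UTF-8 binary, and the sketch relies on binary_to_atom/2 and
    binary_to_existing_atom/2 from current Erlang/OTP releases
    (which may be newer than the release this EEP targets) instead
    of list_to_existing_atom/1. It also treats "allowed in atoms"
    as whatever binary_to_atom/2 accepts, which need not match the
    Latin-1 rule described above.

```erlang
-module(json_label_sketch).
-export([convert_label/2]).

%% Hypothetical helper, not part of the proposed interface.
%% Key is a JSON object key as a UTF-8 binary; the second argument
%% is the value given in the {label,_} option.

%% {label,binary}: never create or look up atoms.
convert_label(Key, binary) when is_binary(Key) ->
    Key;

%% {label,atom}: may create a new atom, so use it only with a
%% trusted partner.  Keys that cannot become atoms (for example,
%% keys that are too long) stay binaries.
convert_label(Key, atom) when is_binary(Key) ->
    try binary_to_atom(Key, utf8)
    catch error:_ -> Key
    end;

%% {label,existing_atom}: reuse an atom only if it already exists,
%% so the atom table never grows.
convert_label(Key, existing_atom) when is_binary(Key) ->
    try binary_to_existing_atom(Key, utf8)
    catch error:_ -> Key
    end.
```

    With this sketch, a key that names an atom already known to
    the node comes back as that atom under existing_atom, while a
    key nobody has mentioned comes back unchanged as a binary.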

    It has been suggested to me that it might be better for the
    result of term_to_json/[1,2] to be iodata() rather than a
    binary(). Anything that would have accepted iodata() will be
    happy with a binary(), so the question is whether it is better
    for the implementation, whether perhaps there are chunks of
    stuff that have to be copied using a binary() but can be
    shared using iodata(). Thanks to the encoding issue, I don't
    really think so. This might be a good time to point out why
    the encoding is done here rather than somewhere else. If you
    know that you are generating stuff that will be encoded into
    character set X, then you can avoid generating characters that
    are not in that character set. You can generate \u sequences
    instead. Of course JSON itself requires UTF-8, but what if
    you are going to send it through some other transport? With
    {encoding,ascii} you are out of trouble all the way. So for
    now I am sticking with binary().

    The final issue is whether these functions should go in the
    erlang: module or in some other module (perhaps called json:).

      - If another module, then there is no barrier to adding
        other functions. For example, we might offer functions to
        test whether a term is a JSON term, or an IO_Data
        represents a JSON term, or alternative functions that
        present results in some canonical form.

      - If another module, then someone looking for a JSON module
        might find one.

      - If another module, then this interface can easily be
        prototyped without any modification to the core Erlang
        system.

      - If another module, then someone who doesn't need this
        feature need not load it.

    Conversely,

      - If another module, then it is too easy to bloat the
        interface. We don't _need_ such testing functions, as we
        can always catch the badarg exception from the existing
        ones. We don't _need_ extra canonicalising functions,
        because we can add options to the existing ones.
        Something that subtly encourages us to keep the number of
        functions down is a Good Thing.

      - Every Erlang programmer ought to be familiar with the
        erlang: module, and when looking for any feature, ought to
        start by looking there.

      - There are JSON implementations in Erlang already; we know
        what it is like to use such a thing, and we only need to
        settle the fine details of the implementation. We know
        that it can be implemented. Now we want something that is
        always there and always the same and is as efficient as
        practical.

      - In particular, we know that the feature is useful, and we
        know that in applications where it is used, it will be
        used often, so we want it to go about as fast as
        term_to_binary/1 and binary_to_term/1. So we'd really
        like it to be implemented in C, ideally inside the
        emulator. Erlang does not make dynamic loading of foreign
        code modules easy.

    It's a delicate balance. On the whole, I still think that
    putting these functions in erlang: is a good idea, but more
    reasons on both sides would be useful.


Backwards Compatibility

    There are no term_to_json/N or json_to_term/N functions in the
    erlang: module now, so adding them should not break anything.
    These functions will NOT be automatically imported; it will be
    necessary to use an explicit erlang: prefix. So any existing
    code that uses these function names won't notice any change.


Reference Implementation

    None.


References

    (1) The JSON web site, http://www.json.org/
    (2) The JSON RFC, http://www.ietf.org/rfc/rfc4627.txt


Copyright

    This document has been placed in the public domain.


Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End: