[erlang-questions] binary typed schema-less protocol

Mon Jul 29 03:10:48 CEST 2013

On 26/07/2013, at 10:55 PM, Motiejus Jakštys wrote:
> Found it.
> 
> http://tnetstrings.org/

May I suggest *not* using that approach?

Let's start with size.  I have a small JSON test collection.
Here is the size of the output when each element of that
collection is read and then written back out as
  json - JSON with a newline after every comma and
         a space after every colon, but no indentation
  xml  - using <n/>, <t/>, <f/>, <v>number</v>, <s>string</s>,
         <a>e1 ... en</a> or <d>e1 ... en</d>, with the key
         strings appearing as key='..' attributes on the children
         of <d> elements
  cson - described below, using base 10 for integers
  c64  - same as cson, but using base 64 for integers
  tns  - using the encoding described at tnetstrings.org

json	xml	cson	c64	tns
9	31	8	8	16
385	470	341	336	387
201	273	167	165	204
428	545	362	350	422
2864	3262	2676	2550	2903
680	905	543	534	659
25	57	22	17	34
258	356	204	204	249
33	48	27	27	33
529	715	421	410	515
1192	1621	939	911	1161
192	255	160	154	191
2425	3289	1853	1807	2378
23	30	20	12	26
23	30	20	12	26
32	52	13	12	39
50	82	37	37	50
42	109	35	26	61
7	10	5	5	6

I'm rather pleased that an encoding (cson) that I threw together
in a couple of minutes handily beats the rest, but not surprised.
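
For an overall figure rather than row-by-row comparison, the columns
can be totalled.  The following is a small Python sketch; the sizes
are copied verbatim from the table above, and the names NAMES, ROWS,
column_totals and relative_to are made up for this illustration.

```python
# Sizes copied verbatim from the table above, one tuple per test document.
NAMES = ("json", "xml", "cson", "c64", "tns")
ROWS = [
    (9, 31, 8, 8, 16),
    (385, 470, 341, 336, 387),
    (201, 273, 167, 165, 204),
    (428, 545, 362, 350, 422),
    (2864, 3262, 2676, 2550, 2903),
    (680, 905, 543, 534, 659),
    (25, 57, 22, 17, 34),
    (258, 356, 204, 204, 249),
    (33, 48, 27, 27, 33),
    (529, 715, 421, 410, 515),
    (1192, 1621, 939, 911, 1161),
    (192, 255, 160, 154, 191),
    (2425, 3289, 1853, 1807, 2378),
    (23, 30, 20, 12, 26),
    (23, 30, 20, 12, 26),
    (32, 52, 13, 12, 39),
    (50, 82, 37, 37, 50),
    (42, 109, 35, 26, 61),
    (7, 10, 5, 5, 6),
]


def column_totals(rows):
    """Return {encoding name: total size over all test documents}."""
    return {name: sum(row[i] for row in rows) for i, name in enumerate(NAMES)}


def relative_to(totals, baseline="json"):
    """Express every total as a fraction of the baseline encoding's total."""
    return {name: total / totals[baseline] for name, total in totals.items()}


if __name__ == "__main__":
    totals = column_totals(ROWS)
    for name, ratio in relative_to(totals).items():
        print(f"{name:5s} {totals[name]:6d} {ratio:6.3f}")
```

Ratios below 1 mean the encoding is smaller than plain JSON over the
whole collection.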

There is a simple blunder in the TNetStrings design that causes
serious inefficiency if you try to transport nontrivial
data that way:  the "type" code is at the wrong end.

You have to read an entire object before you can start
decoding it, which is just plain silly.  _And_ it is hard
to transmit floating-point numbers accurately.
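
To make the "wrong end" problem concrete, here is a minimal Python
sketch of a TNetStrings decoder.  It assumes the layout and type codes
given on tnetstrings.org (length, colon, payload, then one type byte:
"," string, "#" integer, "^" float, "!" boolean, "~" null, "]" list,
"}" dictionary); the function name parse_tnetstring is invented here.
Note where the type code is looked at.

```python
def parse_tnetstring(data: bytes):
    """Decode one TNetString from the front of `data`.

    Returns (value, remaining_bytes).
    """
    length_text, _, rest = data.partition(b":")
    length = int(length_text.decode("ascii"))
    payload = rest[:length]            # the whole body must be in hand...
    tag = rest[length:length + 1]      # ...before the type code can be seen
    remaining = rest[length + 1:]

    if tag == b",":
        return payload, remaining      # byte string, returned as-is
    if tag == b"#":
        return int(payload.decode("ascii")), remaining
    if tag == b"^":
        return float(payload.decode("ascii")), remaining
    if tag == b"!":
        return payload == b"true", remaining
    if tag == b"~":
        return None, remaining
    if tag == b"]":
        items = []                     # element count unknown: cannot preallocate
        while payload:                 # second pass over bytes already buffered
            item, payload = parse_tnetstring(payload)
            items.append(item)
        return items, remaining
    if tag == b"}":
        result = {}
        while payload:
            key, payload = parse_tnetstring(payload)
            value, payload = parse_tnetstring(payload)
            result[key] = value
        return result, remaining
    raise ValueError(f"unknown type code {tag!r}")
```

For example, parse_tnetstring(b"8:1:1#1:2#]") has to slice out all
eight payload bytes before it learns that they form a list.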

Not only that, you cannot stream the output.  There is a
JsonGenerator interface (javax.json.stream) recently added to Java
so that you can stream large amounts of data out without actually
having to hold much in memory; this is a need people genuinely have.

Let's just look at the output code for three techniques, taken
from my Smalltalk library.  To keep it simple, let's just look
at arrays.

    printJsonOn: aStream
      aStream nextPut: $[.
      self do: [:each | each printJsonOn: aStream]
           separatedBy: [aStream nextPut: $,; cr].
      aStream nextPut: $].

	Output goes directly to the output stream with NO
	intermediate objects created.  You can stream this
	without knowing the size of the virtual array until
	the end.

    printCsonOn: aStream
      self size printOn: aStream.
      aStream nextPut: $[.
      self do: [:each | each printCsonOn: aStream].

	Output goes directly to the output stream with NO
	intermediate objects created.  You can stream this
	as long as you know the size of the virtual array
	at the beginning.

    printTNetStringOn: aStream
      |s|
      s := StringBuffer new: self size * 6.
      self do: [:each | each printTNetStringOn: s].
      s size printOn: aStream.
      aStream nextPut: $:; nextPutAll: s; nextPut: $].

	OUCH!  You have to convert every element to a string,
	concatenate them, then write the size of the *string*
	(not the *array*), and then the string.  The whole
	thing has to be held in memory as a string.  You cannot
	stream this.
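
For readers who do not have a Smalltalk to hand, the same three
writers can be sketched in Python.  This is an illustration only, not
a port of the library above: it handles arrays of integers and nothing
else, and the function names are invented.  The items argument can be
any iterable, including a generator.

```python
import io


def cson_int(n):
    """cson integer: decimal digits (empty for zero) followed by the sign."""
    return f"{abs(n) or ''}{'-' if n < 0 else '+'}"


def tns_int(n):
    """TNetString integer: length of the text, colon, decimal text, '#'."""
    text = str(n)
    return f"{len(text)}:{text}#"


def write_json_array(items, out):
    """JSON: emit as we go; the element count is never needed."""
    out.write("[")
    for i, n in enumerate(items):
        if i:
            out.write(",\n")
        out.write(str(n))
    out.write("]")


def write_cson_array(items, count, out):
    """cson: the element count goes first, then elements stream straight out."""
    out.write(f"{count or ''}[")
    for n in items:
        out.write(cson_int(n))


def write_tnetstring_array(items, out):
    """TNetStrings: the body must be built in full to learn its length."""
    body = io.StringIO()               # intermediate buffer, grows with the data
    for n in items:
        body.write(tns_int(n))
    text = body.getvalue()
    out.write(f"{len(text)}:{text}]")
```

Only write_tnetstring_array needs a buffer proportional to the size of
the encoded array; the other two write each element as soon as it
arrives.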

And having paid the heavy cost of building the output, you get
no special benefit on input.  Yes, you can preallocate
*strings*, but since you are never told the size of *arrays* or
*objects*, you cannot preallocate them.

What then is this "cson"?

It is the TNetString approach with three simple fixes:
(1) Type information goes at the BEGINNING, not the end.
(2) Sizes for arrays and objects are the element counts of the
    arrays and objects, NOT the character counts (still less
    the byte counts) of the strings that represent them,
    so you can preallocate and stream input.  For example,
    given a path to an item, you could decode just that item
    without allocating *any* space for unwanted stuff -- though
    you would still have to scan past the unwanted stuff,
    decoding it as you go.
(3) Floats are represented as integers times a power of 2
    so that they can easily be transported without rounding.
The output is a byte sequence; where characters appear they
are to be encoded using UTF-8.

	<number>"+"			a positive integer
	<number>"-"			a negative integer
	<number>"*"<number>"+"		a positive float with positive exponent
	<number>"*"<number>"-"  	a negative float with positive exponent
	<number>"/"<number>"+"  	a positive float with negative exponent
	<number>"/"<number>"-"  	a negative float with negative exponent
	<number>">"			positive infinity or NaN
	<number>"<"			negative infinity or NaN
	<number>"#"			false
	<number>"="			true
	<number>"!"			null
	<number>'"'<chars>		a Unicode string
	<number>"["(<item>)*		a sequence
	<number>"{"(<key><item>)*	a dictionary

where <number> is a possibly empty sequence of decimal digits
with no leading zeros.  (In particular, 0 -> "+", not "0+".)
This means that null, false, true come out as "!", "#", "=".
For extra compactness, numbers could be encoded in base 64,
using the digits 0-9A-Za-z$@,
but experimentally, that only saves a couple of percent.
In the table above, the cson column uses decimal encoding and
the c64 column uses base 64.
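
A decoder for the grammar above fits in a few dozen lines.  The
following Python sketch is one possible reading of it, not a reference
implementation.  Assumptions not stated in the text: numbers are
decimal, the count in front of a string is the number of UTF-8 bytes,
and every NaN is decoded to the same value.  The names read_cson and
_read_number are invented here.

```python
import io
import math


def _read_number(stream):
    """Read decimal digits up to the next non-digit; return (value, that byte)."""
    digits = b""
    while True:
        ch = stream.read(1)
        if not ch.isdigit():           # also true at end of input (ch == b"")
            return (int(digits) if digits else 0), ch
        digits += ch


def read_cson(stream):
    """Decode one item from a binary stream, consuming exactly its bytes."""
    n, tag = _read_number(stream)      # the type code arrives before the body
    if tag == b"+":
        return n
    if tag == b"-":
        return -n
    if tag in (b"*", b"/"):
        m, sign = _read_number(stream)
        if sign not in (b"+", b"-"):
            raise ValueError(f"bad float sign {sign!r}")
        value = math.ldexp(n, m if tag == b"*" else -m)   # n * 2**(+/-m)
        return -value if sign == b"-" else value
    if tag == b">":
        return math.inf if n == 0 else math.nan
    if tag == b"<":
        return -math.inf if n == 0 else math.nan
    if tag == b"#":
        return False
    if tag == b"=":
        return True
    if tag == b"!":
        return None
    if tag == b'"':
        return stream.read(n).decode("utf-8")   # ASSUMPTION: n counts bytes
    if tag == b"[":
        items = [None] * n             # element count known: preallocate
        for i in range(n):
            items[i] = read_cson(stream)
        return items
    if tag == b"{":
        result = {}
        for _ in range(n):
            key = read_cson(stream)
            result[key] = read_cson(stream)
        return result
    raise ValueError(f"unknown type code {tag!r}")
```

For example, read_cson(io.BytesIO(b'3[1+2-5"hello')) decodes a
three-element sequence while reading the input strictly left to right.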

There is no restriction here that the keys of an "object" can be only
strings; that's for a higher level protocol to decide.  If sender and
receiver can both handle more general dictionaries, why not?

<n>*<m> stands for <n>*(2**<m>) and <n>/<m> stands for <n>/(2**<m>).
The representation is unique:  either <n> is empty and <m> is
empty (using "*+" for +0.0 and "*-" for -0.0) or <n> is an odd
integer. ">" and "<" are +infinity and -infinity respectively;
other leading numbers indicate NaNs.
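
The odd-integer-times-power-of-2 form is easy to compute exactly.
Here is a Python sketch of the encoding side, assuming IEEE doubles
and decimal digits.  Two choices below are assumptions, since the text
does not settle them: a zero exponent is written with "*" rather than
"/", and every NaN is written as "1>".  The name cson_float is
invented.

```python
import math


def cson_float(x: float) -> str:
    """Encode a float as <n>*<m><sign> or <n>/<m><sign> with <n> odd."""
    if math.isnan(x):
        return "1>"                    # ASSUMPTION: any non-empty number marks a NaN
    if math.isinf(x):
        return ">" if x > 0 else "<"
    sign = "-" if math.copysign(1.0, x) < 0 else "+"   # also catches -0.0
    if x == 0:
        return "*" + sign
    # as_integer_ratio is exact and in lowest terms; for a float the
    # denominator is always a power of two.
    num, den = abs(x).as_integer_ratio()
    if den == 1:
        # Integral value: move factors of two into the exponent so that
        # the leading integer is odd.
        exp = 0
        while num % 2 == 0:
            num //= 2
            exp += 1
        return f"{num}*{exp or ''}{sign}"   # ASSUMPTION: exponent 0 uses "*"
    # den == 2**k with k >= 1, and num is odd because the ratio is reduced.
    return f"{num}/{den.bit_length() - 1}{sign}"
```

So 1.5 becomes "3/1+" (3 divided by 2**1) and -12.0 becomes "3*2-"
(3 times 2**2, negated).  No decimal rounding is involved at any
point.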

TNetStrings claims the following advantages:

1.  Trivial to parse in every language without making errors.

    FALSE.  To parse an array, you have to first read in the
    whole string, and then recursively decode it.  You can't
    decode as you go because you don't find out what it _is_
    until you read the end.

2.  Resistant to buffer overflows and other problems.

    MISLEADING.  Whether there can be buffer overflows depends
    on the implementation.  My JSON parser is not subject to
    buffer overflows, and I can't think why any competent programmer's
    JSON parser would be.

3.  Fast and low resource intensive.

    FALSE.

4.  Makes no assumptions about string contents and can store binary data
    without escaping or encoding them.

    UNCLEAR.  Dan Bernstein's netstrings proposal was specific to 8-bit
    characters.  It was unsuitable for transmitting text between 8-bit
    systems using different code pages.  TNetStrings says that all
    counts are *byte* counts and all data are *byte* sequences, which
    leaves the handling of Unicode text -- essential if this is to serve
    where JSON serves -- totally unclear.  If Unicode text is transmitted
    in UTF-8, then it *cannot* store binary data without escaping or
    encoding.  This may even be FALSE, because floating point numbers
    count as binary data in my book, and certainly have to be encoded in
    this format.  The idea of representing 1.23e-20 as
    0.0000000000000000000123 strikes me as gratuitously odd; you don't
    want to see 1.2345e300 written out that way!

5.  Backward compatible with original netstrings.

    TRUE, but so what?

6.  Transport agnostic, so it works with streams, messages, files,
    anything that's 8-bit clean.

    TRUEish.  There is an unstated assumption that we are dealing
    with *byte* streams, not *text* streams, which is what makes
    claim 4 almost certainly false.  In any case, JSON also has
    this property, and so does CSON.
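
The remark about positional notation under claim 4 is easy to check.
The sketch below uses Python's own float formatting as a stand-in; it
is an assumption that a TNetStrings writer would format floats in a
comparable way, and the variable names are invented.

```python
small = 1.23e-20
large = 1.2345e300

# Default positional formatting keeps six decimals, so the small value
# collapses to zero.
lossy = format(small, "f")

# With 22 decimals the three significant digits of this particular
# value survive; other values would need more.
padded = format(small, ".22f")

# The large value needs one character per decimal digit of its magnitude.
huge = format(large, "f")

if __name__ == "__main__":
    print(lossy)
    print(padded)
    print(len(huge))
```

The writer must either pick a fixed number of decimals and lose small
values, or emit hundreds of characters for a single number.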

CSON claims the following advantages:

1.  Easy to generate, including streaming output as long as you
    know array/object element counts when you start to write,
    creating *no* intermediate data structures.

2.  Easy to read, creating *no* intermediate data structures.

3.  Handles Unicode.

4.  Handles floating point numbers precisely as long as the
    receiver is able to hold the numbers.

5.  Encoded data are byte streams, just like JSON.

and the following disadvantage:

6.  You can skip an unwanted item without *allocating* it but
    not without *decoding* it.
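
Point 6 can be illustrated with a skipping routine.  This Python
sketch walks over one item without building any value for it.
Assumptions not fixed by the description above: numbers are decimal,
a string's count is its length in UTF-8 bytes, and the stream is
seekable.  The names skip_cson and _read_number are invented.

```python
import io


def _read_number(stream):
    """Read decimal digits up to the next non-digit; return (value, that byte)."""
    value = 0
    while True:
        ch = stream.read(1)
        if not ch.isdigit():           # also true at end of input (ch == b"")
            return value, ch
        value = value * 10 + (ch[0] - 48)   # 48 is ord("0")


def skip_cson(stream):
    """Advance past one item; nested items are walked but never stored."""
    n, tag = _read_number(stream)
    if tag in (b"+", b"-", b">", b"<", b"#", b"=", b"!"):
        return                         # nothing follows the type code
    if tag in (b"*", b"/"):
        _, sign = _read_number(stream) # exponent digits, then the sign
        if sign not in (b"+", b"-"):
            raise ValueError(f"bad float sign {sign!r}")
        return
    if tag == b'"':
        stream.seek(n, io.SEEK_CUR)    # ASSUMPTION: n counts bytes
        return
    if tag == b"[":
        for _ in range(n):             # each element still has to be decoded
            skip_cson(stream)
        return
    if tag == b"{":
        for _ in range(2 * n):         # n keys and n values
            skip_cson(stream)
        return
    raise ValueError(f"unknown type code {tag!r}")
```

Strings can be jumped over in one step, but a sequence or dictionary
has to be visited element by element, because its count says how many
items follow, not how many bytes.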