[erlang-questions] binary typed schema-less protocol
Richard A. O'Keefe
ok@REDACTED
Mon Jul 29 03:10:48 CEST 2013
On 26/07/2013, at 10:55 PM, Motiejus Jakštys wrote:
> Found it.
>
> http://tnetstrings.org/
May I suggest *not* using that approach?
Let's start with size.  I have a small JSON test collection.
Here are the sizes that result from reading each element of that
collection and writing it back out in each of these forms:
  json - JSON with a newline after every comma and a space after
         every colon, but no indentation
  xml  - using <n/>, <t/>, <f/>, <v>number</v>, <s>string</s>,
         <a>e1 ... en</a> or <d>e1 ... en</d>, with the key
         strings appearing as key='..' attributes on the children
         of <d> elements
  cson - described below, using base 10 for integers
  c64  - same as cson, but using base 64 for integers
  tns  - using the encoding described at tnetstrings.org
    json    xml   cson    c64    tns
       9     31      8      8     16
     385    470    341    336    387
     201    273    167    165    204
     428    545    362    350    422
    2864   3262   2676   2550   2903
     680    905    543    534    659
      25     57     22     17     34
     258    356    204    204    249
      33     48     27     27     33
     529    715    421    410    515
    1192   1621    939    911   1161
     192    255    160    154    191
    2425   3289   1853   1807   2378
      23     30     20     12     26
      23     30     20     12     26
      32     52     13     12     39
      50     82     37     37     50
      42    109     35     26     61
       7     10      5      5      6
I'm rather pleased that an encoding (cson) that I threw together
in a couple of minutes handily beats the rest, but not surprised.
There is a simple blunder in the TNetStrings design that causes
serious inefficiency if you try to transport nontrivial
data that way: the "type" code is at the wrong end.
You have to read an entire object before you can start
decoding it, which is just plain silly. _And_ it is hard
to transmit floating-point numbers accurately.
Not only that, you cannot stream the output.  There is a
JsonGenerator interface recently added to Java so that you can
stream large amounts of data out without actually having to
hold much in memory; this is a need people genuinely have.
Let's just look at the output code for three techniques, taken
from my Smalltalk library. To keep it simple, let's just look
at arrays.
    printJsonOn: aStream
        aStream nextPut: $[.
        self do: [:each | each printJsonOn: aStream]
             separatedBy: [aStream nextPut: $,; cr].
        aStream nextPut: $].
Output goes directly to the output stream with NO
intermediate objects created.  You can stream this
without knowing the size of the virtual array until
the end.
    printCsonOn: aStream
        self size printOn: aStream.
        aStream nextPut: $[.
        self do: [:each | each printCsonOn: aStream].
Output goes directly to the output stream with NO
intermediate objects created.  You can stream this
as long as you know the size of the virtual array
at the beginning.
    printTNetStringOn: aStream
        |s|
        s := StringBuffer new: self size * 6.
        self do: [:each | each printTNetStringOn: s].
        s size printOn: aStream.
        aStream nextPut: $:; nextPutAll: s; nextPut: $].
OUCH!  You have to convert every element to a string,
concatenate them, then write the size of the *string*
(not the *array*), and then the string.  The whole
thing has to be held in memory as a string.  You cannot
stream this.
And having paid the heavy cost of building the output, you get
no special benefit from the input. Yes, you can preallocate
*strings*, but since you are never told the size of *arrays* or
*objects*, you cannot preallocate them.
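The same point can be seen in a rough Python sketch of a TNetStrings
encoder.  The type tags are the ones listed at tnetstrings.org; the
function name tns_dump is hypothetical, and encoding Python text as
UTF-8 is an assumption, since the format itself only speaks of bytes.

```python
def tns_dump(value) -> bytes:
    """Encode a value as <byte length>:<payload><type tag>."""
    if value is None:
        payload, tag = b"", b"~"
    elif isinstance(value, bool):            # must come before the int test
        payload, tag = (b"true" if value else b"false"), b"!"
    elif isinstance(value, int):
        payload, tag = str(value).encode("ascii"), b"#"
    elif isinstance(value, str):
        # Assumption: text is sent as UTF-8 bytes.
        payload, tag = value.encode("utf-8"), b","
    elif isinstance(value, list):
        # Every element must be fully encoded and concatenated before
        # the length prefix can be written, so nothing can be streamed.
        payload, tag = b"".join(tns_dump(v) for v in value), b"]"
    else:
        raise TypeError(f"unsupported type: {type(value).__name__}")
    return str(len(payload)).encode("ascii") + b":" + payload + tag


print(tns_dump([1, 2, "abc"]))   # b'14:1:1#1:2#3:abc,]'
```

The leading 14 is the byte length of the encoded payload, not the
element count 3, so a reader learns nothing about how big a list to
allocate.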
What then is this "cson"?
It is the TNetString approach with three simple fixes:
(1) Type information goes at the BEGINNING, not the end.
(2) Sizes for arrays and objects are the element counts of the
    arrays and objects, NOT the character counts (still less
    the byte counts) of the strings that represent them,
    so you can preallocate and stream input.  For example,
    given a path to an item, you could decode just that item
    without allocating *any* space for unwanted stuff -- though
    you would still have to decode it.
(3) Floats are represented as integers times a power of 2
    so that they can easily be transported without rounding.
The output is a byte sequence; where characters appear they
are to be encoded using UTF-8.
    <number>"+"                  a positive integer
    <number>"-"                  a negative integer
    <number>"*"<number>"+"       a positive float with positive exponent
    <number>"*"<number>"-"       a negative float with positive exponent
    <number>"/"<number>"+"       a positive float with negative exponent
    <number>"/"<number>"-"       a negative float with negative exponent
    <number>">"                  positive infinity or NaN
    <number>"<"                  negative infinity or NaN
    <number>"#"                  false
    <number>"="                  true
    <number>"!"                  null
    <number>'"'<chars>           a Unicode string
    <number>"["(<item>)*         a sequence
    <number>"{"(<key><item>)*    a dictionary
where <number> is a possibly empty sequence of decimal digits
with no leading zeros. (In particular, 0 -> "+", not "0+".)
This means that null, false, true come out as "!", "#", "=".
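A minimal Python sketch of an encoder for the grammar above, leaving
floats out.  The name cson_write is hypothetical.  Two things are
assumptions rather than anything stated here: that the count in front
of a string is its number of characters, and that writing to a text
stream which is then encoded as UTF-8 is an acceptable way to produce
the bytes.

```python
import io


def _count(n: int) -> str:
    """A count with no leading zeros; zero is the empty digit sequence."""
    return str(n) if n else ""


def cson_write(value, out) -> None:
    """Write `value` to the text stream `out`, tag first, no buffering."""
    if value is None:
        out.write("!")
    elif value is True:
        out.write("=")
    elif value is False:
        out.write("#")
    elif isinstance(value, int):
        out.write(_count(abs(value)) + ("-" if value < 0 else "+"))
    elif isinstance(value, str):
        # Assumption: the count is the number of characters.
        out.write(_count(len(value)) + '"' + value)
    elif isinstance(value, (list, tuple)):
        out.write(_count(len(value)) + "[")     # element count, up front
        for item in value:
            cson_write(item, out)               # streamed one by one
    elif isinstance(value, dict):
        out.write(_count(len(value)) + "{")
        for key, item in value.items():
            cson_write(key, out)
            cson_write(item, out)
    else:
        raise TypeError(f"unsupported type: {type(value).__name__}")


buf = io.StringIO()
cson_write([1, -2, "abc", None, True, {"k": 0}], buf)
print(buf.getvalue())   # 6[1+2-3"abc!=1{1"k+
```

Nothing is written after an item's children, so each element can go
straight to the output as soon as the container's count is known.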
For extra compactness, numbers could be encoded in base 64,
using the digits 0-9A-Za-z$@,
but experimentally, that only saves a couple of percent.
In the table above, the cson column uses decimal encoding and
the c64 column uses base 64.
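A small Python sketch of the base 64 variant.  It assumes the digits
take the values 0 to 63 in the order listed (0-9, A-Z, a-z, $, @) and
are written most significant first; neither detail is spelled out
here.  The names DIGITS64 and to_base64_digits are hypothetical.

```python
DIGITS64 = (
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "$@"
)


def to_base64_digits(n: int) -> str:
    """Render a non-negative integer in base 64; zero is the empty string."""
    if n < 0:
        raise ValueError("the sign is carried by the type tag, not the digits")
    digits = []
    while n:
        n, r = divmod(n, 64)
        digits.append(DIGITS64[r])
    return "".join(reversed(digits))


print(to_base64_digits(1000))   # Fe
```

Small counts and small integers already fit in one or two decimal
digits, which is consistent with the modest saving.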
There is no restriction here that the keys of an "object" can be only
strings; that's for a higher level protocol to decide. If sender and
receiver can both handle more general dictionaries, why not?
<n>*<m> stands for <n>*(2**<m>) and <n>/<m> stands for <n>/(2**<m>).
The representation is unique: either <n> is empty and <m> is
empty (using "*+" for +0.0 and "*-" for -0.0) or <n> is an odd
integer. ">" and "<" are +infinity and -infinity respectively;
other leading numbers indicate NaNs.
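A Python sketch of the float rule, using float.as_integer_ratio() to
get the exact value as a fraction whose denominator is a power of 2.
The name cson_float is hypothetical.  Two choices are assumptions:
a zero exponent is written with "*" and an empty <m>, and NaN is
rejected because no particular leading number for it is given here.

```python
import math


def cson_float(x: float) -> str:
    """Encode x exactly as <n>*<m> or <n>/<m> with odd <n>, then a sign."""
    if math.isnan(x):
        raise ValueError("no NaN payload convention is assumed in this sketch")
    if math.isinf(x):
        return ">" if x > 0 else "<"
    sign = "-" if math.copysign(1.0, x) < 0 else "+"   # also sees -0.0
    if x == 0:
        return "*" + sign
    num, den = abs(x).as_integer_ratio()   # exact, lowest terms, den == 2**k
    m = den.bit_length() - 1
    op = "/"
    if m == 0:
        # Integral value: move factors of two out of the numerator.
        op = "*"
        m = (num & -num).bit_length() - 1  # number of trailing zero bits
        num >>= m
    return f"{num}{op}{m if m else ''}{sign}"


print(cson_float(0.75))    # 3/2+
print(cson_float(-6.0))    # 3*1-
print(cson_float(-0.0))    # *-
```

Because the fraction is exact, the receiver can rebuild the same
double with a single scaling by a power of 2, with no decimal
rounding in either direction.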
TNetStrings claims the following advantages:
1. Trivial to parse in every language without making errors.
    FALSE.  To parse an array, you have to first read in the
    whole string, and then recursively decode it.  You can't
    decode as you go because you don't find out what it _is_
    until you read the end.
2. Resistant to buffer overflows and other problems.
    MISLEADING.  Whether there can be buffer overflows depends
    on the implementation.  My JSON parser is not subject to
    buffer overflows, and I can't think why any competent programmer's
    JSON parser would be.
3. Fast and low resource intensive.
    FALSE.
4. Makes no assumptions about string contents and can store binary data
   without escaping or encoding them.
    UNCLEAR.  Dan Bernstein's netstrings proposal was specific to 8-bit
    characters.  It was unsuitable for transmitting text between 8-bit
    systems using different code pages.  TNetStrings says that all
    counts are *byte* counts and all data are *byte* sequences, which
    leaves the handling of Unicode text -- essential if this is to serve
    where JSON serves -- totally unclear.  If Unicode text is transmitted
    in UTF-8, then it *cannot* store binary data without escaping or
    encoding.  This may even be FALSE, because floating point numbers
    count as binary data in my book, and certainly have to be encoded in
    this format.  The idea of representing 1.23e-20 as
    0.0000000000000000000123 strikes me as gratuitously odd; you don't
    want to see 1.2345e300 written out that way!
5. Backward compatible with original netstrings.
    TRUE, but so what?
6. Transport agnostic, so it works with streams, messages, files,
   anything that's 8-bit clean.
    TRUEish.  There is an unstated assumption that we are dealing
    with *byte* streams, not *text* streams, which is what makes
    claim 4 almost certainly false.  In any case, JSON also has
    this property, and so does CSON.
CSON claims the following advantages:
1. Easy to generate, including streaming output as long as you
   know array/object element counts when you start to write,
   creating *no* intermediate data structures.
2. Easy to read, creating *no* intermediate data structures.
3. Handles Unicode.
4. Handles floating point numbers precisely as long as the
   receiver is able to hold the numbers.
5. Encoded data are byte streams, just like JSON.
and the following disadvantage:
6. You can skip an unwanted item without *allocating* it but
   not without *decoding* it.
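Points 2 and 6 can be sketched with a small recursive reader in
Python.  The names cson_read and keep are hypothetical.  The reader
works on a text stream (for real input, a UTF-8 decoding wrapper over
the byte stream) and assumes that the count in front of a string is a
character count.  With keep=False it walks an item without building
any list or dictionary for it, but it still has to read every
character.

```python
import io
import math


def _number(src):
    """Read a possibly empty run of decimal digits; return (value, next char)."""
    n, ch = 0, src.read(1)
    while "0" <= ch <= "9":            # false at end of input, where ch == ""
        n, ch = n * 10 + int(ch), src.read(1)
    return n, ch


def cson_read(src, keep=True):
    """Decode one item; with keep=False containers come back as None."""
    n, tag = _number(src)
    if tag == "+":
        return n
    if tag == "-":
        return -n
    if tag in ("*", "/"):
        m, sign = _number(src)
        if sign not in ("+", "-"):
            raise ValueError("a float must end in + or -")
        x = math.ldexp(n, m if tag == "*" else -m)   # n * 2**(+m or -m)
        return -x if sign == "-" else x
    if tag in (">", "<"):
        if n:
            return math.nan
        return math.inf if tag == ">" else -math.inf
    if tag in ("#", "=", "!"):
        return {"#": False, "=": True, "!": None}[tag]
    if tag == '"':
        return src.read(n)             # assumption: n counts characters
    if tag == "[":
        items = [] if keep else None
        for _ in range(n):
            item = cson_read(src, keep)
            if keep:
                items.append(item)
        return items
    if tag == "{":
        table = {} if keep else None
        for _ in range(n):
            key = cson_read(src, keep)   # keys must be hashable in Python
            item = cson_read(src, keep)
            if keep:
                table[key] = item
        return table
    raise ValueError(f"unexpected tag {tag!r}")


print(cson_read(io.StringIO('6[1+2-3"abc!=1{1"k+')))
# [1, -2, 'abc', None, True, {'k': 0}]
```

Since every item starts with its count and tag, the reader always
knows what it is about to decode before it reads the contents.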