[erlang-questions] binary typed schema-less protocol

Tony Rogvall tony@REDACTED
Mon Jul 29 13:55:37 CEST 2013


I am also pretty pleased with cson :-)
Looks very nice.

A small question.
How are strings encoded?

a)  <utf8-octet-string-length>"<utf8-chars> 
b)  <number-of-unicode-chars>"<integers>*
c)  <number-of-unicode-chars>"<utf8-char>*
d) Other?
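
To make options (a) and (c) concrete, here is a minimal Python sketch. The function names are hypothetical, and writing a zero count as the empty string follows the <number> rule in the quoted message below; nothing here is confirmed as the actual cson string rule. The two options differ only in what the leading count measures.

```python
def _num(n: int) -> bytes:
    # Decimal digits, no leading zeros; zero is written as the empty string.
    return str(n).encode("ascii") if n else b""

def encode_option_a(text: str) -> bytes:
    """Option (a): the count is the number of UTF-8 octets that follow."""
    data = text.encode("utf-8")
    return _num(len(data)) + b'"' + data

def encode_option_c(text: str) -> bytes:
    """Option (c): the count is the number of Unicode code points."""
    # len() on a Python str counts code points, not bytes.
    return _num(len(text)) + b'"' + text.encode("utf-8")

name = "Jakštys"                  # 7 code points, 8 UTF-8 octets
print(encode_option_a(name))      # count prefix is 8
print(encode_option_c(name))      # count prefix is 7
```

With (a) a reader can grab the whole string in one read; with (c) it has to walk the UTF-8 sequence to find the end.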

Thanks

/Tony

On 29 jul 2013, at 03:10, Richard A. O'Keefe <ok@REDACTED> wrote:

> 
> On 26/07/2013, at 10:55 PM, Motiejus Jakštys wrote:
>> Found it.
>> 
>> http://tnetstrings.org/
> 
> May I suggest *not* using that approach?
> 
> Let's start with size.  I have a small JSON test collection.
> Here's the result of reading each element of that collection
> and writing it back out in each of these formats:
>  json - writing as JSON with a newline after every comma and
>         a space after every colon, but no indentation
>  xml  - using <n/>, <t/>, <f/>, <v>number</v>, <s>string</s>,
>         <a>e1 ... en</a> or <d>e1 ... en</d>, with the key
>         strings appearing as key='..' attributes on the children
>         of <d> elements
>  cson - described below, using base 10 for integers
>  c64  - same as cson, but using base 64 for integers
>  tns  - using the encoding described at tnetstrings.org
> 
> 
> json	xml	cson	c64	tns
> 9	31	8	8	16
> 385	470	341	336	387
> 201	273	167	165	204
> 428	545	362	350	422
> 2864	3262	2676	2550	2903
> 680	905	543	534	659
> 25	57	22	17	34
> 258	356	204	204	249
> 33	48	27	27	33
> 529	715	421	410	515
> 1192	1621	939	911	1161
> 192	255	160	154	191
> 2425	3289	1853	1807	2378
> 23	30	20	12	26
> 23	30	20	12	26
> 32	52	13	12	39
> 50	82	37	37	50
> 42	109	35	26	61
> 7	10	5	5	6
> 
> 
> I'm rather pleased that an encoding (cson) that I threw together
> in a couple of minutes handily beats the rest, but not surprised.
> 
> There is a simple blunder in the TNetStrings design that causes
> serious inefficiency if you try to transport nontrivial
> data that way:  the "type" code is at the wrong end.
> 
> You have to read an entire object before you can start
> decoding it, which is just plain silly.  _And_ it is hard
> to transmit floating-point numbers accurately.
> 
> Not only that, you cannot stream the output.  There is a
> JsonGenerator class recently added to Java so that you can
> stream large amounts of data out without actually having to
> hold much in memory; this is a need people genuinely have.
> 
> Let's just look at the output code for three techniques, taken
> from my Smalltalk library.  To keep it simple, let's just look
> at arrays.
> 
>    printJsonOn: aStream
>      aStream nextPut: $[.
>      self do: [:each | each printJsonOn: aStream]
>           separatedBy: [aStream nextPut: $,; cr].
>      aStream nextPut: $].
> 
> 
> 	Output goes directly to the output stream with NO
> 	intermediate objects created.  You can stream this
> 	without knowing the size of the virtual array until
> 	the end.
> 
>    printCsonOn: aStream
>      self size printOn: aStream.
>      aStream nextPut: $[.
>      self do: [:each | each printCsonOn: aStream].
> 
> 	Output goes directly to the output stream with NO
> 	intermediate objects created.  You can stream this
> 	as long as you know the size of the virtual array
> 	at the beginning.
> 
>    printTNetStringOn: aStream
>      |s|
>      s := StringBuffer new: self size * 6.
>      self do: [:each | each printTNetStringOn: s].
>      s size printOn: aStream.
>      aStream nextPut: $:; nextPutAll: s; nextPut: $].
> 
> 	OUCH!  You have to convert every element to a string,
> 	concatenate them, then write the size of the *string*
> 	(not the *array*), and then the string.  The whole
> 	thing has to be held in memory as a string.  You cannot
> 	stream this.
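
The same buffering shows up in any language. Below is a minimal Python sketch of a TNetStrings writer, using the type codes from tnetstrings.org. The function name is hypothetical, floats and dictionaries are left out for brevity, and encoding text as UTF-8 is an assumption, since the format only speaks of bytes.

```python
def tns_dump(value) -> bytes:
    """Encode value as <byte-length>:<payload><type-code>."""
    if value is None:
        payload, tag = b"", b"~"
    elif isinstance(value, bool):      # test before int: bool is an int subclass
        payload, tag = (b"true" if value else b"false"), b"!"
    elif isinstance(value, int):
        payload, tag = str(value).encode("ascii"), b"#"
    elif isinstance(value, str):
        payload, tag = value.encode("utf-8"), b","   # assumption: text as UTF-8
    elif isinstance(value, list):
        # Every element must be fully encoded and concatenated before the
        # length prefix of the list itself can be written.
        payload, tag = b"".join(tns_dump(v) for v in value), b"]"
    else:
        raise TypeError(f"unsupported type: {type(value).__name__}")
    return str(len(payload)).encode("ascii") + b":" + payload + tag

print(tns_dump([1, 2]))   # b'8:1:1#1:2#]'
```

The prefix 8 is the byte length of the encoded elements, not the element count, which is why nothing can be emitted until the whole list has been built in memory.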
> 
> And having paid the heavy cost of building the output, you get
> no special benefit when reading the input.  Yes, you can preallocate
> *strings*, but since you are never told the size of *arrays* or
> *objects*, you cannot preallocate them.
> 
> What then is this "cson"?
> 
> It is the TNetString approach with three simple fixes:
> (1) Type information goes at the BEGINNING, not the end.
> (2) Sizes for arrays and objects are the element counts of the
>    arrays and objects, NOT the character counts (still less
>    the byte counts) of the strings that represent them,
>    so you can preallocate and stream input.  For example,
>    given a path to an item, you could decode just that item
>    without allocating *any* space for unwanted stuff -- though
>    you would still have to decode the unwanted items in order
>    to skip past them.
> (3) Floats are represented as integers times a power of 2
>    so that they can easily be transported without rounding.
> The output is a byte sequence; where characters appear they
> are to be encoded using UTF-8.
> 
> 	<number>"+"			a positive integer
> 	<number>"-"			a negative integer
> 	<number>"*"<number>"+"		a positive float with positive exponent
> 	<number>"*"<number>"-"  	a negative float with positive exponent
> 	<number>"/"<number>"+"  	a positive float with negative exponent
> 	<number>"/"<number>"-"  	a negative float with negative exponent
> 	<number>">"			positive infinity or NaN
> 	<number>"<"			negative infinity or NaN
> 	<number>"#"			false
> 	<number>"="			true
> 	<number>"!"			null
> 	<number>"<chars>		a Unicode string
> 	<number>"["(<item>)*		a sequence
> 	<number>"{"(<key><item>)*	a dictionary
> 
> where <number> is a possibly empty sequence of decimal digits
> with no leading zeros.  (In particular, 0 -> "+", not "0+".)
> This means that null, false, true come out as "!", "#", "=".
> For extra compactness, numbers could be encoded in base 64,
> using the digits 0-9A-Za-z$@,
> but experimentally, that only saves a couple of percent.
> The cson column above uses decimal encoding; the c64 column
> uses base 64.
> 
> There is no restriction here that the keys of an "object" can be only
> strings; that's for a higher level protocol to decide.  If sender and
> receiver can both handle more general dictionaries, why not?
> 
> <n>*<m> stands for <n>*(2**<m>) and <n>/<m> stands for <n>/(2**<m>).
> The representation is unique:  either <n> is empty and <m> is
> empty (using "*+" for +0.0 and "*-" for -0.0) or <n> is an odd
> integer. ">" and "<" are +infinity and -infinity respectively;
> other leading numbers indicate NaNs.
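
A minimal Python sketch of a writer for the grammar above, offered as one reading of the description rather than as the author's implementation. All names are hypothetical. Two points are assumptions: the count in front of a string is taken to be the number of Unicode code points, and NaN is written as "1>", since the text only says that a nonempty leading number marks a NaN.

```python
import io
import math

def _num(n: int) -> bytes:
    # Decimal digits with no leading zeros; zero is the empty string.
    return str(n).encode("ascii") if n else b""

def _write_float(x: float, out) -> None:
    sign = b"-" if math.copysign(1.0, x) < 0 else b"+"
    if math.isnan(x):
        out.write(b"1>")                      # assumption: any nonempty number
    elif math.isinf(x):
        out.write(b">" if x > 0 else b"<")
    elif x == 0.0:
        out.write(b"*" + sign)                # "*+" is +0.0, "*-" is -0.0
    else:
        num, den = abs(x).as_integer_ratio()  # lowest terms, den is a power of 2
        if den == 1:
            shift = (num & -num).bit_length() - 1   # trailing zero bits of num
            out.write(_num(num >> shift) + b"*" + _num(shift) + sign)
        else:                                 # num is already odd here
            out.write(_num(num) + b"/" + _num(den.bit_length() - 1) + sign)

def write_cson(value, out) -> None:
    """Write value straight to a binary stream; nothing is buffered."""
    if value is None:
        out.write(b"!")
    elif isinstance(value, bool):             # test before int
        out.write(b"=" if value else b"#")
    elif isinstance(value, int):
        out.write(_num(abs(value)) + (b"-" if value < 0 else b"+"))
    elif isinstance(value, float):
        _write_float(value, out)
    elif isinstance(value, str):
        # Assumption: the count is in code points, the payload is UTF-8.
        out.write(_num(len(value)) + b'"' + value.encode("utf-8"))
    elif isinstance(value, (list, tuple)):
        out.write(_num(len(value)) + b"[")    # element count goes first
        for item in value:
            write_cson(item, out)
    elif isinstance(value, dict):
        out.write(_num(len(value)) + b"{")
        for key, item in value.items():
            write_cson(key, out)
            write_cson(item, out)
    else:
        raise TypeError(f"unsupported type: {type(value).__name__}")

def cson(value) -> bytes:
    buf = io.BytesIO()
    write_cson(value, buf)
    return buf.getvalue()

print(cson({"a": [0, 6.0, -0.75]}))   # b'1{1"a3[+3*1+3/2-'
```
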
> 
> 
> TNetStrings claims the following advantages:
> 
> 1.  Trivial to parse in every language without making errors.
> 
>    FALSE.  To parse an array, you have to first read in the
>    whole string, and then recursively decode it.  You can't
>    decode as you go because you don't find out what it _is_
>    until you read the end.
> 
> 2.  Resistant to buffer overflows and other problems.
> 
>    MISLEADING.  Whether there can be buffer overflows depends
>    on the implementation.  My JSON parser is not subject to
>    buffer overflows, and I can't think why any competent programmer's
>    JSON parser would be.
> 
> 3.  Fast and low resource intensive.
> 
>    FALSE.
> 
> 4.  Makes no assumptions about string contents and can store binary data
>    without escaping or encoding them.
> 
>    UNCLEAR.  Dan Bernstein's netstrings proposal was specific to 8-bit
>    characters.  It was unsuitable for transmitting text between 8-bit
>    systems using different code pages.  TNetStrings says that all
>    counts are *byte* counts and all data are *byte* sequences, which
>    leaves the handling of Unicode text -- essential if this is to serve
>    where JSON serves -- totally unclear.  If Unicode text is transmitted
>    in UTF-8, then it *cannot* store binary data without escaping or
>    encoding.  This may even be FALSE, because floating point numbers
>    count as binary data in my book, and certainly have to be encoded in
>    this format.  The idea of representing 1.23e-20 as
>    0.0000000000000000000123 strikes me as gratuitously odd; you don't
>    want to see 1.2345e300 !
> 
> 5.  Backward compatible with original netstrings.
> 
>    TRUE, but so what?
> 
> 6.  Transport agnostic, so it works with streams, messages, files,
>    anything that's 8-bit clean.
> 
>    TRUEish.  There is an unstated assumption that we are dealing
>    with *byte* streams, not *text* streams, which is what makes
>    claim 4 almost certainly false.  In any case, JSON also has
>    this property, and so does CSON.
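
On the floating-point remark under claim 4: a small Python illustration of what plain positional decimal notation costs at extreme magnitudes, assuming (as the message implies) that floats are written without an exponent. The variable names are just for illustration.

```python
# Positional ("f") formatting of a very small and a very large double.
small = format(1.23e-20, ".22f")   # 22 decimals are needed to reach the digits
big = format(1.2345e300, "f")      # default precision: 6 decimals

print(small)       # 0.0000000000000000000123
print(len(big))    # 308 characters: 301 integer digits, ".", 6 decimals
```
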
> 
> CSON claims the following advantages:
> 
> 1.  Easy to generate, including streaming output as long as you
>    know array/object element counts when you start to write,
>    creating *no* intermediate data structures.
> 
> 2.  Easy to read, creating *no* intermediate data structures.
> 
> 3.  Handles Unicode.
> 
> 4.  Handles floating point numbers precisely as long as the
>    receiver is able to hold the numbers.
> 
> 5.  Encoded data are byte streams, just like JSON.
> 
> and the following disadvantage:
> 
> 6.  You can skip an unwanted item without *allocating* it but
>    not without *decoding* it.
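
Point 6 can be sketched in a few lines of Python. This is a hypothetical skipper for the grammar above, with decimal counts and the assumption that string counts are Unicode code points carried as UTF-8. It allocates nothing for the item it skips, but it still has to walk every nested element to find the end.

```python
def _read_number(buf: bytes, pos: int):
    """Read a possibly empty run of decimal digits; empty means 0."""
    n = 0
    while pos < len(buf) and 48 <= buf[pos] <= 57:   # ASCII '0'..'9'
        n = n * 10 + (buf[pos] - 48)
        pos += 1
    return n, pos

def skip_item(buf: bytes, pos: int = 0) -> int:
    """Return the index just past the item that starts at buf[pos]."""
    count, pos = _read_number(buf, pos)
    tag = buf[pos:pos + 1]
    pos += 1
    if tag in (b"+", b"-", b">", b"<", b"#", b"=", b"!"):
        return pos                         # integers and constants end here
    if tag in (b"*", b"/"):                # float: exponent digits, then sign
        _, pos = _read_number(buf, pos)
        return pos + 1
    if tag == b'"':                        # assumption: count is code points
        for _ in range(count):
            pos += 1                       # lead byte of one code point
            while pos < len(buf) and buf[pos] & 0xC0 == 0x80:
                pos += 1                   # UTF-8 continuation bytes
        return pos
    if tag == b"[":
        for _ in range(count):
            pos = skip_item(buf, pos)
        return pos
    if tag == b"{":
        for _ in range(2 * count):         # each entry is a key and an item
            pos = skip_item(buf, pos)
        return pos
    raise ValueError(f"unexpected type code at offset {pos - 1}")

data = b'2[1+2+7-'
print(skip_item(data))   # 6: the two-element array is skipped, "7-" remains
```
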

"Installing applications can lead to corruption over time. Applications gradually write over each other's libraries, partial upgrades occur, user and system errors happen, and minute changes may be unnoticeable and difficult to fix"


