[erlang-questions] term_to_binary and large data structures

Aaron Seigo aseigo@REDACTED
Mon Jul 2 16:07:20 CEST 2018


On 2018-06-28 12:53, Jesper Louis Andersen wrote:

> On Wed, Jun 27, 2018 at 11:24 PM Aaron Seigo <aseigo@REDACTED> 
> wrote:
> 
>> In fact, for nearly any term you throw at it, this pretty simple
>> algorithm produces smaller serialized data. You can see the format
>> employed here:
>> 
>> https://github.com/aseigo/packer/blob/develop/FORMAT.md
> 
> That is also a format with different properties. The external term
> format doubles as an on-disk format where you might have the need to
> be robust against a few bad blocks.

There are ways to achieve that with schema-based formats as well, of
course... but, yes, as it currently stands the format above is not great
as a disk format, though I would argue that the external term format
(ETF?) :) is also rather lacking for on-disk use unless one adds to the
base format.

> Schema-based formats tend to be worse than bloaty-prefix-encoded 
> formats here. It
> probably hurts Elixir more since the map() type is the underlying 
> standard type for
> many things whereas a record in Erlang is a tuple with some compile 
> time expansion on
> top. In short, keys are not repeated in this format.

Yes, records have some benefits indeed. However, the problem is not just
maps. In fact, tuples are a big problem themselves. The ETF uses one
byte for the type code (104 for a tuple with 0-255 elements; 105 for
larger tuples) and either a 1-byte or a 4-byte integer containing the
arity. That is 1 byte more than necessary for small tuples, and 4 extra
bytes for larger tuples .. which is made odder by the fact that tuples
in the VM can only have 2^26 - 1 elements anyway; reading whole words is
theoretically faster, so perhaps that was the motivation to tack on the
extra bytes ...
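For reference, the two tuple headers look like this in Erlang bit
syntax (tags as documented in the official ETF spec):

   %% SMALL_TUPLE_EXT: 1-byte tag + 1-byte arity (0-255)
   <<104, Arity:8, Elements/binary>>
   %% LARGE_TUPLE_EXT: 1-byte tag + 4-byte big-endian arity
   <<105, Arity:32, Elements/binary>>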

Still, in the common case it takes 2 bytes to encode an empty (!) tuple,
and 2 bytes of overhead to encode a small (the common case, IME) tuple.
So a tuple like {1,2} requires 6 bytes to encode, and {1,2,3} takes 8
... with the compressed metadata approach this becomes 6 bytes for
either of those tuples. Stuff a list with a bunch of tuples and the
overhead adds up amazingly quickly: [{1, 2}, {3, 4}, {5, 6}] takes 25
bytes in the ETF, and 18 with the Packer library. Add one more similar
tuple and we end up with 31 (!) bytes for ETF, and 20 with Packer.
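Those counts are easy to verify in the shell (the list counts above
include term_to_binary's leading version byte, 131):

   1> term_to_binary({1,2}).
   <<131,104,2,97,1,97,2>>
   2> byte_size(term_to_binary([{1,2},{3,4},{5,6}])).
   25
   3> byte_size(term_to_binary([{1,2},{3,4},{5,6},{7,8}])).
   31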

There is a wide range of inefficiencies in the ETF, so hopefully we
don't get sidetracked by maps alone.

(In our use case, it was tuples killing us, not maps!)

> You might want to look into Joe Armstrong's UBF stack, in which you
> define the schema as a small stack engine and then interpret that
> engine to produce data. It serves as a hybrid in which you have a
> schema, but since the stack engine supports a duplicate-instruction,
> you can repeat keys in maps if they are the same and so on.

Repeating data is easily, and usually more efficiently, compressed using
a traditional compression tool. Packer uses zstd and it does a lovely
job of this. It is really the metadata encoding that matters here ...
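For comparison, ETF's own answer to repeated data is zlib compression
via term_to_binary/2:

   1> Terms = [{N, N * 2} || N <- lists:seq(1, 1000)].
   2> byte_size(term_to_binary(Terms)).
   10861
   3> byte_size(term_to_binary(Terms, [{compressed, 6}])) < 10861.
   true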

> If you want to have a header-schema, it is probably worth it to just 
> take Protobuf3
> and see how well that format handles the data.

There are 2 major problems with using this sort of solution:

  * none (that we are aware of) supports Erlang's range of data types,
tuples in particular. So you are back to being "on your own". I would
love to have had an off-the-shelf solution :)

  * we cannot (for reasons of practicality) always know the type of data
we will want to encode for a message, and so it is not possible to
create the sort of schema definitions that these systems want (and which
are an important source of their efficiency and safety).

So while they 100% make sense in the right use cases, they are not a 
good fit for a general purpose erlang term serializer.

> It has an encoding scheme, varints and ZigZag encoding, which
> represents integers in a way which makes small integers small in the
> data stream, and also compresses well. So for real-world data, this
> encoding tends to win.

Packer does the small integers thing as well :)
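The core trick is tiny, for the curious; here is a rough sketch of
Protobuf-style ZigZag + varint encoding in Erlang (function names are
mine, not from any library):

   %% ZigZag: map signed ints to unsigned so small magnitudes stay
   %% small: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
   zigzag(N) when N >= 0 -> N bsl 1;
   zigzag(N) -> ((bnot N) bsl 1) bor 1.

   %% varint: little-endian base-128, high bit set on every byte but
   %% the last
   varint(N) when N < 16#80 -> <<N>>;
   varint(N) ->
       Rest = varint(N bsr 7),
       <<(16#80 bor (N band 16#7f)), Rest/binary>>.

So encoding -1 costs a single byte: varint(zigzag(-1)) =:= <<1>>.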

> Ephemeral data transfer between nodes could benefit from having a new
> format, or an update which packs maps better.

... and tuples ;)

Given that Packer does an equivalent or better job in terms of the
resulting number of bytes used for nearly every single term (and most of
the remaining exceptions are addressable with simple additions, such as
type codes for the empty list and for lists of small integers, à la
ETF), it is clear it is not just maps.
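Those ETF special cases, for reference: NIL_EXT for the empty list and
STRING_EXT for lists of integers 0-255; Packer would want equivalents of
both:

   1> term_to_binary([]).
   <<131,106>>
   2> term_to_binary([1,2,3]).
   <<131,107,0,3,1,2,3>>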

> emulator/beam/external.c:BIF_RETTYPE term_to_binary_1(BIF_ALIST_1)

Cool; will look .. but the reason I wrote to the list first was in hopes
of finding out whether there was appetite for such an improvement. IOW:
are the odds of a quality implementation being upstreamed high enough to
warrant our effort there? :)

> * You must write your code so it can be preempted (see the trap 
> variants)

Yep ...

> * The distribution protocol is a bit different and has an atom-cache 
> for common atoms.

Yes, this is one place the distribution protocol has a leg up on Packer,
due to its access to the atom cache.

> I'm not entirely sure this is the entry-point for it

Even if not, messages for distribution should utilize that cache?

> * We need backwards compatibility. In many cases even small changes
> to this format have proven to be riddled with loss of compatibility.

My hopes go as follows:

Use versioning in the distribution header to indicate that a newer
serialization format is in use; fall back to the existing format when
that is not indicated. As there would be no change in the data types
expressible, this should only alter the representation of the data on
the wire.
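A rough sketch of the negotiation I have in mind (the capability flag
here is hypothetical, not an existing DFLAG):

   %% Hypothetical capability bit advertised during the dist handshake.
   -define(DFLAG_PACKED_TERMS, 16#800000).

   %% Use the new encoding only when both nodes advertise it;
   %% otherwise fall back to classic ETF.
   pick_encoding(LocalFlags, RemoteFlags) ->
       case LocalFlags band RemoteFlags band ?DFLAG_PACKED_TERMS of
           0 -> etf;
           _ -> packed
       end.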

> * We might want to rethink ordering properties of the produced binary. 
> I know this has
> been a want historically, but I'm not sure we should grant that wish :P

Do you have pointers to discussions on this so I can understand what you 
mean by "ordering properties" better? (Bug report, mailing list thread, 
etc.?)

> * For distribution: More plugability would be really cool to have.

Agreed :)

> Finally, as for the goal of distributing computation: distribute data 
> is still my
> advice. If data is distributed, computation distributes trivially.

Unfortunately for us, our compute is expensive relative to the cost of
the data. A small amount of data ends up creating a large amount of
computation, which produces other not-very-big sets of data, which in
turn kick off even more computation ... (I don't consider 100k pairs to
be particularly large, especially when that can be serialized to <2MB,
just as I don't think anyone seriously considers a 1GB SQL database to
be large; but maybe that's just my perspective :) The cost of
computation is such that we need many machines working together on
shared problems.

Our problem occupies a space somewhere between more typical Erlang use 
cases (messaging, e.g.) and scientific computing, as it is neither but 
has some similar properties to both.

As a result, for us it isn't enough to keep data local and move 
computation to the data.

> Better distribution formats just shaves a constant factor

Indeed; however, currently that is a MASSIVE constant. With the data we
have, it means the difference between 2.5mps (messages/second) on a
10GbE network versus 833mps. So the gap between the current ceiling and
the new ceiling would be massive. Long term, you are absolutely correct,
though, when you say:

> so you eventually hit the same bottleneck in the long run. The

Pre-computation sharding of the data space is an approach we use to
avoid, or at least delay well beyond our requirements, the onset of that
bottleneck ...
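A minimal sketch of the idea, assuming a fixed node list (the function
is illustrative, not from our code):

   %% Deterministically map a key to one node so related work always
   %% lands where its data already lives.
   shard_for(Key, Nodes) ->
       lists:nth(1 + erlang:phash2(Key, length(Nodes)), Nodes).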

> problem is that any kind of shared/copied data requires locking or 
> coordination.
> Anything involving those parallelizes badly.

This is not a property of the data in this application, other than the 
job queues, and that's only because we have those distributed for the 
purposes of durability. They also represent ~0% of the run time in the 
system, so the overhead there is ok.

Thanks for the insights there, though, as they are .. well .. insightful 
;) and bang-on for most applications.

Cheers,

--
Aaron


