[erlang-questions] term_to_binary and large data structures

Thu Jun 28 12:53:18 CEST 2018

On Wed, Jun 27, 2018 at 11:24 PM Aaron Seigo <aseigo@REDACTED> wrote:

> In fact, for nearly any term you throw at it, this pretty simple
> algorithm produces smaller serialized data. You can see the format
> employed here:
>
>       https://github.com/aseigo/packer/blob/develop/FORMAT.md
>
>
That is also a format with different properties. The external term format
doubles as an on-disk format where you might have the need to be robust
against a few bad blocks. Schema-based formats tend to be worse than
bloaty-prefix-encoded formats here. The key repetition probably hurts
Elixir more, since the map() type is the underlying standard type for many
things, whereas a record in Erlang is a tuple with some compile-time
expansion on top. In short, a record's keys are not repeated in the
encoded term.
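
To make the key repetition concrete, here is a small Python sketch that
hand-encodes a few external term format tags (MAP_EXT, SMALL_TUPLE_EXT,
SMALL_ATOM_UTF8_EXT, SMALL_INTEGER_EXT, LIST_EXT) and compares a list of maps
against a list of record-style tuples. The field names and the `event` record
tag are made up for illustration, it only handles short atoms and integers in
0..255, and it is in no way a replacement for term_to_binary:

```python
import struct

def atom(name):
    # SMALL_ATOM_UTF8_EXT: tag 119, 1-byte length, UTF-8 bytes
    raw = name.encode("utf-8")
    assert len(raw) < 256
    return bytes([119, len(raw)]) + raw

def small_int(n):
    # SMALL_INTEGER_EXT: tag 97, one unsigned byte
    assert 0 <= n <= 255
    return bytes([97, n])

def map_term(pairs):
    # MAP_EXT: tag 116, 4-byte big-endian arity, then key/value pairs
    body = b"".join(atom(k) + small_int(v) for k, v in pairs)
    return bytes([116]) + struct.pack(">I", len(pairs)) + body

def record_term(tag, values):
    # SMALL_TUPLE_EXT: tag 104, 1-byte arity; a record is {Tag, V1, V2, ...}
    body = atom(tag) + b"".join(small_int(v) for v in values)
    return bytes([104, len(values) + 1]) + body

def list_term(elements):
    # LIST_EXT: tag 108, 4-byte length, elements, then a NIL_EXT (106) tail
    assert elements, "an empty list is encoded as a bare NIL_EXT instead"
    header = bytes([108]) + struct.pack(">I", len(elements))
    return header + b"".join(elements) + bytes([106])

def external(term):
    # Every external term starts with the version byte 131
    return bytes([131]) + term

FIELDS = ["user_id", "status_code", "retries"]  # hypothetical field names
rows = [(i % 256, 200, i % 3) for i in range(1000)]

as_maps = external(list_term([map_term(list(zip(FIELDS, r))) for r in rows]))
as_records = external(list_term([record_term("event", r) for r in rows]))

print("maps:", len(as_maps), "bytes; records:", len(as_records), "bytes")
```

Note that the record variant still repeats its tag atom in every tuple, so it
is not free either; it just avoids repeating every field name.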

You might want to look into Joe Armstrong's UBF stack in which you define
the schema as a small stack engine and then interpret that engine to
produce data. It serves as a hybrid in which you have a schema, but since
the stack engine supports a duplicate-instruction, you can repeat keys in
maps if they are the same and so on. In turn, you still have a prefix-like
encoding, but it compresses far better for schemes where you have many
repeated keys.
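
A rough sketch of why a duplicate/recall instruction helps, in Python. To be
clear, this is not UBF's actual instruction set or wire format; the opcodes
M, D, R and I below are invented for this example, and it assumes fewer than
256 distinct keys and integer values in 0..255. The first time a key shows up
it is defined inline, and after that it is recalled by slot number:

```python
def encode(rows):
    """Encode a list of dicts (str keys, int values 0..255) as bytes.

    Invented opcodes, for illustration only:
      M <count>        start of a map with <count> pairs
      D <len> <bytes>  define a key and give it the next free slot
      R <slot>         recall a previously defined key
      I <byte>         small integer value
    """
    slots = {}
    out = bytearray()
    for row in rows:
        out += b"M" + bytes([len(row)])
        for key, value in row.items():
            if key in slots:
                out += b"R" + bytes([slots[key]])
            else:
                slots[key] = len(slots)
                raw = key.encode("utf-8")
                out += b"D" + bytes([len(raw)]) + raw
            out += b"I" + bytes([value])
    return bytes(out)

def decode(data):
    """Inverse of encode: replay the stream, rebuilding the slot table."""
    slots, rows, i = [], [], 0
    while i < len(data):
        assert data[i:i + 1] == b"M"
        count = data[i + 1]
        i += 2
        row = {}
        for _ in range(count):
            op = data[i:i + 1]
            if op == b"D":
                n = data[i + 1]
                key = data[i + 2:i + 2 + n].decode("utf-8")
                slots.append(key)
                i += 2 + n
            elif op == b"R":
                key = slots[data[i + 1]]
                i += 2
            else:
                raise ValueError("unexpected opcode at offset %d" % i)
            assert data[i:i + 1] == b"I"
            row[key] = data[i + 1]
            i += 2
        rows.append(row)
    return rows

rows = [{"user_id": i % 256, "status_code": 200} for i in range(100)]
packed = encode(rows)
assert decode(packed) == rows
print(len(packed), "bytes for", len(rows), "maps")
```

The stream stays prefix-encoded and self-describing, but each key's bytes are
paid for only once.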

If you want to have a header-schema, it is probably worth it to just take
Protobuf3 and see how well that format handles the data. Its encoding
scheme uses varints and ZigZag encoding, which represent integers so that
small integers stay small in the data stream, and the result also
compresses well. So for real-world data, this encoding tends to win.
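
For reference, a minimal Python sketch of the two pieces: base-128 varints
and ZigZag mapping for signed integers (in Protobuf, ZigZag applies to the
sint32/sint64 field types). The function names are my own and this assumes
64-bit signed values; it is a sketch of the technique, not the Protobuf
library's API:

```python
def zigzag_encode(n, bits=64):
    # Map signed to unsigned so small magnitudes stay small:
    # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    return ((n << 1) ^ (n >> (bits - 1))) & ((1 << bits) - 1)

def zigzag_decode(z):
    # Inverse mapping: the low bit carries the sign
    return (z >> 1) ^ -(z & 1)

def varint_encode(u):
    # 7 payload bits per byte, least significant group first;
    # the high bit of each byte says "more bytes follow"
    assert u >= 0
    out = bytearray()
    while True:
        byte = u & 0x7F
        u >>= 7
        if u:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def varint_decode(data, pos=0):
    # Returns (value, position of the next unread byte)
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

for n in (0, -1, 1, -64, 63, 300, -100000):
    wire = varint_encode(zigzag_encode(n))
    value, _ = varint_decode(wire)
    assert zigzag_decode(value) == n
    print(n, "->", len(wire), "byte(s)")
```

Without the ZigZag step, a negative 64-bit integer would set the top bits
and always take the maximum of 10 varint bytes.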

Ephemeral data transfer between nodes could benefit from having a new
format, or an update which packs maps better.

emulator/beam/external.c:BIF_RETTYPE term_to_binary_1(BIF_ALIST_1)

is the place you want to start looking. Be cautious of the following
caveats:

* You must write your code so it can be preempted (see the trap variants)
* The distribution protocol is a bit different and has an atom-cache for
common atoms. I'm not entirely sure this is the entry-point for it.
* We need backwards compatibility. In many cases even small changes to
this format have turned out to break compatibility.
* We might want to rethink ordering properties of the produced binary. I
know this has been a want historically, but I'm not sure we should grant
that wish :P
* For distribution: More pluggability would be really cool to have.

Finally, as for the goal of distributing computation: my advice is still
to distribute the data. If data is distributed, computation distributes
trivially. Moving data around is going to be a major bottleneck going
forward, and the more data you amass, the more you are going to pay moving
that data around. Better distribution formats only shave off a constant
factor, so you eventually hit the same bottleneck in the long run. The
other problem is
that any kind of shared/copied data requires locking or coordination.
Anything involving those parallelizes badly.

-- 
J.