[erlang-questions] term_to_binary and large data structures

Levi Aul levi@REDACTED
Tue Jul 10 22:13:05 CEST 2018


It could be possible to expose the atom cache used by the distribution
protocol for external use by other `term_to_binary/1` callers.

The atom cache is, itself, a kind of streaming dictionary compression: it
turns a protocol or format with one outermost message type (a serialized
term) into one with two outermost message types: either a serialized term
that may contain dictionary references, or an update to the dictionary.

The dictionary, in this case, contains whole terms (atoms) only. This is
essentially the same as what HPACK does for HTTP message streams. HPACK has
a very simple API (see https://github.com/twitter/hpack), and the Erlang
equivalent could be even simpler.
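To make that concrete, here is a minimal sketch of what such a cache API could look like. The module name, the two message shapes (`{ref, N}` vs `{new_entry, N, Bin}`), and the cache representation are all invented for illustration; nothing like this exists in OTP today.

```erlang
%% Hypothetical sketch of an HPACK-style atom dictionary, mirroring what
%% the distribution protocol's atom cache does per connection. Every
%% encoded atom is either a reference into the shared dictionary or a
%% dictionary update carrying the atom's text.
-module(atom_cache_sketch).
-export([new/0, encode/2, decode/2]).

%% An empty cache: {NextRef, Atom->Ref map, Ref->Atom map}.
new() ->
    {0, #{}, #{}}.

%% Encode one atom against the cache. Returns the wire message and the
%% (possibly updated) cache.
encode(Atom, {Next, A2R, R2A} = Cache) ->
    case A2R of
        #{Atom := Ref} ->
            {{ref, Ref}, Cache};
        _ ->
            {{new_entry, Next, atom_to_binary(Atom, utf8)},
             {Next + 1, A2R#{Atom => Next}, R2A#{Next => Atom}}}
    end.

%% Decode either message type back to an atom, replaying dictionary
%% updates so the receiving cache stays in sync with the sender's.
decode({ref, Ref}, {_, _, R2A} = Cache) ->
    {maps:get(Ref, R2A), Cache};
decode({new_entry, Ref, Bin}, {Next, A2R, R2A}) ->
    Atom = binary_to_atom(Bin, utf8),
    {Atom, {max(Next, Ref + 1), A2R#{Atom => Ref}, R2A#{Ref => Atom}}}.
```

Encoding `foo` twice yields `{new_entry, 0, <<"foo">>}` the first time and the much smaller `{ref, 0}` afterwards; the receiver replays the same updates to stay in sync.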

I'll write up a proposal for this.


On Wed, Jul 4, 2018 at 5:56 AM, Aaron Seigo <aseigo@REDACTED> wrote:

> On 2018-07-04 13:23, Michał Muskała wrote:
>
> I also believe the current format for maps, which is key1, value1, key2,
>> value2, ..., is
>> not that great for compression. Often you'd have maps with exactly the
>> same keys
>> (especially in Elixir with structs), and there a pattern of key1, key2,
>> ..., value1,
>> value2, ... should compress much better (since the entire key structure
>> could be compressed
>> once across similar maps).
>>
>
> I can confirm that this is an accurate observation. While Packer does not
> do this, there are notes about it in Packer's code, the result of some
> experiments around the idea. For maps, and *especially* structs in Elixir,
> this can indeed be a huge win for some messages.
>
> Even further afield: what would be a real win, but much harder to
> accomplish, is streaming compression. There are protocols (e.g. IMAP)
> which can offload compression of common patterns between messages to
> entries in the compression lookup tables. The compression is applied to
> the entire network stream for the life of the connection, and all data
> that goes through it is compressed as a single stream. So when a message
> contains the same byte sequence as a previous message, the compressor
> turns it into a reference to an existing entry in a lookup table.
>
> The hard(er) part for BEAM distribution would be managing the size of the
> lookup table, as these connections are meant to be both long-lived and
> not consume infinite resources ;) So unlike (relatively) short-lived and
> highly repetitive IMAP connections, this would probably require something
> custom-made for the task, keeping a cache of the most-used terms (with
> all that comes with that, including cache invalidation).
>
> Compared to just working on the term serialization, that feels a bit like
> rocket science at the moment. But packing the maps within a single
> message more efficiently is definitely doable :)
>
>
> --
> Aaron
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
>
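Michał's key-grouping observation is easy to check empirically. The sketch below (module name and test data are made up for illustration) serializes a list of similar maps twice: once with keys and values interleaved, as the external term format does today, and once with each map's keys grouped ahead of its values, then compares the deflated sizes. With the grouped layout the identical key runs line up across maps, which typically gives the compressor longer matches.

```erlang
%% Compare how well similar maps compress with interleaved vs grouped
%% key/value layouts. Purely illustrative; not an OTP API.
-module(map_layout).
-export([compare/0]).

compare() ->
    Maps = [#{id => N, name => integer_to_binary(N), active => N rem 2 =:= 0}
            || N <- lists:seq(1, 1000)],
    %% Interleaved: key1, value1, key2, value2, ... per map.
    Interleaved = term_to_binary([maps:to_list(M) || M <- Maps]),
    %% Grouped: {[key1, key2, ...], [value1, value2, ...]} per map, so the
    %% byte-identical key lists repeat across maps.
    Grouped = term_to_binary([lists:unzip(lists:sort(maps:to_list(M)))
                              || M <- Maps]),
    {byte_size(zlib:compress(Interleaved)),
     byte_size(zlib:compress(Grouped))}.
```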
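The whole-stream behaviour Aaron describes can be seen directly with OTP's `zlib` module: push two identical messages through one deflate stream and the second comes out much smaller, because it is emitted as back-references into the compressor's sliding window. (The message contents here are arbitrary.)

```erlang
%% Demonstrate shared-dictionary behaviour of a single deflate stream:
%% the second copy of a message compresses to far fewer bytes than the
%% first, since it references bytes already in the window.
-module(stream_compress).
-export([demo/0]).

demo() ->
    Z = zlib:open(),
    ok = zlib:deflateInit(Z),
    Msg = term_to_binary({request, <<"GET /inbox">>, #{user => <<"alice">>}}),
    %% 'sync' flushes each message to the wire, as a connection would.
    First  = iolist_size(zlib:deflate(Z, Msg, sync)),
    Second = iolist_size(zlib:deflate(Z, Msg, sync)),
    _ = zlib:deflate(Z, <<>>, finish),
    ok = zlib:deflateEnd(Z),
    ok = zlib:close(Z),
    {First, Second}.
```

The catch for long-lived distribution connections, as noted above, is that the window (and any explicit lookup table built on the same idea) cannot grow without bound, which is where the cache-eviction problem comes in.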

