[erlang-questions] term_to_binary and large data structures

Wed Jul 4 14:56:08 CEST 2018

On 2018-07-04 13:23, Michał Muskała wrote:

> I also believe the current format for maps, which is key1, value1,
> key2, value2, ... is not that great for compression. Often, you'd
> have maps with exact the same keys (especially in Elixir with
> structs), and there, a pattern of key1, key2, ..., value1,
> value2, ..., should be much better (since the entire keys structure
> could be compressed between similar maps).

I can confirm that this is an accurate observation. Packer does not do 
this yet, but there are notes about it in Packer's code that came out 
of some experiments around this. For maps, and *especially* structs in 
Elixir, this can indeed be a huge win for some messages.
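
To make the effect concrete, here is a rough sketch in Python (not
BEAM code, and not what Packer or term_to_binary actually emit). The
byte layout, the key names and the helper names are all made up for
illustration; the only point is to compare an interleaved layout with
a keys-first layout when many maps share the same keys.

```python
import struct
import zlib

# Hypothetical struct-like keys; every record uses the same ones.
KEYS = ["__struct__", "id", "inserted_at", "updated_at", "status", "owner_id"]


def encode_key(key: str) -> bytes:
    """Toy key encoding: one length byte followed by the UTF-8 bytes."""
    raw = key.encode("utf-8")
    return bytes([len(raw)]) + raw


def encode_value(value: int) -> bytes:
    """Toy value encoding: 4-byte big-endian unsigned integer."""
    return struct.pack(">I", value)


def interleaved(record: dict) -> bytes:
    """key1, value1, key2, value2, ..."""
    return b"".join(encode_key(k) + encode_value(v) for k, v in record.items())


def grouped(record: dict) -> bytes:
    """key1, key2, ..., value1, value2, ..."""
    keys = b"".join(encode_key(k) for k in record)
    values = b"".join(encode_value(v) for v in record.values())
    return keys + values


def make_records(count: int) -> list:
    """Same keys in every record, values that differ from record to record."""
    return [
        {k: ((i * 31 + j) * 2654435761) & 0xFFFFFFFF for j, k in enumerate(KEYS)}
        for i in range(count)
    ]


def compressed_sizes(count: int = 200) -> tuple:
    """Return (interleaved_size, grouped_size) after zlib compression."""
    records = make_records(count)
    a = zlib.compress(b"".join(interleaved(r) for r in records))
    b = zlib.compress(b"".join(grouped(r) for r in records))
    return len(a), len(b)


if __name__ == "__main__":
    inter, grp = compressed_sizes()
    print("interleaved:", inter, "bytes; grouped:", grp, "bytes")
```

Both layouts have the same uncompressed size. In the grouped one the
whole key block repeats as a single run, so the compressor can replace
it with one back-reference per map instead of one per key.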

Even farther afield: what would be a real win, but much harder to 
accomplish, would be streaming compression. There are protocols (e.g. 
IMAP) which can offload compression of common patterns between messages 
to entries in the compressor's lookup table. The compression is applied 
to the entire network stream for the life of the connection, and all 
data that goes through it is compressed as a single stream. So when a 
message contains the same byte sequence as a previous message, the 
compressor ends up turning that into a reference to an already existing 
entry in the lookup table.
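
A minimal sketch of that idea, again in Python and only as an
illustration: one zlib compressor is kept for the life of a
(simulated) connection and flushed after every message, so a repeated
message mostly turns into a back-reference. The class names are
hypothetical and this is not how BEAM distribution works today.

```python
import zlib


class StreamCompressor:
    """One compressor for the whole connection, not one per message."""

    def __init__(self) -> None:
        self._compressor = zlib.compressobj()

    def send(self, message: bytes) -> bytes:
        # Z_SYNC_FLUSH emits everything buffered so far while keeping
        # the history window, so later messages can refer back to it.
        return self._compressor.compress(message) + self._compressor.flush(
            zlib.Z_SYNC_FLUSH
        )


class StreamDecompressor:
    """Mirror of StreamCompressor on the receiving side."""

    def __init__(self) -> None:
        self._decompressor = zlib.decompressobj()

    def receive(self, chunk: bytes) -> bytes:
        return self._decompressor.decompress(chunk)


if __name__ == "__main__":
    tx, rx = StreamCompressor(), StreamDecompressor()
    msg = b"user=alice@example.com;role=admin;status=active;region=eu-west"
    first = tx.send(msg)
    second = tx.send(msg)  # same bytes again on the same stream
    print(len(msg), len(first), len(second))
    assert rx.receive(first) == msg
    assert rx.receive(second) == msg
```

The second send should come out much smaller than the first because
the compressor already has the bytes in its window. That window is
bounded (32 KB for DEFLATE), which keeps memory in check but also
limits how far back a reference can reach.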

The hard(er) part for BEAM distribution and this sort of thing would be 
managing the size of the lookup table, as these connections are meant to 
be both long-lived and not consume infinite resources ;) So unlike 
(relatively) short-lived and highly repetitive IMAP connections, this 
would probably require something custom-made for the task which would 
keep a cache of the most used terms (with all that comes with that, 
including cache invalidation).
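
One possible shape for such a cache, sketched in Python under heavy
assumptions: a fixed number of slots, least-recently-used eviction on
the sending side, and invalidation done implicitly by overwriting a
slot. The packet tuples and class names are invented for this sketch
and do not correspond to any existing distribution protocol feature.

```python
from collections import OrderedDict


class SenderCache:
    """Bounded term cache; the sender decides which slot a term lives in."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._slots = OrderedDict()  # term -> slot, least recently used first

    def encode(self, term: bytes) -> tuple:
        if term in self._slots:
            # Cache hit: mark as recently used and send only the slot id.
            self._slots.move_to_end(term)
            return ("ref", self._slots[term])
        if len(self._slots) < self.capacity:
            slot = len(self._slots)  # still filling up: take the next slot
        else:
            _, slot = self._slots.popitem(last=False)  # evict LRU, reuse slot
        self._slots[term] = slot
        return ("def", slot, term)


class ReceiverCache:
    """Follows the sender: a 'def' packet overwrites whatever was in the slot."""

    def __init__(self) -> None:
        self._table = {}

    def decode(self, packet: tuple) -> bytes:
        if packet[0] == "def":
            _, slot, term = packet
            self._table[slot] = term
            return term
        return self._table[packet[1]]


if __name__ == "__main__":
    sender, receiver = SenderCache(capacity=2), ReceiverCache()
    for term in [b"a", b"b", b"a", b"c", b"b"]:
        packet = sender.encode(term)
        assert receiver.decode(packet) == term
        print(packet)
```

Because only the sender picks slots, the receiver needs no eviction
policy of its own. The sketch relies on packets arriving in order on a
single connection and says nothing about reconnects, which is where
the real invalidation problems would start.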

Compared to just working on the term serialization, that feels a bit 
like rocket science at the moment. But getting maps in the same message 
more efficiently packed is definitely doable :)

--
Aaron