[erlang-questions] term_to_binary and large data structures

Wed Jun 27 23:24:35 CEST 2018

On 2018-06-27 17:05, Jesper Louis Andersen wrote:
> The map() type now has iterators, so you can gradually iterate over the 
> map rather
> than having to convert it all at once. Maybe that is what is helping 
> you.

That could well be it.
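
For anyone following along, the iterator API referred to above is 
`maps:iterator/1` and `maps:next/1`, added in OTP 21. Below is a minimal 
sketch of walking a map in fixed-size batches with it. The module and 
function names are hypothetical, chosen only for illustration, and nothing 
here is taken from an actual code base:

```erlang
-module(map_chunks).
-export([first/2, next/2]).

%% Take up to N key/value pairs from Map.
%% Returns {Pairs, Continuation}; Continuation is 'done' once the
%% iterator has been seen to be exhausted.
first(Map, N) when is_map(Map), is_integer(N), N > 0 ->
    next(maps:iterator(Map), N).

%% Resume from a continuation returned by first/2 or next/2.
next(done, _N) ->
    {[], done};
next(Iter, N) when is_integer(N), N > 0 ->
    take(Iter, N, []).

%% Batch is full: hand back the iterator as the continuation.
take(Iter, 0, Acc) ->
    {Acc, Iter};
take(Iter, N, Acc) ->
    case maps:next(Iter) of
        none -> {Acc, done};
        {K, V, Next} -> take(Next, N - 1, [{K, V} | Acc])
    end.
```

The continuation in this sketch should be treated as an opaque value held 
by the process doing the iteration. It only bounds how much is converted 
per step; it does nothing about the encoded size of each batch.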

> However, I'd strongly recommend you start building up a scheme in which 
> you chunk the
> large messages into smaller messages with some kind of continuation 
> token.

We already do. That does not resolve the real issue, though, which is 
bandwidth usage. Chunking the messages just produces smaller pieces of 
bloat. The total bloat is exactly the same, and it easily inundates Gbit 
and even 10Gbit networking. Except now we have the _added overhead_ of 
more messages.

It's merely a way to shuffle forward, not a path to anything scalable.

> Large messages are bound to create trouble at some point.

Yes, if unbounded, I would agree. However, that is not our case.

We have maps with 10k keys that strain this system and easily saturate 
our network. This is not "big" by any modern definition. As a 
demonstration of this to ourselves, I wrote an Elixir library that 
serializes terms to a more space-efficient format. Where 
`term_to_binary` creates 500MB monsters, this library conveniently 
creates a 1.5MB binary out of the exact same data.
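
A rough way to look at encoded sizes on your own data is sketched below. 
It assumes integer keys and values purely for illustration, so it does not 
reproduce the 500MB case above, and the module name is hypothetical. It 
compares plain `term_to_binary/1` with the built-in `compressed` option of 
`term_to_binary/2`, which applies zlib compression on top of the same 
external term format rather than using a different encoding:

```erlang
-module(etf_size).
-export([sizes/1]).

%% Build a map with N integer keys and report the encoded size in bytes
%% as {Plain, Compressed}.
sizes(N) when is_integer(N), N > 0 ->
    Map = maps:from_list([{K, K * 2} || K <- lists:seq(1, N)]),
    %% Default external term format encoding.
    Plain = byte_size(term_to_binary(Map)),
    %% Same format, with zlib compression at the highest level.
    Compressed = byte_size(term_to_binary(Map, [{compressed, 9}])),
    {Plain, Compressed}.
```

Calling `etf_size:sizes(10000)` in a shell gives the two sizes for a 10k 
key map. Swapping in a representative term from a real workload is what 
makes the numbers meaningful.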

In fact, for nearly any term you throw at it, this pretty simple 
algorithm produces smaller serialized data. You can see the format 
employed here:

      https://github.com/aseigo/packer/blob/develop/FORMAT.md

That it routinely produces results anywhere from 33% to 99% (!!) 
smaller shows how problematic the current external term format is. 
Unfortunately, this is "only" an Elixir implementation and so is not 
very fast at this point. The point of the exercise was to see what a 
reasonable term serializer could produce, specifically to see if there 
was any improvement to be had.

Apparently there is quite a bit. The external term format is widely 
used, so improvements in it could be far reaching.

How difficult would it be to change the external term format, based on 
e.g. the versioning in the distribution header? Would it be possible to 
make term serialization pluggable, as more and more of the rest of the 
distribution framework in the BEAM has become in OTP 21?

> You can also toy with the
> idea of moving the code to the data rather than data to the code.

Our goal is to distribute computation, so that would be 
counter-productive.

--
Aaron