[erlang-questions] term_to_binary and large data structures
Aaron Seigo
aseigo@REDACTED
Wed Jun 27 23:24:35 CEST 2018
On 2018-06-27 17:05, Jesper Louis Andersen wrote:
> The map() type now has iterators, so you can gradually iterate over the
> map rather than having to convert it all at once. Maybe that is what is
> helping you.
That could well be it.
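
For readers unfamiliar with the iterators mentioned in the quote, here is a minimal sketch of walking a map one entry at a time with `maps:iterator/1` and `maps:next/1` (both available since OTP 21). The module and function names are made up for illustration; this is not code from our system.

```erlang
-module(map_iter_sketch).
-export([fold_lazily/3]).

%% Fold over a map one entry at a time using the iterator API,
%% instead of converting the whole map to a list first.
%% Fun is called as Fun(Key, Value, Acc) and returns the new Acc.
fold_lazily(Fun, Acc, Map) when is_function(Fun, 3), is_map(Map) ->
    walk(maps:next(maps:iterator(Map)), Fun, Acc).

%% maps:next/1 returns `none` once the iterator is exhausted.
walk(none, _Fun, Acc) ->
    Acc;
walk({Key, Value, NextIter}, Fun, Acc) ->
    walk(maps:next(NextIter), Fun, Fun(Key, Value, Acc)).
```

For example, `map_iter_sketch:fold_lazily(fun(_K, V, A) -> V + A end, 0, Map)` sums the values of `Map`.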
> However, I'd strongly recommend you start building up a scheme in which
> you chunk the large messages into smaller messages with some kind of
> continuation token.
We already do. That does not resolve the real issue, however, which is
bandwidth usage. Chunking the messages just makes smaller bits of bloat.
The total bloat is exactly the same and easily inundates Gbit and even
10Gbit networking, except now we have the _added overhead_ of more
messages.
It's merely a way to shuffle forward, not a path to anything scalable.
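
To make the chunking idea concrete, here is a rough sketch of splitting a map into chunks of at most N entries, with a continuation handed back alongside each chunk. Everything here is hypothetical: the module name, the function names, and the use of a local map iterator as the "token" are assumptions for illustration. A real scheme between nodes would need its own token format.

```erlang
-module(chunk_sketch).
-export([first_chunk/2, next_chunk/2]).

%% Return {Entries, Continuation} for the first chunk of a map.
%% Continuation is either the atom `done` or an opaque token
%% (here simply a map iterator, which is an assumption of this sketch).
first_chunk(Map, N) when is_map(Map), is_integer(N), N > 0 ->
    next_chunk(maps:iterator(Map), N).

%% Fetch the next chunk of at most N entries from a continuation.
next_chunk(done, _N) ->
    {[], done};
next_chunk(Iter, N) when is_integer(N), N > 0 ->
    take(maps:next(Iter), N, []).

%% Iterator exhausted: return what we have and signal completion.
take(none, _N, Acc) ->
    {lists:reverse(Acc), done};
%% Chunk is full: hand back the iterator as the continuation.
take({K, V, Iter}, 1, Acc) ->
    {lists:reverse([{K, V} | Acc]), Iter};
take({K, V, Iter}, N, Acc) ->
    take(maps:next(Iter), N - 1, [{K, V} | Acc]).
```

Each chunk still has to be serialized before it goes over the wire, so the per-entry encoding cost is unchanged by this.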
> Large messages are bound to create trouble at some point.
Yes, if unbounded, I would agree. However, that is not our case.
We have maps with 10k keys that strain this system and easily saturate
our network. This is not "big" by any modern definition. As a
demonstration of this to ourselves, I wrote an Elixir library that
serializes terms to a more space-efficient format. Where
`term_to_binary` creates 500MB monsters, this library conveniently
creates a 1.5MB binary out of the exact same data.
In fact, for nearly any term you throw at it, this pretty simple
algorithm produces smaller serialized data. You can see the format
employed here:
https://github.com/aseigo/packer/blob/develop/FORMAT.md
The fact that it routinely produces results anywhere from 33% to 99% (!!)
smaller shows how problematic the current external term format is.
Unfortunately, this is "only" an Elixir implementation and so is not
very fast at this point. The point of the exercise was to see what a
reasonable term serializer could produce, specifically to see if there
was any improvement to be had.
Apparently there is quite a bit. The external term format is widely
used, so improvements in it could be far reaching.
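
For anyone who wants to measure this on their own data, here is a small sketch that reports the size of a term in the external term format, plain and with the built-in `compressed` option of `term_to_binary/2`. The module name and the sample data are made up; it does not reproduce the figures above, and zlib compression is not the format that packer uses.

```erlang
-module(etf_size_sketch).
-export([sizes/1, sample_map/1]).

%% Build a map with N integer keys. The values are invented sample data,
%% not the shape of our real messages.
sample_map(N) when is_integer(N), N >= 0 ->
    maps:from_list([{K, {K, <<"value">>}} || K <- lists:seq(1, N)]).

%% Return {PlainBytes, CompressedBytes} for the serialized term.
sizes(Term) ->
    Plain = byte_size(term_to_binary(Term)),
    Compressed = byte_size(term_to_binary(Term, [compressed])),
    {Plain, Compressed}.
```

Calling `etf_size_sketch:sizes(etf_size_sketch:sample_map(10000))` gives a quick baseline to compare another serializer against.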
How difficult would it be to change the external term format, based on
e.g. the versioning in the distribution header? Would it be possible to
make term serialization pluggable, as more and more of the rest of the
distribution framework in the BEAM has become in OTP 21?
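
As a point of reference for the versioning question, the output of `term_to_binary/1` already starts with a one-byte version tag, currently 131. The sketch below only reads that byte; the module name is made up, and whether a new format could be introduced by changing or negotiating this tag is an open assumption on my part, not something the current runtime offers.

```erlang
-module(etf_version_sketch).
-export([version/1]).

%% Extract the leading version byte from a binary produced by
%% term_to_binary/1. For the current external term format this is 131.
version(<<Version:8, _Rest/binary>>) ->
    Version.
```

For example, `etf_version_sketch:version(term_to_binary(hello))` returns 131 on current releases.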
> You can also toy with the idea of moving the code to the data rather
> than data to the code.
Our goal is to distribute computation, so that would be
counter-productive.
--
Aaron