[erlang-questions] term_to_binary and large data structures
Aaron Seigo
aseigo@REDACTED
Mon Jul 2 16:07:20 CEST 2018
On 2018-06-28 12:53, Jesper Louis Andersen wrote:
> On Wed, Jun 27, 2018 at 11:24 PM Aaron Seigo <aseigo@REDACTED>
> wrote:
>
>> In fact, for nearly any term you throw at it, this pretty simple
>> algorithm produces smaller serialized data. You can see the format
>> employed here:
>>
>> https://github.com/aseigo/packer/blob/develop/FORMAT.md
>
> That is also a format with different properties. The external term
> format doubles as
> an on-disk format where you might have the need to be robust against a
> few bad blocks.
There are ways to achieve that with schema-based formats as well, of
course... but yes, as it currently stands the format above is not great
as an on-disk format. I would argue, though, that the external term
format (ETF?) :) is also rather lacking for use on disk unless one adds
to the base format.
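(For the curious: a common way to bolt that robustness onto either
format is to length-prefix each record and append a checksum; a minimal
sketch in Erlang, not part of either format:)

  %% Frame a serialized blob with a length prefix and a CRC32 so that
  %% corrupted records can be detected and skipped on read.
  frame(Blob) when is_binary(Blob) ->
      <<(byte_size(Blob)):32, Blob/binary, (erlang:crc32(Blob)):32>>.

  unframe(<<Len:32, Blob:Len/binary, Crc:32, Rest/binary>>) ->
      case erlang:crc32(Blob) of
          Crc -> {ok, Blob, Rest};
          _   -> {error, corrupt}
      end.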
> Schema-based formats tend to be worse than bloaty-prefix-encoded
> formats here. It
> probably hurts Elixir more since the map() type is the underlying
> standard type for
> many things whereas a record in Erlang is a tuple with some compile
> time expansion on
> top. In short, keys are not repeated in this format.
Yes, records have some benefits indeed. However, the problem is not just
maps. In fact, tuples are a big problem themselves. The ETF uses one
byte for the type code (104 for tuples with 0-255 elements; 105 for
larger tuples) plus either a 1-byte or a 4-byte integer containing the
arity. That is 1 byte more than necessary for small tuples, and 4 extra
bytes for larger ones .. which is made more odd by the fact that tuples
can only have 2^24 elements maximum, so 3 bytes of arity would suffice;
reading whole words is theoretically faster, so perhaps that was the
motivation to tack on the extra byte ...
Still, in the common case it takes 2 bytes to encode an empty (!) tuple,
and 2 bytes of overhead for a small tuple (the common case, IME). So a
tuple like {1,2} requires 6 bytes to encode, and {1,2,3} takes 8 ...
with the compressed-metadata approach this becomes 6 bytes for either of
those tuples. Stuff a list with a bunch of tuples and the overhead adds
up amazingly quickly: [{1, 2}, {3, 4}, {5, 6}] takes 25 bytes in the
ETF, and 18 with the packer library. Add one more similar tuple and we
end up with 31 (!) bytes for ETF, and 20 with packer.
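(Those numbers are easy to check in the shell; note that
term_to_binary/1 prepends a version byte, 131, so a lone {1,2} shows up
as 7 bytes even though the tuple's own encoding is the 6 bytes above:)

  1> term_to_binary({1,2}).
  <<131,104,2,97,1,97,2>>
  2> byte_size(term_to_binary([{1,2},{3,4},{5,6}])).
  25
  3> byte_size(term_to_binary([{1,2},{3,4},{5,6},{7,8}])).
  31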
There are a wide range of inefficiencies in the ETF, so hopefully we
don't get sidetracked by maps alone.
(In our use case, it was tuples killing us, not maps!)
> You might want to look into Joe Armstrong's UBF stack in which you
> define the schema
> as a small stack engine and then interpret that engine to produce data.
> It serves as a
> hybrid in which you have a schema, but since the stack engine supports
> a duplicate-
> instruction, you can repeat keys in maps if they are the same and so
> on. In turn, you
Repeating data is easily, and usually more efficiently, compressed using
a traditional compression tool. Packer uses zstd, and it does a lovely
job of this. It is really the metadata encoding where the waste lies ...
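(ETF itself exposes zlib for exactly this via term_to_binary/2, which
makes the point nicely: the repetition compresses away, but only after
all the per-term metadata has been generated:)

  T = lists:duplicate(100, {1, 2}),
  byte_size(term_to_binary(T)),                %% 607 bytes
  byte_size(term_to_binary(T, [compressed])).  %% roughly a few dozen bytes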
> If you want to have a header-schema, it is probably worth it to just
> take Protobuf3
> and see how well that format handles the data.
There are two major problems with using this sort of solution:
* none (that we are aware of) supports Erlang's range of data types,
tuples in particular. So you are back to being "on your own". I would
love to have had an off-the-shelf solution :)
* we cannot (for reasons of practicality) always know the type of data
we will want to encode for a message, so it is not possible to create
the sort of schema definitions these systems want (and which are an
important source of their efficiency and safety).
So while they 100% make sense in the right use cases, they are not a
good fit for a general-purpose Erlang term serializer.
> It has an encoding scheme, varints and ZigZag encoding, which
> represents integers in a way that makes small integers small in the
> data stream, and also compresses well. So for real-world data, this
> encoding tends to win.
Packer does the small integers thing as well :)
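(For readers unfamiliar with the trick, a minimal sketch of zigzag +
varint in Erlang; this is the general idea, not Packer's actual code:)

  %% zigzag maps signed to unsigned: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3 ...
  zigzag(N) when N >= 0 -> N * 2;
  zigzag(N)             -> -N * 2 - 1.

  %% varint (for non-negative N, e.g. after zigzag): 7 payload bits per
  %% byte, with the high bit set on every byte except the last.
  varint(N) when N >= 0, N < 16#80 -> <<N>>;
  varint(N) when N >= 0 ->
      <<1:1, (N band 16#7F):7, (varint(N bsr 7))/binary>>.

So -1 becomes 1 and fits in a single byte, where a two's-complement
encoding would spend a full word on it.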
> Ephemeral data transfer between nodes could benefit from having a new
> format, or an
> update which packs maps better.
... and tuples ;)
Given that Packer does an equivalent or better job in terms of the
resulting number of bytes for nearly every single term (and most of the
remaining exceptions are addressable with simple additions, such as
type codes for the empty list or for lists of small numbers, a la ETF),
it is clear it is not just maps.
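(ETF's special cases are easy to see in the shell: the empty list gets a
dedicated one-byte tag, and a list of small integers packs flat as
STRING_EXT:)

  1> term_to_binary([]).
  <<131,106>>
  2> term_to_binary([1,2,3]).
  <<131,107,0,3,1,2,3>>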
> emulator/beam/external.c:BIF_RETTYPE term_to_binary_1(BIF_ALIST_1)
Cool; I will take a look .. but the reason I wrote to the list first was
in hopes of finding out whether there is appetite for such an
improvement. IOW: are the odds of a quality implementation being
upstreamed high enough to warrant our effort there? :)
> * You must write your code so it can be preempted (see the trap
> variants)
Yep ...
> * The distribution protocol is a bit different and has an atom-cache
> for common atoms.
Yes, this is one place the distribution protocol has a leg up on packer
due to access to the atom cache.
> I'm not entirely sure this is the entry-point for it
Even if not, messages for distribution should utilize that cache?
> * We need backwards compatibility. In many cases even small changes to
> this format has
> proven to be riddled with loss of compatibility.
My hopes go as follows: use versioning in the distribution header to
indicate that a newer serialization format is in use, and fall back to
the existing format when it is not indicated. As there would be no
change in the data types expressible, this should only alter the
representation of the data on the wire.
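(On the decoding side this is just a dispatch on the leading version
byte; a hypothetical sketch, with 132 standing in for whatever tag a new
format would actually be assigned:)

  %% 131 is the existing ETF version byte; 132 and new_format_decode/1
  %% are hypothetical placeholders for a newer format.
  decode(<<131, _/binary>> = Bin) -> binary_to_term(Bin);
  decode(<<132, Rest/binary>>)    -> new_format_decode(Rest).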
> * We might want to rethink ordering properties of the produced binary.
> I know this has
> been a want historically, but I'm not sure we should grant that wish :P
Do you have pointers to discussions on this so I can understand what you
mean by "ordering properties" better? (Bug report, mailing list thread,
etc.?)
> * For distribution: More plugability would be really cool to have.
Agreed :)
> Finally, as for the goal of distributing computation: distribute data
> is still my
> advice. If data is distributed, computation distributes trivially.
Unfortunately for us, our compute is expensive relative to the cost of
the data. A small amount of data ends up creating a large amount of
computation, which produces other not-very-big sets of data (I don't
consider 100k pairs to be particularly large, esp. when that can be
serialized to <2MB, just as I don't think anyone seriously considers a
1GB SQL database to be large; but maybe that's just my perspective :),
which in turn kicks off even more computation ... The cost of
computation is such that we need many machines working together on
shared problems.
Our problem occupies a space somewhere between more typical Erlang use
cases (messaging, e.g.) and scientific computing, as it is neither but
has some similar properties to both.
As a result, for us it isn't enough to keep data local and move
computation to the data.
> Better distribution formats just shaves a constant factor
Indeed; however, currently that is a MASSIVE constant. With the data we
have, it means the difference between 2.5mps (messages per second) and
833mps on a 10GbE network. So the gap between the current ceiling and
the new ceiling would be massive. Long term you are absolutely correct,
though, when you say:
> so you eventually hit the same bottleneck in the long run. The
Pre-compute sharding of the data space is an approach we use to avoid,
or at least significantly delay beyond our requirements, the onset of
that bottleneck ...
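(A minimal sketch of that idea, not our actual code: hash each work key
to a fixed owner node up front, so shards can proceed without shared
state:)

  %% Deterministically route the work for Key to one of the nodes;
  %% erlang:phash2/2 returns 0..length(Nodes)-1.
  owner(Key, Nodes) ->
      lists:nth(erlang:phash2(Key, length(Nodes)) + 1, Nodes).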
> problem is that any kind of shared/copied data requires locking or
> coordination.
> Anything involving those parallelizes badly.
This is not a property of the data in this application, other than the
job queues, and that's only because we have those distributed for the
purposes of durability. They also represent ~0% of the run time in the
system, so the overhead there is ok.
Thanks for the insights there, though, as they are .. well .. insightful
;) and bang-on for most applications.
Cheers,
--
Aaron