[erlang-questions] Changing the representation of sets

Michał Muskała michal@REDACTED
Tue Apr 25 18:40:47 CEST 2017


I checked how the memory sizes compare.

Map-based sets are indeed larger - about 70% larger in memory than current sets, however they are smaller when serialised with term_to_binary (assuming using [] for the bogus value) by about 5%. That is when storing plain integers - when storing bigger terms, the difference quickly melts.

Maps are smaller than dict - by about 12% in memory and 40% when serialised.

Michał.

On 25 Apr 2017, 09:18 +0200, Michael Truog <mjtruog@REDACTED>, wrote:
> On 04/24/2017 11:51 PM, Michał Muskała wrote:
> > Hello everybody.
> >
> > The sets module implements an opaque structure to represent sets. The current implementation is based on hashes and nested tuples. Since OTP 18, we have maps that are an excellent and performant data structure for storing arbitrary keys & values. This means, they can be used to efficiently implement sets, by storing set elements as keys with some default value.
> >
> > My measurements [1] show that a set implementation based on maps is 2-4 times faster in all operations than the current implementation from the sets module. Comparing performance tradeoffs to other set implementations, it seems that map-based sets have exactly the same characteristics as current sets, except they are faster. Map-based sets also have a huge advantage of being easily readable when printed in the console. Given the sets data structure is opaque it should be possible to just switch implementations and make everybody's code faster without changing a line of code. The only issue might be incorrect uses of the sets module that assume some particular shape for the structure - I'm not sure how widespread that is.
> >
> > Another possibility would be to implement a new map_sets module, but I think 4 set implementation is already too many - no need to add another. There's also no need to write a new implementation since there's already a well tested, map-based set implementation available in OTP in the cerl_sets module in the compiler application [2].
> >
> > A similar consideration could be made for the dict module, but I haven't looked closely at the performance differences there - I would expect something similar, though, since sets and dict use a similar implementation.
> >
> > Regards,
> > Michał.
> >
> > [1]: https://gist.github.com/michalmuskala/9b0b55eee25b94d9339edbf7a0e261ea
> > [2]: https://github.com/erlang/otp/blob/master/lib/compiler/src/cerl_sets.erl
> >
> >
> >
> > _______________________________________________
> > erlang-questions mailing list
> > erlang-questions@REDACTED
> > http://erlang.org/mailman/listinfo/erlang-questions
>
> It is best to avoid modifying dict and sets, due to unintended consequences.  One I can think of is the impact on memory allocation when a dict or sets is sent through distributed Erlang, since the memory allocation for a map larger than 32 elements (MAP_SMALL_MAP_LIMIT) gets an allocation from HASHMAP_ESTIMATED_HEAP_SIZE(SIZE) == (SIZE*3 + (2*SIZE/5)*2) which can severely exaggerate the size of the map (details in erts/emulator/beam/erl_map.h).  There may be other impacts also, but that is unclear.  Keeping maps and dict/sets separate should be safest to avoid unintended changes in legacy source code.
>
> Best Regards,
> Michael
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20170425/a2a25147/attachment.htm>


More information about the erlang-questions mailing list