<html>
<head>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">On 04/24/2017 11:51 PM, Michał Muskała
wrote:<br>
</div>
<blockquote cite="mid:b0293851-28fc-4b57-8fb0-825c8e668c54@Spark"
type="cite">
<title></title>
<div name="messageBodySection" style="font-size: 14px;
font-family: -apple-system, BlinkMacSystemFont, sans-serif;">Hello
everybody.
<div><br>
</div>
<div>The sets module implements an opaque structure to represent
sets. The current implementation is based on hashes and nested
tuples. Since OTP 18, we have maps that are an excellent and
performant data structure for storing arbitrary keys &
values. This means, they can be used to efficiently implement
sets, by storing set elements as keys with some default
value. </div>
<div><br>
</div>
<div>My measurements [1] show that a set implementation based on
maps is 2-4 times faster in all operations than the current
implementation from the sets module. Comparing performance
tradeoffs to other set implementations, it seems that
map-based sets have exactly the same characteristics as
current sets, except they are faster. Map-based sets also have
a huge advantage of being easily readable when printed in the
console. Given the sets data structure is opaque it should be
possible to just switch implementations and make everybody's
code faster without changing a line of code. The only issue
might be incorrect uses of the sets module that assume some
particular shape for the structure - I'm not sure how
widespread that is.</div>
<div><br>
</div>
<div>Another possibility would be to implement a new map_sets
module, but I think 4 set implementation is already too many -
no need to add another. There's also no need to write a new
implementation since there's already a well tested, map-based
set implementation available in OTP in the cerl_sets module in
the compiler application [2].</div>
<div><br>
</div>
<div>A similar consideration could be made for the dict module,
but I haven't looked closely at the performance differences
there - I would expect something similar, though, since sets
and dict use a similar implementation.</div>
</div>
<div name="messageSignatureSection" style="font-size: 14px;
font-family: -apple-system, BlinkMacSystemFont, sans-serif;"><br>
Regards,
<div>Michał.</div>
<div><br>
</div>
<div>[1]: <a moz-do-not-send="true"
href="https://gist.github.com/michalmuskala/9b0b55eee25b94d9339edbf7a0e261ea">https://gist.github.com/michalmuskala/9b0b55eee25b94d9339edbf7a0e261ea</a></div>
<div>[2]: <a moz-do-not-send="true"
href="https://github.com/erlang/otp/blob/master/lib/compiler/src/cerl_sets.erl">https://github.com/erlang/otp/blob/master/lib/compiler/src/cerl_sets.erl</a></div>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
erlang-questions mailing list
<a class="moz-txt-link-abbreviated" href="mailto:erlang-questions@erlang.org">erlang-questions@erlang.org</a>
<a class="moz-txt-link-freetext" href="http://erlang.org/mailman/listinfo/erlang-questions">http://erlang.org/mailman/listinfo/erlang-questions</a>
</pre>
</blockquote>
<br>
It is best to avoid modifying dict and sets, due to unintended
consequences. One I can think of is the impact on memory allocation
when a dict or sets is sent through distributed Erlang, since the
memory allocation for a map larger than 32 elements (<code>MAP_SMALL_MAP_LIMIT</code>)
gets an allocation from <code>HASHMAP_ESTIMATED_HEAP_SIZE(SIZE) ==
</code><code>(SIZE*3 + (2*SIZE/5)*2) which can severely </code>exaggerate
the size of the map (details in<code> erts/emulator/beam/erl_map.h</code>).
There may be other impacts also, but that is unclear. Keeping maps
and dict/sets separate should be safest to avoid unintended changes
in legacy source code.<br>
<br>
Best Regards,<br>
Michael<br>
<br>
</body>
</html>