<html>

  <head>

    <meta content="text/html; charset=UTF-8" http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <div class="moz-cite-prefix">On 04/24/2017 11:51 PM, Michał Muskała

      wrote:<br>

    </div>

    <blockquote cite="mid:b0293851-28fc-4b57-8fb0-825c8e668c54@Spark"

      type="cite">

      <title></title>

      <div name="messageBodySection" style="font-size: 14px;

        font-family: -apple-system, BlinkMacSystemFont, sans-serif;">Hello

        everybody.

        <div><br>

        </div>

        <div>The sets module implements an opaque structure to represent

          sets. The current implementation is based on hashes and nested

          tuples. Since OTP 18, we have maps that are an excellent and

          performant data structure for storing arbitrary keys &

          values. This means, they can be used to efficiently implement

          sets, by storing set elements as keys with some default

          value. </div>

        <div><br>

        </div>

        <div>My measurements [1] show that a set implementation based on

          maps is 2-4 times faster in all operations than the current

          implementation from the sets module. Comparing performance

          tradeoffs to other set implementations, it seems that

          map-based sets have exactly the same characteristics as

          current sets, except they are faster. Map-based sets also have

          a huge advantage of being easily readable when printed in the

          console. Given the sets data structure is opaque it should be

          possible to just switch implementations and make everybody's

          code faster without changing a line of code. The only issue

          might be incorrect uses of the sets module that assume some

          particular shape for the structure - I'm not sure how

          widespread that is.</div>

        <div><br>

        </div>

        <div>Another possibility would be to implement a new map_sets

          module, but I think 4 set implementation is already too many -

          no need to add another. There's also no need to write a new

          implementation since there's already a well tested, map-based

          set implementation available in OTP in the cerl_sets module in

          the compiler application [2].</div>

        <div><br>

        </div>

        <div>A similar consideration could be made for the dict module,

          but I haven't looked closely at the performance differences

          there - I would expect something similar, though, since sets

          and dict use a similar implementation.</div>

      </div>

      <div name="messageSignatureSection" style="font-size: 14px;

        font-family: -apple-system, BlinkMacSystemFont, sans-serif;"><br>

        Regards,

        <div>Michał.</div>

        <div><br>

        </div>

        <div>[1]: <a moz-do-not-send="true"

href="https://gist.github.com/michalmuskala/9b0b55eee25b94d9339edbf7a0e261ea">https://gist.github.com/michalmuskala/9b0b55eee25b94d9339edbf7a0e261ea</a></div>

        <div>[2]: <a moz-do-not-send="true"

href="https://github.com/erlang/otp/blob/master/lib/compiler/src/cerl_sets.erl">https://github.com/erlang/otp/blob/master/lib/compiler/src/cerl_sets.erl</a></div>

      </div>

      <br>

      <fieldset class="mimeAttachmentHeader"></fieldset>

      <br>

      <pre wrap="">_______________________________________________

erlang-questions mailing list

<a class="moz-txt-link-abbreviated" href="mailto:erlang-questions@erlang.org">erlang-questions@erlang.org</a>

<a class="moz-txt-link-freetext" href="http://erlang.org/mailman/listinfo/erlang-questions">http://erlang.org/mailman/listinfo/erlang-questions</a>

</pre>

    </blockquote>

    <br>

    It is best to avoid modifying dict and sets, due to unintended

    consequences.  One I can think of is the impact on memory allocation

    when a dict or sets is sent through distributed Erlang, since the

    memory allocation for a map larger than 32 elements (<code>MAP_SMALL_MAP_LIMIT</code>)

    gets an allocation from <code>HASHMAP_ESTIMATED_HEAP_SIZE(SIZE) ==

    </code><code>(SIZE*3 + (2*SIZE/5)*2) which can severely </code>exaggerate

    the size of the map (details in<code> erts/emulator/beam/erl_map.h</code>). 

    There may be other impacts also, but that is unclear.  Keeping maps

    and dict/sets separate should be safest to avoid unintended changes

    in legacy source code.<br>

    <br>

    Best Regards,<br>

    Michael<br>

    <br>

  </body>

</html>