<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <br>

    <div class="moz-cite-prefix">On 25/09/2017 15:48, Jesper Louis

      Andersen wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:CAGrdgiU8rV9b9r+ZsVWo+pef=yBjZ_hCvEVW=MUYPTWhohbAjA@mail.gmail.com">

      <div dir="ltr">

        <div class="gmail_quote">

          <div dir="ltr">On Mon, Sep 25, 2017 at 8:35 AM Grzegorz Junka

            <<a href="mailto:list1@gjunka.com" moz-do-not-send="true">list1@gjunka.com</a>>

            wrote:<br>

          </div>

          <blockquote class="gmail_quote" style="margin:0 0 0

            .8ex;border-left:1px #ccc solid;padding-left:1ex">

            <div text="#000000" bgcolor="#FFFFFF">

              <p>Fair enough, I missed that you can set ordered_set in

                ETS, but nevertheless it won't work with maps. However,

                even if I use ETS the Key is still stored twice, isn't?

                (once as the key for ordered_set, once as the value when

                the Id is the key).<br>

              </p>

            </div>

            <div text="#000000" bgcolor="#FFFFFF"> <br>

            </div>

          </blockquote>

        </div>

        <div class="gmail_quote"><br>

        </div>

        <div class="gmail_quote">It depends. If it is a large binary

          (larger than 64 characters) it will go on the binary heap once

          unless you form it again and again in your code. Otherwise it

          will take up the double amount of space.</div>

      </div>

    </blockquote>

    <br>

    I don't think it's that easy. If ordered_set is using B+Tree then

    the key would be split into segments to annotate nodes of the tree.

    But the value would be left untouched. Binaries could potentially be

    reused by referencing segments of the same binary. So if binaries

    would be reused or not depends on the actual implementation of

    ordered_set. Also, with ETS the data must be copied between the

    database and the process. It's possible that the VM will not

    actually copy the binary but instead will create another reference

    to it in the process. But this is all valid only for binaries. For

    any other Erlang term what I wrote earlier would hold. I am not sure

    I want to rely on a solution with so many unknowns.<br>

    <br>

    <blockquote type="cite"

cite="mid:CAGrdgiU8rV9b9r+ZsVWo+pef=yBjZ_hCvEVW=MUYPTWhohbAjA@mail.gmail.com">

      <div dir="ltr">

        <div class="gmail_quote"><br>

        </div>

        <div class="gmail_quote">We still don't know what the underlying

          problem statement is. This makes it harder to solve because

          whenever we come up with a solution, a new constraint is added

          and we have to adapt.</div>

      </div>

    </blockquote>

    <br>

    I did state all the constraints I have:<br>

    <br>

    In short, I have keys, which may be any Erlang terms, and numerical

    Ids assigned to those terms. Keys must be sorted. Numerical Ids are

    increasing monotonically. Then the following lookups should be

    efficient:<br>

    <br>

    1. Having a key quickly search for its numeric Id<br>

    2. Having a numeric Id quickly get back the key<br>

    <br>

    Also the following conditions should be met:<br>

    <br>

    3. The same key should be always assigned the same numeric Id<br>

    4. One key has always one numeric Id (and vice versa), so they are

    always added or removed together<br>

    <br>

    Since keys can be of any length I don't want to store them more than

    once.<br>

    <br>

    <blockquote type="cite"

cite="mid:CAGrdgiU8rV9b9r+ZsVWo+pef=yBjZ_hCvEVW=MUYPTWhohbAjA@mail.gmail.com">

      <div dir="ltr">

        <div class="gmail_quote"><br>

        </div>

        <div class="gmail_quote">One advantage of using something like a

          gb_trees for the above is that you only have the key once in

          the process heap and everything else will be pointers. It is

          also possible to use the above scheme with gb_trees. But then

          again, with ETS you can do key lookup from any process,

          whereas it has to factor through the tree owner with gb_trees.</div>

      </div>

    </blockquote>

    <br>

    gb_tree would not prevent from having to store the key twice (once

    as the key for gb_tree and once as the value). Not sure why you

    mention gb_tree here?<br>

    <br>

    <blockquote type="cite"

cite="mid:CAGrdgiU8rV9b9r+ZsVWo+pef=yBjZ_hCvEVW=MUYPTWhohbAjA@mail.gmail.com">

      <div dir="ltr">

        <div class="gmail_quote"><br>

        </div>

        <div class="gmail_quote">There is also the risk your problem is

          entirely too large to fit in memory at all, but we don't yet

          know the size of your N and how large of a machine you are

          willing to scale to, so it is hard to do any napkin math on

          the size here.</div>

      </div>

    </blockquote>

    <br>

    It's part of a bigger problem and I don't want to be getting into

    describing it in its entirety. Essentially I want to design a sorted

    index of terms that can hold billions of entries. For that I want it

    to be distributed among multiple processes, each process holding a

    part of the whole index. For now I am just looking into a data

    structure suitable for implementing one of those processes. ETS

    would be a valid solution if not that it's opaque - I don't have

    control over how the data is stored. Let's say that I am

    investigating alternatives.<br>

    <br>

    <blockquote type="cite"

cite="mid:CAGrdgiU8rV9b9r+ZsVWo+pef=yBjZ_hCvEVW=MUYPTWhohbAjA@mail.gmail.com">

      <div dir="ltr">

        <div class="gmail_quote"><br>

        </div>

        <div class="gmail_quote">In short, we need more info if we are

          to come up with a better solution :)<br>

        </div>

      </div>

    </blockquote>

    <br>

    That's the whole point. How would you implement an RDF/triple

    database in Erlang? :)<br>

    <br>

    GrzegorzJ<br>

    <br>

  </body>

</html>