<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<br>
<div class="moz-cite-prefix">On 25/09/2017 15:48, Jesper Louis
Andersen wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAGrdgiU8rV9b9r+ZsVWo+pef=yBjZ_hCvEVW=MUYPTWhohbAjA@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote">
<div dir="ltr">On Mon, Sep 25, 2017 at 8:35 AM Grzegorz Junka
&lt;<a href="mailto:list1@gjunka.com" moz-do-not-send="true">list1@gjunka.com</a>&gt;
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">
<p>Fair enough, I missed that you can set ordered_set in
ETS, but it still won't work with maps. However,
even if I use ETS, the Key is still stored twice, isn't it?
(Once as the key for ordered_set, once as the value when
the Id is the key.)<br>
</p>
</div>
<div text="#000000" bgcolor="#FFFFFF"> <br>
</div>
</blockquote>
</div>
<div class="gmail_quote"><br>
</div>
<div class="gmail_quote">It depends. If it is a large binary
(larger than 64 bytes) it will go on the binary heap once,
unless you form it again and again in your code. Otherwise it
will take up twice the space.</div>
</div>
</blockquote>
<br>
I don't think it's that simple. If ordered_set uses a B+Tree, then
the key would be split into segments to annotate nodes of the tree,
while the value would be left untouched. Binaries could potentially be
reused by referencing segments of the same binary, so whether binaries
are actually reused depends on the implementation of
ordered_set. Also, with ETS the data must be copied between the
table and the process. It's possible that the VM will not
actually copy a binary but will instead create another reference
to it in the process. But all of this applies only to binaries; for
any other Erlang term, what I wrote earlier still holds. I am not sure
I want to rely on a solution with so many unknowns.<br>
<br>
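For reference, the two-table ETS layout under discussion could look
roughly like the sketch below. The module and function names
(ets_index_sketch, new/0, add/3, id_of/2, key_of/2) are made up for
illustration; only the ets calls themselves are real. Each insert
copies the key term into the table, so a non-binary key ends up
stored in both tables.<br>
<pre>
```erlang
-module(ets_index_sketch).
-export([new/0, add/3, id_of/2, key_of/2]).

%% Hypothetical layout: one ordered_set keyed by the term,
%% one plain set keyed by the numeric Id.
new() ->
    ByKey = ets:new(by_key, [ordered_set]),
    ById = ets:new(by_id, [set]),
    {ByKey, ById}.

%% The key term is written into both tables, i.e. it is copied twice.
add({ByKey, ById}, Key, Id) ->
    true = ets:insert(ByKey, {Key, Id}),
    true = ets:insert(ById, {Id, Key}),
    ok.

%% Key -> Id lookup through the ordered_set table.
id_of({ByKey, _ById}, Key) ->
    case ets:lookup(ByKey, Key) of
        [{_, Id}] -> {ok, Id};
        [] -> error
    end.

%% Id -> Key lookup through the set table.
key_of({_ByKey, ById}, Id) ->
    case ets:lookup(ById, Id) of
        [{_, Key}] -> {ok, Key};
        [] -> error
    end.
```
</pre>
Both lookups return a copy of the stored tuple on the heap of the
calling process.<br>
<br>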
<blockquote type="cite"
cite="mid:CAGrdgiU8rV9b9r+ZsVWo+pef=yBjZ_hCvEVW=MUYPTWhohbAjA@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote"><br>
</div>
<div class="gmail_quote">We still don't know what the underlying
problem statement is. This makes it harder to solve because
whenever we come up with a solution, a new constraint is added
and we have to adapt.</div>
</div>
</blockquote>
<br>
I did state all the constraints I have:<br>
<br>
In short, I have keys, which may be any Erlang terms, and numerical
Ids assigned to those terms. Keys must be sorted. Numerical Ids
increase monotonically. The following lookups should then be
efficient:<br>
<br>
1. Given a key, quickly find its numeric Id<br>
2. Given a numeric Id, quickly get back the key<br>
<br>
The following conditions should also be met:<br>
<br>
3. The same key should always be assigned the same numeric Id<br>
4. Each key always has exactly one numeric Id (and vice versa), so they are
always added or removed together<br>
<br>
Since keys can be of any length, I don't want to store them more than
once.<br>
<br>
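To make constraints 1 to 4 concrete, here is a minimal in-process
sketch of the interface I have in mind. It is only an illustration:
the module name term_index_sketch, its functions, and the choice of
gb_trees plus a map are assumptions, not a settled design.<br>
<pre>
```erlang
-module(term_index_sketch).
-export([new/0, add/2, id_of/2, key_of/2, remove/2, sorted_keys/1]).

%% Hypothetical state: {NextId, gb_tree Key -> Id, map Id -> Key}.
new() ->
    {1, gb_trees:empty(), #{}}.

%% Constraint 3: an existing key keeps its Id; a new key gets the next Id.
add(Key, {Next, ByKey, ById} = Index) ->
    case gb_trees:lookup(Key, ByKey) of
        {value, Id} ->
            {Id, Index};
        none ->
            NewByKey = gb_trees:insert(Key, Next, ByKey),
            {Next, {Next + 1, NewByKey, ById#{Next => Key}}}
    end.

%% Constraint 1: Key -> Id.
id_of(Key, {_, ByKey, _}) ->
    case gb_trees:lookup(Key, ByKey) of
        {value, Id} -> {ok, Id};
        none -> error
    end.

%% Constraint 2: Id -> Key.
key_of(Id, {_, _, ById}) ->
    maps:find(Id, ById).

%% Constraint 4: key and Id are always removed together.
remove(Key, {Next, ByKey, ById} = Index) ->
    case gb_trees:lookup(Key, ByKey) of
        {value, Id} ->
            {Next, gb_trees:delete(Key, ByKey), maps:remove(Id, ById)};
        none ->
            Index
    end.

%% Keys come back in Erlang term order.
sorted_keys({_, ByKey, _}) ->
    gb_trees:keys(ByKey).
```
</pre>
Ids are never reused after a removal in this sketch, which keeps them
monotonically increasing.<br>
<br>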
<blockquote type="cite"
cite="mid:CAGrdgiU8rV9b9r+ZsVWo+pef=yBjZ_hCvEVW=MUYPTWhohbAjA@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote"><br>
</div>
<div class="gmail_quote">One advantage of using something like a
gb_trees for the above is that you only have the key once in
the process heap and everything else will be pointers. It is
also possible to use the above scheme with gb_trees. But then
again, with ETS you can do key lookup from any process,
whereas it has to factor through the tree owner with gb_trees.</div>
</div>
</blockquote>
<br>
gb_trees would not avoid having to store the key twice (once
as the key in the tree and once as the value). Why do you
mention gb_trees here?<br>
<br>
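One way to check the "only pointers" claim inside a single process is
to compare erts_debug:size/1, which counts shared subterms once, with
erts_debug:flat_size/1, which counts them as if every reference were
a separate copy. The module sharing_sketch and the 1000-element list
key are invented for this experiment, and erts_debug is a debugging
module rather than a stable API, so treat this as a rough probe
only.<br>
<pre>
```erlang
-module(sharing_sketch).
-export([sizes/0]).

%% Builds one key term, references it from a gb_tree (as key) and from
%% an {Id, Key} tuple (as value), and reports both size measures in words.
sizes() ->
    Key = lists:seq(1, 1000),
    Tree = gb_trees:insert(Key, 1, gb_trees:empty()),
    Both = {Tree, {1, Key}},
    {erts_debug:size(Both), erts_debug:flat_size(Both)}.
```
</pre>
If the first number is smaller than the second, the key is shared on
the heap. That sharing is not preserved when the term is sent to
another process or inserted into ETS, since the term is copied
there.<br>
<br>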
<blockquote type="cite"
cite="mid:CAGrdgiU8rV9b9r+ZsVWo+pef=yBjZ_hCvEVW=MUYPTWhohbAjA@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote"><br>
</div>
<div class="gmail_quote">There is also the risk your problem is
entirely too large to fit in memory at all, but we don't yet
know the size of your N and how large of a machine you are
willing to scale to, so it is hard to do any napkin math on
the size here.</div>
</div>
</blockquote>
<br>
It's part of a bigger problem, and I don't want to get into
describing it in its entirety. Essentially, I want to design a sorted
index of terms that can hold billions of entries. For that I want it
to be distributed among multiple processes, each process holding a
part of the whole index. For now I am just looking for a data
structure suitable for implementing one of those processes. ETS
would be a valid solution if it weren't opaque: I have no
control over how the data is stored. Let's say that I am
investigating alternatives.<br>
<br>
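As a rough picture of the distribution part, a key could be routed to
the process holding its range with something like the sketch below.
Range partitioning, the module name shard_route_sketch, and the
{UpperBound, Shard} list are all assumptions made for illustration; I
have not settled on how the index would actually be split.<br>
<pre>
```erlang
-module(shard_route_sketch).
-export([route/2]).

%% Boundaries is a non-empty list of {UpperBound, Shard} pairs sorted by
%% UpperBound in Erlang term order. A key goes to the first shard whose
%% upper bound is not below it; the last shard takes everything else.
%% An empty list crashes with function_clause.
route(Key, [{Upper, Shard} | Rest]) ->
    case Rest =:= [] orelse Upper >= Key of
        true -> Shard;
        false -> route(Key, Rest)
    end.
```
</pre>
Shard would typically be the pid or registered name of the process
holding that part of the index.<br>
<br>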
<blockquote type="cite"
cite="mid:CAGrdgiU8rV9b9r+ZsVWo+pef=yBjZ_hCvEVW=MUYPTWhohbAjA@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote"><br>
</div>
<div class="gmail_quote">In short, we need more info if we are
to come up with a better solution :)<br>
</div>
</div>
</blockquote>
<br>
That's the whole point. How would you implement an RDF/triple
database in Erlang? :)<br>
<br>
GrzegorzJ<br>
<br>
</body>
</html>