[erlang-questions] Tuples referencing each other problem
Grzegorz Junka
list1@REDACTED
Mon Sep 25 21:16:14 CEST 2017
On 25/09/2017 15:48, Jesper Louis Andersen wrote:
> On Mon, Sep 25, 2017 at 8:35 AM Grzegorz Junka <list1@REDACTED
> <mailto:list1@REDACTED>> wrote:
>
> Fair enough, I missed that you can set ordered_set in ETS, but
> nevertheless it won't work with maps. However, even if I use ETS
> the Key is still stored twice, isn't? (once as the key for
> ordered_set, once as the value when the Id is the key).
>
>
>
> It depends. If it is a large binary (larger than 64 characters) it
> will go on the binary heap once unless you form it again and again in
> your code. Otherwise it will take up the double amount of space.
I don't think it's that easy. If ordered_set is using B+Tree then the
key would be split into segments to annotate nodes of the tree. But the
value would be left untouched. Binaries could potentially be reused by
referencing segments of the same binary. So if binaries would be reused
or not depends on the actual implementation of ordered_set. Also, with
ETS the data must be copied between the database and the process. It's
possible that the VM will not actually copy the binary but instead will
create another reference to it in the process. But this is all valid
only for binaries. For any other Erlang term what I wrote earlier would
hold. I am not sure I want to rely on a solution with so many unknowns.
>
> We still don't know what the underlying problem statement is. This
> makes it harder to solve because whenever we come up with a solution,
> a new constraint is added and we have to adapt.
I did state all the constraints I have:
In short, I have keys, which may be any Erlang terms, and numerical Ids
assigned to those terms. Keys must be sorted. Numerical Ids are
increasing monotonically. Then the following lookups should be efficient:
1. Having a key quickly search for its numeric Id
2. Having a numeric Id quickly get back the key
Also the following conditions should be met:
3. The same key should be always assigned the same numeric Id
4. One key has always one numeric Id (and vice versa), so they are
always added or removed together
Since keys can be of any length I don't want to store them more than once.
>
> One advantage of using something like a gb_trees for the above is that
> you only have the key once in the process heap and everything else
> will be pointers. It is also possible to use the above scheme with
> gb_trees. But then again, with ETS you can do key lookup from any
> process, whereas it has to factor through the tree owner with gb_trees.
gb_tree would not prevent from having to store the key twice (once as
the key for gb_tree and once as the value). Not sure why you mention
gb_tree here?
>
> There is also the risk your problem is entirely too large to fit in
> memory at all, but we don't yet know the size of your N and how large
> of a machine you are willing to scale to, so it is hard to do any
> napkin math on the size here.
It's part of a bigger problem and I don't want to be getting into
describing it in its entirety. Essentially I want to design a sorted
index of terms that can hold billions of entries. For that I want it to
be distributed among multiple processes, each process holding a part of
the whole index. For now I am just looking into a data structure
suitable for implementing one of those processes. ETS would be a valid
solution if not that it's opaque - I don't have control over how the
data is stored. Let's say that I am investigating alternatives.
>
> In short, we need more info if we are to come up with a better solution :)
That's the whole point. How would you implement an RDF/triple database
in Erlang? :)
GrzegorzJ
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20170925/0d3441f8/attachment.htm>
More information about the erlang-questions
mailing list