[erlang-questions] Tuples referencing each other problem

Mon Sep 25 21:16:14 CEST 2017

On 25/09/2017 15:48, Jesper Louis Andersen wrote:
> On Mon, Sep 25, 2017 at 8:35 AM Grzegorz Junka <list1@REDACTED 
> <mailto:list1@REDACTED>> wrote:
>
>     Fair enough, I missed that you can set ordered_set in ETS, but
>     nevertheless it won't work with maps. However, even if I use ETS
>     the Key is still stored twice, isn't? (once as the key for
>     ordered_set, once as the value when the Id is the key).
>
>
>
> It depends. If it is a large binary (larger than 64 characters) it 
> will go on the binary heap once unless you form it again and again in 
> your code. Otherwise it will take up the double amount of space.

I don't think it's that easy. If ordered_set is using B+Tree then the 
key would be split into segments to annotate nodes of the tree. But the 
value would be left untouched. Binaries could potentially be reused by 
referencing segments of the same binary. So if binaries would be reused 
or not depends on the actual implementation of ordered_set. Also, with 
ETS the data must be copied between the database and the process. It's 
possible that the VM will not actually copy the binary but instead will 
create another reference to it in the process. But this is all valid 
only for binaries. For any other Erlang term what I wrote earlier would 
hold. I am not sure I want to rely on a solution with so many unknowns.

>
> We still don't know what the underlying problem statement is. This 
> makes it harder to solve because whenever we come up with a solution, 
> a new constraint is added and we have to adapt.

I did state all the constraints I have:

In short, I have keys, which may be any Erlang terms, and numerical Ids 
assigned to those terms. Keys must be sorted. Numerical Ids are 
increasing monotonically. Then the following lookups should be efficient:

1. Having a key quickly search for its numeric Id
2. Having a numeric Id quickly get back the key

Also the following conditions should be met:

3. The same key should be always assigned the same numeric Id
4. One key has always one numeric Id (and vice versa), so they are 
always added or removed together

Since keys can be of any length I don't want to store them more than once.

>
> One advantage of using something like a gb_trees for the above is that 
> you only have the key once in the process heap and everything else 
> will be pointers. It is also possible to use the above scheme with 
> gb_trees. But then again, with ETS you can do key lookup from any 
> process, whereas it has to factor through the tree owner with gb_trees.

gb_tree would not prevent from having to store the key twice (once as 
the key for gb_tree and once as the value). Not sure why you mention 
gb_tree here?

>
> There is also the risk your problem is entirely too large to fit in 
> memory at all, but we don't yet know the size of your N and how large 
> of a machine you are willing to scale to, so it is hard to do any 
> napkin math on the size here.

It's part of a bigger problem and I don't want to be getting into 
describing it in its entirety. Essentially I want to design a sorted 
index of terms that can hold billions of entries. For that I want it to 
be distributed among multiple processes, each process holding a part of 
the whole index. For now I am just looking into a data structure 
suitable for implementing one of those processes. ETS would be a valid 
solution if not that it's opaque - I don't have control over how the 
data is stored. Let's say that I am investigating alternatives.

>
> In short, we need more info if we are to come up with a better solution :)

That's the whole point. How would you implement an RDF/triple database 
in Erlang? :)

GrzegorzJ

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20170925/0d3441f8/attachment.htm>