[erlang-questions] Tuples referencing each other problem

Mon Sep 25 22:53:08 CEST 2017

On Mon, Sep 25, 2017 at 9:16 PM Grzegorz Junka <list1@REDACTED> wrote:

>
> gb_tree would not prevent from having to store the key twice (once as the
> key for gb_tree and once as the value). Not sure why you mention gb_tree
> here?
>
>
The reason I mention this is because if we have:

Key = T, % for some arbitrary term T
Id = Horizon, % for some horizon value
Tree2 = gb_trees:enter(Key, Id, Tree1),
Tree3 = gb_trees:enter(Id, Key, Tree2),

then the `Key` is immutable and thus shared. It will in practice be of
pointer size (4-8 bytes). This only holds if you are in a gb_trees setting
because if we start going to ETS for this, then we obviously might lose the
sharing. Perhaps it does work if you insert multiple entries at once.

That's the whole point. How would you implement an RDF/triple database in
> Erlang? :)
>
>
The obvious candidate to beat is Postgres with an appropriately designed
table structure and indexes. If we can't beat this, we are not even in
game. It is also fairly easy to write, so first, we should use that as a
baseline implementation.

If our RDF store is fairly small and can fit in memory, I'd probably just
store one ETS table doing a bidrectional ID<->Key mapping as above. And
then, depending on query needs store those ID's as {{spo, Subject,
Predicate}, Object} or such in ETS. This should allow fairly efficient
query, as most of the data resides in memory and lookup is going to be
fast. Some data waste is expected in these solutions. It might be
beneficial to just ignore the ID mapping for a start and do it naively.
This also establishes a model which can be run in unison with other models
via QuickCheck so we can check for agreement with a simpler model.

Going to billions of RDF triples will break the above scheme. I'd probably
do like the Cayley datastore in Go: store data in eleveldb and steal the
Cayley-format. In this format you store everything in one leveldb database
where different parts of the storage are mapped by different prefixes (IDs
are <<"z", SHA1:160/integer>> for instance. To run fast SPO, OSP, POS ...
indexed hashes are stored for these mappings as well. In addition a horizon
and history is tracked for each quad so we can answer a query in the past
faithfully if needed. Sharding/Batching might come in handy here at a later
point.

Leveled trees such as the one leveldb uses tend to be quite efficient
compared to B+-trees in many situations. It might be in this case too.
Because Cayley uses unique SHA1 hashes for data, we obtain that keys with
the same subject will be iterable because they will have the same prefix in
that part of the data store. We essentially pay with writes up-front to
make the lookups faster. Also note that leveldb will naturally keep often
used data in memory if tuned.

TL;DR: either data fits in memory or it doesn't :P
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20170925/72321ad2/attachment.htm>