[erlang-questions] Tuples referencing each other problem

Grzegorz Junka <>
Tue Sep 26 00:25:05 CEST 2017

On 25/09/2017 20:53, Jesper Louis Andersen wrote:
> On Mon, Sep 25, 2017 at 9:16 PM Grzegorz Junka < 
> <mailto:>> wrote:
>     gb_tree would not prevent from having to store the key twice (once
>     as the key for gb_tree and once as the value). Not sure why you
>     mention gb_tree here?
> The reason I mention this is because if we have:
> Key = T, % for some arbitrary term T
> Id = Horizon, % for some horizon value
> Tree2 = gb_trees:enter(Key, Id, Tree1),
> Tree3 = gb_trees:enter(Id, Key, Tree2),
> then the `Key` is immutable and thus shared. It will in practice be of 
> pointer size (4-8 bytes). This only holds if you are in a gb_trees 
> setting because if we start going to ETS for this, then we obviously 
> might lose the sharing. Perhaps it does work if you insert multiple 
> entries at once.
>     That's the whole point. How would you implement an RDF/triple
>     database in Erlang? :)
> The obvious candidate to beat is Postgres with an appropriately 
> designed table structure and indexes. If we can't beat this, we are 
> not even in game. It is also fairly easy to write, so first, we should 
> use that as a baseline implementation.
> If our RDF store is fairly small and can fit in memory, I'd probably 
> just store one ETS table doing a bidirectional ID<->Key mapping as 
> above. And then, depending on query needs store those ID's as {{spo, 
> Subject, Predicate}, Object} or such in ETS. This should allow fairly 
> efficient query, as most of the data resides in memory and lookup is 
> going to be fast. Some data waste is expected in these solutions. It 
> might be beneficial to just ignore the ID mapping for a start and do 
> it naively. This also establishes a model which can be run in unison 
> with other models via QuickCheck so we can check for agreement with a 
> simpler model.
> Going to billions of RDF triples will break the above scheme. I'd 
> probably do like the Cayley datastore in Go: store data in eleveldb 
> and steal the Cayley-format. In this format you store everything in 
> one leveldb database where different parts of the storage are mapped 
> by different prefixes (IDs are <<"z", SHA1:160/integer>>, for instance). 
> To run fast SPO, OSP, POS ... indexed hashes are stored for these 
> mappings as well. In addition a horizon and history is tracked for 
> each quad so we can answer a query in the past faithfully if needed. 
> Sharding/Batching might come in handy here at a later point.
> Leveled trees such as the one leveldb uses tend to be quite efficient 
> compared to B+-trees in many situations. It might be in this case too. 
> Because Cayley uses unique SHA1 hashes for data, we obtain that keys 
> with the same subject will be iterable because they will have the same 
> prefix in that part of the data store. We essentially pay with writes 
> up-front to make the lookups faster. Also note that leveldb will 
> naturally keep often used data in memory if tuned.
> TL;DR: either data fits in memory or it doesn't :P
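
A minimal sketch of the in-memory layout described above — one ETS table holding a bidirectional Id <-> Key mapping, plus an ordered_set table keyed on {spo, S, P, O} id-tuples so entries with the same subject/predicate prefix sort together. Module and function names here are made up for illustration, and the horizon counter and collision handling are deliberately naive:

```erlang
-module(rdf_ets_sketch).
-export([new/0, intern/2, spo_insert/4, spo_lookup/3]).

new() ->
    Map = ets:new(id_key_map, [set, public]),       % holds {Key,Id} and {Id,Key}
    Spo = ets:new(spo_index, [ordered_set, public]),% holds {{spo,SId,PId,OId}}
    {Map, Spo}.

%% Return the Id for Key, assigning the next integer if it is new.
%% (A real store would keep the counter separately; integer keys could in
%% principle collide with integer terms, which this sketch ignores.)
intern({Map, _Spo}, Key) ->
    case ets:lookup(Map, Key) of
        [{Key, Id}] -> Id;
        [] ->
            Id = ets:info(Map, size) div 2,   % two entries per interned term
            true = ets:insert(Map, [{Key, Id}, {Id, Key}]),
            Id
    end.

spo_insert(Db = {_Map, Spo}, S, P, O) ->
    true = ets:insert(Spo, {{spo, intern(Db, S), intern(Db, P), intern(Db, O)}}).

%% Return every object Key stored under Subject/Predicate.
spo_lookup(Db = {Map, Spo}, S, P) ->
    Matches = ets:match(Spo, {{spo, intern(Db, S), intern(Db, P), '$1'}}),
    [Key || [Id] <- Matches, {_, Key} <- ets:lookup(Map, Id)].
```

Because each term is interned once and only small integer ids are repeated in the index tuples, the key is stored a single time regardless of how many triples mention it.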

Yeah, there are dozens of possible solutions. Bear in mind that if you 
implement RDF on top of Postgres then you need to implement SPARQL on 
top of SQL. This is no longer an Erlang database. You can implement a 
Riak- or Couch-like database on top of Postgres. The question is, why? 
The main reason for using Erlang is to be able to distribute the 
database across multiple cores and nodes. Postgres as an SQL database 
scales very well, but only up to a point. Beyond a terabyte of data or 
so it starts breaking down. You can extend the performance with some 
tricks, but that's it.

It's not so much about finding a solution that is good enough (Postgres, 
ETS) as about designing an algorithm that can store and query triples in 
a way that scales to thousands of nodes. Erlang is perfect for that 
because you can start by implementing an algorithm that works with a few 
thousand processes on multiple CPU cores on one computer. If that works, 
the same algorithm can transparently be extended to run those processes 
across multiple nodes. And if that works, the algorithm can probably run 
millions of processes on those nodes.
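
One way to sketch that idea, under the assumption that triples are sharded by hashing the subject: the same phash2-based routing works unchanged whether the worker pids live on one node or many. Everything below (module name, message protocol) is hypothetical:

```erlang
-module(rdf_shard_sketch).
-export([start/1, insert/2, lookup/2]).

%% Spawn N worker processes, each owning one shard of the triples.
start(N) ->
    list_to_tuple([spawn(fun() -> loop([]) end) || _ <- lists:seq(1, N)]).

%% Route a subject to its shard; oblivious to local vs. remote pids.
route(Shards, Subject) ->
    element(erlang:phash2(Subject, tuple_size(Shards)) + 1, Shards).

insert(Shards, {S, _P, _O} = Triple) ->
    route(Shards, S) ! {insert, Triple},
    ok.

lookup(Shards, S) ->
    route(Shards, S) ! {lookup, S, self()},
    receive {triples, Ts} -> Ts end.

%% Each shard keeps its triples in a plain list; newest first.
loop(Triples) ->
    receive
        {insert, T} ->
            loop([T | Triples]);
        {lookup, S, From} ->
            From ! {triples, [T || {S1, _, _} = T <- Triples, S1 =:= S]},
            loop(Triples)
    end.
```

Replacing spawn/1 with spawn/4 on a remote node is the only change needed to spread the same shards across a cluster, which is the transparency argued for above.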

You can compare this problem to a choice between a single big and 
complex solution (Postgres/SQL or Cray/mainframe) and many small 
distributed and cooperating nodes (Riak/NoSQL or a mesh of 
interconnected PCs).

RDF is by design a way of representing data that lends itself well to 
distribution across nodes (distributing table cells rather than whole 
columns or tables). In such an environment the biggest problem, in my 
opinion, is handling the LIMIT and ORDER query modifiers efficiently. 
Imagine having to sort a billion RDF triples across a thousand nodes 
just to return the 10 triples with the smallest values/objects.
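
The standard way around a global sort is a top-K merge: each node sorts only its own partition and returns at most K objects, and the coordinator merges those small sorted lists. A sketch of the coordinator side (the function name is illustrative, and the per-node lists stand in for remote calls):

```erlang
-module(rdf_topk_sketch).
-export([topk/2]).

%% Partitions is a list of per-node object lists, unsorted.
%% Each node contributes only its K smallest; merging those
%% candidate lists is enough to recover the global K smallest.
topk(Partitions, K) ->
    Locals = [lists:sublist(lists:sort(P), K) || P <- Partitions],
    lists:sublist(lists:merge(Locals), K).
```

With a thousand nodes and K = 10, the coordinator merges at most 10,000 values instead of sorting a billion triples, though ORDER on computed expressions or with OFFSET deep into the result set remains expensive.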

