[erlang-questions] Tuples referencing each other problem

Tue Sep 26 21:41:33 CEST 2017

Thanks for your suggestions. As I mentioned, there are dozens of ways of 
building such a database. My question wasn't as much about the 
architecture as it was about the implementation of a particular 
algorithm in Erlang. We discussed the architecture only because some 
questioned the need to use such algorithm in the first place. My aim 
isn't to build something from existing elements (LevelDB/Riak/TAO), i.e. 
fit the problem to the tools I have. My aim is to design a solution that 
is best for the given problem, only then think about the tools that I 
would need to implement it. If I end up with something similar to 
LevelDB/Riak/TAO and can reuse them to simplify the implementation then 
it's even better. But I am still far away from such a discussion :-)

GrzegorzJ

On 26/09/2017 17:06, Jesper Louis Andersen wrote:
> The reason I mention Postgres is because it takes only a couple of 
> days to build a useful solution based on it. Most meta-data storage 
> I've seen easily fits within 1 Terabyte of memory, which means for 99% 
> of all use cases it tends to be enough.
>
> A distributed solution is going to measured in weeks or months to 
> build. If you want to go that route, I think LevelDB on individual 
> nodes, riak_core for handling a hash ring and some of the ideas from 
> facebook's TAO-paper for handling secondary key garbage collection 
> could be a rather scalable solution. This could probably scale into 
> the petabyte range if you add enough machines to the ring. But these 
> are Google/Facebook data sizes.
>
> On Tue, Sep 26, 2017 at 12:25 AM Grzegorz Junka <list1@REDACTED 
> <mailto:list1@REDACTED>> wrote:
>
>
>     On 25/09/2017 20:53, Jesper Louis Andersen wrote:
>>     On Mon, Sep 25, 2017 at 9:16 PM Grzegorz Junka <list1@REDACTED
>>     <mailto:list1@REDACTED>> wrote:
>>
>>
>>         gb_tree would not prevent from having to store the key twice
>>         (once as the key for gb_tree and once as the value). Not sure
>>         why you mention gb_tree here?
>>
>>
>>     The reason I mention this is because if we have:
>>
>>     Key = T, % for some arbitrary term T
>>     Id = Horizon, % for some horizon value
>>     Tree2 = gb_trees:enter(Key, Id, Tree1),
>>     Tree3 = gb_trees:enter(Id, Key, Tree2),
>>
>>     then the `Key` is immutable and thus shared. It will in practice
>>     be of pointer size (4-8 bytes). This only holds if you are in a
>>     gb_trees setting because if we start going to ETS for this, then
>>     we obviously might lose the sharing. Perhaps it does work if you
>>     insert multiple entries at once.
>>
>>         That's the whole point. How would you implement an RDF/triple
>>         database in Erlang? :)
>>
>>
>>     The obvious candidate to beat is Postgres with an appropriately
>>     designed table structure and indexes. If we can't beat this, we
>>     are not even in game. It is also fairly easy to write, so first,
>>     we should use that as a baseline implementation.
>>
>>     If our RDF store is fairly small and can fit in memory, I'd
>>     probably just store one ETS table doing a bidrectional ID<->Key
>>     mapping as above. And then, depending on query needs store those
>>     ID's as {{spo, Subject, Predicate}, Object} or such in ETS. This
>>     should allow fairly efficient query, as most of the data resides
>>     in memory and lookup is going to be fast. Some data waste is
>>     expected in these solutions. It might be beneficial to just
>>     ignore the ID mapping for a start and do it naively. This also
>>     establishes a model which can be run in unison with other models
>>     via QuickCheck so we can check for agreement with a simpler model.
>>
>>     Going to billions of RDF triples will break the above scheme. I'd
>>     probably do like the Cayley datastore in Go: store data in
>>     eleveldb and steal the Cayley-format. In this format you store
>>     everything in one leveldb database where different parts of the
>>     storage are mapped by different prefixes (IDs are <<"z",
>>     SHA1:160/integer>> for instance. To run fast SPO, OSP, POS ...
>>     indexed hashes are stored for these mappings as well. In addition
>>     a horizon and history is tracked for each quad so we can answer a
>>     query in the past faithfully if needed. Sharding/Batching might
>>     come in handy here at a later point.
>>
>>     Leveled trees such as the one leveldb uses tend to be quite
>>     efficient compared to B+-trees in many situations. It might be in
>>     this case too. Because Cayley uses unique SHA1 hashes for data,
>>     we obtain that keys with the same subject will be iterable
>>     because they will have the same prefix in that part of the data
>>     store. We essentially pay with writes up-front to make the
>>     lookups faster. Also note that leveldb will naturally keep often
>>     used data in memory if tuned.
>>
>>     TL;DR: either data fits in memory or it doesn't :P
>>
>
>     Yeah, there dozens of possible solutions. Bear in mind that if you
>     implement RDF on top of Postgres then you need to implement SPARQL
>     on top of SQL. This is no longer an Erlang database. You can
>     implement Riak or Couch like database on top of Postgres. The
>     question is, why? The main reason for using Erlang is to be able
>     distribute the database across multiple cores and nodes. Postgres
>     as an SQL database scales very well but up to some point. After a
>     terabyte of data or so it starts breaking down. You can extend the
>     performance with some tricks but that's it.
>
>     It's not as much about finding a solution that is good enough
>     (Postgres, ETS) but about designing an algorithm that allows to
>     store and query triples in a way that can scale into thousands of
>     nodes. Erlang is perfect for that because you can start with
>     implementing an algorithm that works with a few thousand of
>     processes on multiple CPU cores on one computer. If that works
>     then the same algorithm can transparently be extended to run those
>     processes across multiple nodes. And if that works, the algorithm
>     can probably run millions of processes on those nodes.
>
>     You can compare this problem to a choice between a single big and
>     complex solution (Postgres/SQL or Cray/mainframe) and many small
>     distributed and cooperating nodes (Riak/NoSQL or a mesh of
>     interconnected PCs).
>
>     RDF is by design a way of representing data that is most capable
>     of distributing across nodes (distributing table cells rather than
>     whole columns or tables). In such an environment the biggest
>     problem, in my opinion, is to efficiently handle the LIMIT and
>     ORDER query modifiers. Imagine having to sort a billion of RDF
>     triples across a thousand of nodes just to return 10 triples with
>     the least values/objects.
>
>     GrzegorzJ
>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20170926/fe0055d6/attachment.htm>