[erlang-questions] Tuples referencing each other problem

Jesper Louis Andersen jesper.louis.andersen@REDACTED
Tue Sep 26 19:06:09 CEST 2017


The reason I mention Postgres is that it takes only a couple of days to
build a useful solution on top of it. Most meta-data storage I've seen
easily fits within 1 terabyte of memory, which tends to be enough for
99% of all use cases.

A distributed solution is going to be measured in weeks or months of
building time. If you want to go that route, I think LevelDB on
individual nodes, riak_core for handling a hash ring, and some of the
ideas from Facebook's TAO paper for handling secondary key garbage
collection could make a rather scalable solution. This could probably
scale into the petabyte range if you add enough machines to the ring.
But these are Google/Facebook data sizes.

On Tue, Sep 26, 2017 at 12:25 AM Grzegorz Junka <list1@REDACTED> wrote:

>
> On 25/09/2017 20:53, Jesper Louis Andersen wrote:
>
> On Mon, Sep 25, 2017 at 9:16 PM Grzegorz Junka <list1@REDACTED> wrote:
>
>>
>> gb_tree would not prevent us from having to store the key twice (once as
>> the key for gb_tree and once as the value). Not sure why you mention
>> gb_tree here?
>>
>>
> The reason I mention this is because if we have:
>
> Key = T, % for some arbitrary term T
> Id = Horizon, % for some horizon value
> Tree2 = gb_trees:enter(Key, Id, Tree1),
> Tree3 = gb_trees:enter(Id, Key, Tree2),
>
> then `Key` is immutable and thus shared: the second occurrence costs only
> a pointer (4-8 bytes) in practice. This only holds in a gb_trees setting,
> because if we move to ETS for this, we may well lose the sharing. Perhaps
> it is preserved if you insert multiple entries at once.
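>
> To convince yourself of the sharing, you can compare erts_debug:size/1,
> which counts shared subterms once, against erts_debug:flat_size/1, which
> counts them at every occurrence. A throwaway shell sketch:
>
> Key = {some, arbitrary, term},
> T1 = gb_trees:enter(Key, 1, gb_trees:empty()),
> T2 = gb_trees:enter(1, Key, T1),
> true = erts_debug:size(T2) < erts_debug:flat_size(T2). % Key stored once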
>
>> That's the whole point. How would you implement an RDF/triple database in
>> Erlang? :)
>>
>>
> The obvious candidate to beat is Postgres with an appropriately designed
> table structure and indexes. If we can't beat this, we are not even in the
> game. It is also fairly easy to write, so we should use it as a baseline
> implementation first.
>
> If our RDF store is fairly small and can fit in memory, I'd probably just
> store one ETS table doing a bidirectional ID<->Key mapping as above. And
> then, depending on query needs, store those IDs as {{spo, Subject,
> Predicate}, Object} or such in ETS. This should allow fairly efficient
> queries, as most of the data resides in memory and lookup is going to be
> fast. Some data waste is expected in these solutions. It might be
> beneficial to just ignore the ID mapping for a start and do it naively.
> This also establishes a model which can be run in unison with other models
> via QuickCheck, so we can check for agreement with a simpler model.
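>
> A minimal sketch of that layout (table and function names are made up for
> illustration, and the spo table is a bag so a subject/predicate pair can
> hold several objects):
>
> new() ->
>     %% One table maps Id <-> Key in both directions; integer Ids and
>     %% non-integer keys are assumed, so the two directions cannot collide.
>     Ids = ets:new(ids, [set]),
>     Spo = ets:new(spo, [bag]),  % {{spo, S, P}, O} index
>     {Ids, Spo, 0}.
>
> intern(Ids, Key, Horizon) ->
>     case ets:lookup(Ids, Key) of
>         [{_, Id}] -> {Id, Horizon};
>         [] ->
>             true = ets:insert(Ids, [{Key, Horizon}, {Horizon, Key}]),
>             {Horizon, Horizon + 1}
>     end.
>
> add({Ids, Spo, H0}, S, P, O) ->
>     {SId, H1} = intern(Ids, S, H0),
>     {PId, H2} = intern(Ids, P, H1),
>     {OId, H3} = intern(Ids, O, H2),
>     true = ets:insert(Spo, {{spo, SId, PId}, OId}),
>     {Ids, Spo, H3}.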
>
> Going to billions of RDF triples will break the above scheme. I'd probably
> do like the Cayley datastore in Go: store data in eleveldb and steal the
> Cayley format. In this format you store everything in one leveldb database
> where different parts of the storage are mapped by different prefixes (IDs
> are <<"z", SHA1:160/integer>>, for instance). To make SPO, OSP, POS, ...
> queries fast, indexed keys are stored for these mappings as well. In
> addition, a horizon and a history are tracked for each quad, so we can
> faithfully answer a query about the past if needed. Sharding/batching
> might come in handy here at a later point.
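>
> In Erlang that key scheme could look roughly like this (a sketch only;
> the exact Cayley byte layout differs, and the eleveldb calls are from
> memory):
>
> value_id(Term) ->
>     %% 160-bit SHA1 of the term, tagged with the "z" value prefix
>     <<"z", (crypto:hash(sha, term_to_binary(Term)))/binary>>.
>
> index_keys(S, P, O) ->
>     {SId, PId, OId} = {value_id(S), value_id(P), value_id(O)},
>     [<<"spo", SId/binary, PId/binary, OId/binary>>,
>      <<"osp", OId/binary, SId/binary, PId/binary>>,
>      <<"pos", PId/binary, OId/binary, SId/binary>>].
>
> put_triple(Db, S, P, O, Horizon) ->
>     Val = term_to_binary(#{horizon => Horizon}),
>     [ok = eleveldb:put(Db, K, Val, []) || K <- index_keys(S, P, O)],
>     ok.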
>
> Leveled trees such as the one leveldb uses tend to be quite efficient
> compared to B+-trees in many situations. That might hold in this case too.
> Because Cayley uses unique SHA1 hashes for data, keys with the same
> subject are iterable: they share a prefix in that part of the data store.
> We essentially pay with writes up-front to make the lookups faster. Also
> note that leveldb will naturally keep often-used data in memory if tuned.
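>
> Iterating everything with a given subject then becomes a fold that
> starts at the prefix and stops as soon as the prefix no longer matches
> (again a sketch; I believe eleveldb:fold/4 accepts a first_key option):
>
> fold_subject(Db, S, Fun, Acc0) ->
>     Prefix = <<"spo", (value_id(S))/binary>>,
>     PLen = byte_size(Prefix),
>     F = fun({K, V}, Acc) ->
>             case K of
>                 <<Prefix:PLen/binary, _/binary>> -> Fun(K, V, Acc);
>                 _ -> throw({done, Acc})  % we have left the prefix range
>             end
>         end,
>     try eleveldb:fold(Db, F, Acc0, [{first_key, Prefix}])
>     catch throw:{done, Acc1} -> Acc1
>     end.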
>
> TL;DR: either data fits in memory or it doesn't :P
>
>
> Yeah, there are dozens of possible solutions. Bear in mind that if you
> implement RDF on top of Postgres, then you need to implement SPARQL on top
> of SQL. That is no longer an Erlang database. You could implement a Riak-
> or Couch-like database on top of Postgres. The question is, why? The main
> reason for using Erlang is to be able to distribute the database across
> multiple cores and nodes. Postgres as an SQL database scales very well,
> but only up to a point. After a terabyte of data or so it starts breaking
> down. You can extend the performance with some tricks, but that's it.
>
> It's not so much about finding a solution that is good enough (Postgres,
> ETS) as about designing an algorithm that can store and query triples in
> a way that scales to thousands of nodes. Erlang is perfect for that,
> because you can start by implementing an algorithm that works with a few
> thousand processes on multiple CPU cores on one computer. If that works,
> the same algorithm can transparently be extended to run those processes
> across multiple nodes. And if that works, the algorithm can probably run
> millions of processes on those nodes.
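>
> The heart of such a design can be as small as a hash that picks the
> owning process for each triple (a sketch with made-up names; pids work
> the same whether they are local or on a remote node):
>
> owner(Triple, Workers) when is_tuple(Workers) ->
>     N = erlang:phash2(Triple, tuple_size(Workers)),
>     element(N + 1, Workers).  % a pid, local or remote
>
> store(Triple, Workers) ->
>     owner(Triple, Workers) ! {store, Triple},
>     ok.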
>
> You can compare this problem to the choice between one big and complex
> solution (Postgres/SQL or a Cray/mainframe) and many small, distributed,
> cooperating nodes (Riak/NoSQL or a mesh of interconnected PCs).
>
> RDF is by design a way of representing data that lends itself well to
> distribution across nodes (distributing table cells rather than whole
> columns or tables). In such an environment the biggest problem, in my
> opinion, is handling the LIMIT and ORDER query modifiers efficiently.
> Imagine having to sort a billion RDF triples across a thousand nodes just
> to return the 10 triples with the smallest values/objects.
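>
> Even the usual scatter-gather trick only softens that problem: each node
> sorts and returns its local top K and the coordinator merges, but every
> node still has to scan its whole share of the data (a sketch; QueryFun is
> a made-up callback returning one node's K smallest triples in order):
>
> top_k(QueryFun, Nodes, K) ->
>     Sorted = [QueryFun(Node, K) || Node <- Nodes],
>     lists:sublist(lists:merge(Sorted), K).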
>
> GrzegorzJ
>
>
>