Billion-triple store

Andrae Muys andrae@REDACTED
Wed May 3 03:23:44 CEST 2006


On 03/05/2006, at 7:30 AM, Leif Johansson wrote:

>
>>
>> Well, as the lead maintainer of kowari, I would be very happy to
>> discuss any requirements you might have, and see if we can't help you.
>
> My application is a sink datastore for an enterprise message-bus. My
> message-bus is distributed and collects event-based data from a  
> variety
> of sources including syslog (from 1000s of clients), hr-systems and
> student administration systems (for identity-management), LDAP
> directories, computer-telephony integration servers and network-
> management servers. I want to collect all messages in a datastore
> normalized as RDF. The datastore can quite possibly be distributed
> and must be able to support a sustained rate of 1000s of insertions
> of small RDFs per second.
>
> How do you like them requirements :-)

Large-scale, multi-format, multi-schema information aggregation,
analysis, and inference are our primary focus.  We would be very
interested in any experiences you have using kowari in this area, as
this sort of thing is exactly why we built it.

>> Currently the largest scalability test I am aware of for kowari was
>> 500 million, but those results indicated that we hadn't reached our
>> limit yet.  One of the store layer's designers did some calculations
>> that indicate we should be able to scale to 1-2 billion without
>> difficulty; although as one of the primary developers of the query
>> layer, I am aware of some bottlenecks that are likely to interfere
>> with any queries requiring extremely large intermediate results
>> (~1e6 tuples).
>
> I looked at http://esw.w3.org/topic/TripleStoreScalability where the
> recorded claims are a bit lower... but maybe that site is out-of-date.

At http://idealliance.org/proceedings/xtech05/papers/04-02-04/ you
can read about the largest published test, which was 250M.  I am aware
of internal tests prior to Tucana's closure that went higher, but
they were never documented sufficiently to be published.  At the time
these tests were run, http://www.ontotext.com/kim/performance.html was
reporting 15M triples for Sesame, and we were having difficulty
convincing people that our claims of 250M weren't vapourware.

An added problem, of course, was our support for Jena, which simply
never scaled.  Because we supported Jena, almost everyone who
compared kowari with another triple store ignored our warnings and
just reran their Jena code against kowari.  The result was always
abysmal because, at the time, Jena was designed assuming cheap (i.e.
memory-backed) random access to triples in the store.  We have since
deprecated and removed Jena support from kowari.

>> At the same time, there are plans to address these issues and to
>> break the scalability bottlenecks that are currently preventing us
>> from reaching 1e10 and 1e11.  These include promising prototypes of
>> a new store design to improve locality and throughput, which should
>> allow us to scale comfortably to 1e10.
>
> I am quite interested in update scalability too...

Update?  The concept isn't particularly relevant to RDF.  You have
insert, read, and delete, but individual statements don't have
identity, so you can't update them in place.  This isn't a kowari
thing; it's a property of RDF inherited from description logic.
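To illustrate the point (a minimal sketch with made-up example data, not
Kowari's actual API): a triple is identified only by its value, so the
closest thing to an "update" is deleting the old statement and inserting
a new one.

```python
# Sketch: an RDF graph modelled as a plain set of
# (subject, predicate, object) tuples. Triples have no identity beyond
# their value, so there is no in-place update -- only delete + insert.
# The ex: names below are hypothetical.

graph = set()

def insert(s, p, o):
    graph.add((s, p, o))

def delete(s, p, o):
    graph.discard((s, p, o))

def update(s, p, old_o, new_o):
    """An 'update' is really a delete of one statement plus an
    insert of a different statement."""
    delete(s, p, old_o)
    insert(s, p, new_o)

insert("ex:alice", "ex:age", "30")
update("ex:alice", "ex:age", "30", "31")

assert ("ex:alice", "ex:age", "31") in graph
assert ("ex:alice", "ex:age", "30") not in graph
```

The same shape shows up in the later SPARQL 1.1 Update language, where
modification is expressed as a DELETE/INSERT pair rather than a mutation
of an existing statement.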

Andrae

-- 
Andrae Muys
andrae@REDACTED
Principal Kowari Consultant
Netymon Pty Ltd




More information about the erlang-questions mailing list