[erlang-questions] Re: Conceptual questions on key-value databases for RDBMs users
Edmond Begumisa
ebegumisa@REDACTED
Tue Nov 9 14:38:37 CET 2010
Hello Rudolph,
Some comments...
On Tue, 09 Nov 2010 20:43:14 +1100, Rudolph van Graan
<rvg@REDACTED> wrote:
>
>> One of the core concepts in relational systems is that of
>> enforced integrity constraints, via primary keys and foreign keys.
>> If you don't have any integrity constraints that you care to tell the
>> data base about, you probably don't need a relational system.
>
> This was exactly what I meant with "invariance" earlier. If you have
> (strong) requirements for invariance in your model - an RDBMS is almost
> the only solution as integrity constraints etc are all first class
> concepts. If you don't have a requirement for invariance (as in Edmund's
> example):
>
In many cases (at least for a migrating RDBMS user), invariance is
*usually* indeed very important. And as you pointed out earlier there are
ways of faking this in the emerging class of NoSQL/kv-db, the absence of
which means tougher checks on code that's writing data (the db is unlikely
to do this for you.)
I've found that when one is making the transition (from SQL-RDBMS to
NoSQL/Kv-db), the benefits of easier mapping of db-object to
application-object normally out-weight the loss of things like
constraints. I guess you have to trust more that programmers know what
they are doing and won't write bad data! Some people cringe when hearing
this, but I've found it's as bad as it sounds!
However, I think as this newer class of DB start to attract more and more
RDBMS migrants, they will have to figure out a ways of solving these
issues. Some have already stared (CouchDB has some concept of data
validation.)
>> It is important to understand that SQL is not a good example of a
>> relational system.
>
> Yes - SQL is a functional language where you state the solution in terms
> of relational concepts and ask the RDMBS to solve it and present you
> with an answer. Some of the Non-SQL systems uses search terms "Name =
> 'John'" or jscript for the same purpose. QLC is also an example (in the
> case of mnesia).
>
I'd go further and say SQL has evolved into a primary API for accessing
todays RDBMSs. You can do much more that just query for data. You can
design/redesign the db, add/remove indicies. Many vendors add elaborate
extensions like Microsoft's Data Shaping for hierarchical data. In many
ways SQL is like the non-standard standard shell language for many
databases.
>> I think a big reason kv-stores are winning over a lot of us long-time
>> RDBMSs users is they allow us to model
>> things-that-have-things-inside-them in the database much closer to how
>> they are modeled in our applications. Orders/receipts/invoices with
>> their items, users with their permissions, all these map nicely in
>> kv-stores from db to application. This allows us to model relationships
>> only when WE REALLY WANT RELATIONSHIPS (this receipt belongs to that
>> invoice). That fact alone won me over and I've never looked back.
>
> However, it is only simple to store things this way (the example of
> Orders/Receipts/Invoices with items "inside" them),
I disagree, I store things this way not because it's simple for the
database but because that's how my applications understand those things
(BTW, those were *REAL* examples I was using orders/receipts/invoices from
real code in development. Code previously written against SQL-RDBMSs and
being re-writing against CouchDB. The latter is proving to be a much more
pleasant experience!)
I've seen many SQL-RDBMS developers try to achieve a similar end-result
using object-relational mapping. You know, so when they create an instance
of a user class in say Java, a new user is automatically added the users
table in the db and corresponding entries are made in the permissions
table. I've never agreed with that way of doing things but it *does*
illustrate that there is a disconnect between how developers want to
organise data and how SQL-RDBMSs want them organise data.
> if your only interest in the data is that the "outer" or container
> object encapsulates the items.
This isn't true. At least not with the class of kv-database I'm referring
to. CouchDB (and I assume Mnesia) allows you to access the 'inner' objects
(to whatever depth) inside your queries. You can very easily access the
items inside an invoice.
> Typically you want to read or write the whole thing in one operation. In
> real life, (and in my experience), you will pretty soon find that
> someone wants to know how many orders during the last 60 days included
> item X with quantities larger than say 6.
>
Easy. I do that sort of thing everywhere (no accounting tool would be very
helpful if it couldn't do those kind of complex queries).
I don't know about query engines for other kv-dbs like Mnesia, but CouchDB
handles that sort of thing very well. Just create a view that gathers the
inner data according to the criteria you are looking for (map step) then
do the necessary tallying (reduce step).
> If your design decision was to store this whole thing (the order and its
> items) as one big document (my term for it), the only way to retrieve
> this data is to literally open up every order, filter out the items you
> want and aggregate them. The only way to make this fast is to avoid
> reading every single document and processing it over and over.
Couch is very smart at ensuring that any lengthy read/map-reduce
operations are cached in an internal data structure (B-tree I think) and
tries it's best to keep the cache update-to-date. So you normally don't
have to worry too much about the whole vs partial read/write operations --
you just get on with business of writing your application using whatever
structures make sense to you. But if the query is one that is rarely
executed and lengthly, you can run it at startup so the next time it runs
it's faster.
This is one of the things I mean when I say we have to make clear the
class of kv-store that the original poster should look into. An RDBMS user
looking for a suitable alternative isn't looking for just a plain
kv-store. You're normally looking for something that has a nice well
thought-out query engine on top and has figured a lot of these issues out.
The Couch team have done an *excellent* job here.
> And to do this optimisation, you need an index, or more likely several
> indices - on dates, on items types, etc.
Correct. For couch's case, when you're creating a query (a view in couch
lingo) what you're essentially doing at the map step is asking couch to
create an index for you. Couch tries it's best to be very smart at how it
handles/optimises that index.
> Indices require that you understand what is inside your "document" (in
> this case line items).
Correct. You can look at couch's map step as providing the query engine an
understanding of the documents being stored, including any contained or
deeply hierarchical* data.
*YAY! I've _ALWAYS_ wanted this and tried desperately to simulate it in
SQL-RDBMSs with varying degrees of success.
> By definition, this implies a relationship - orders have among other
> things - lines. Completely independent of the fact that you are storing
> the items inside the document/order.
>
Okay, I see what you're saying here. Indeed...
http://wiki.apache.org/couchdb/EntityRelationship
... you can view embedded data as related data. I hadn't thought of it
that way.
Let me make my argument clear by referring to what SQL-RDBMS folk tend to
think of as a relationship -- that is -- a constraint based on a
foreign-key. The kv-db equivalent of this would be a member of the
document that is merely a key pointing to another document rather than
meaningful data in it's own right. Now, using *that* definition...
For Couch's case, the map-reduce paradigm can be effectively used so that
you *DON'T HAVE TO CREATE RELATIONSHIPS (of the external key variety) IF
YOU DON'T WANT THEM*
This is my major argument against SQL-RDMBSs and in favor of NoSQL/kv-dbs
like Couch -- give me containers when I want containers and give me
relationships when I specifically want relationships.
I don't want my orders to *relate* to lines elsewhere, this is not how my
application looks at orders. I want orders to *contain* items. On the
other-hand, I want invoices to *relate* to orders. Only in the latter case
does it make sense for me to have and foreign-key/id reference stored in
the orders. I want data in my db to look like data in my application.
The db should fit my application, my application shouldn't have to fit the
db. I always found I would invest considerable time with SQL-RDBMSs, first
trying to force the db to understand how I want my data organised, then
giving up and just adapting the code of my app to think like the db -- in
terms of tables and foreign keys.
I've found NoSQL dbs like Couch treat me more like a grown up. Give me the
freedom to define and enforce my own rules in a way that makes sense to me
not in some generic way the *supposedly* adapts to every situation. Sure,
the RDBMS style may adapt to a wide array of situations, but from my
experience, this is never without pain.
> So as a summary from my side - all data has some sort of structure, be
> it words within documents, or line items within orders. You can
> represent this any way you want.
>
True, but with most SQL-RDBMSs, things can get particularly hard. Aside
from my relationship rant, a lot of data just doesn't want to be stored in
tables, and when you force it, exponential complexity arises.
> In the distant past we wrote all the items on a single piece of paper
> called an order. It was all on one physical page. The page contained all
> the information. For the same reason it is difficult to query pieces of
> paper (you need to either index them or summarise them in another way),
> in the same way it is difficult to query data with implied relations
> stored in a single "thing" (blob/object/values).
>
> - It is very difficult to enforce invariance in KV stores
I agree. And constraints in general are painful.
> - It is very difficult to index KV stores
This depends. For couch, map-reduce is a form of indexing -- a very
effective and surprisingly flexible one at that. Doors that are normally
closed in SQL-RDBMS world are suddenly opened wide in Couch.
> - It is hard work to query KV stores.
I disagree. I haven't found a query for *my* data yet that I cannot
express in a couch view. I'd go further and say I've found it easier than
SQL because I can organise data in a way that makes more sense to my
application. This somehow always translates to easier query construction
for the application domain.
Perhaps it's a question of learning curve. SQL-RDBMS users are so used to
thinking in terms of SELECTS and JOINS, but that doesn't make SQL easy! It
just makes it easy once you know how. I think the same applies to many
kv-dbs.
> - It is trivial to read from or write into KV stores
> - It is hard to read from or write to database (drivers, SQL, ...)
> - RDMBS systems are hard to scale
I agree with all.
> - KV stores scale easily
Largely true. But sometimes it's not as easy as they promise. Some things
work well locally and break when you replicate. And you only realise this
when you try and scale out. There is some element of false advertising
when NoSQL folk promise simple scaling, there should be a big red
disclaimer attached.
- Edmond -
> Rudolph van Graan
>
>
> On Nov 9, 2010, at 1:59 AM, Richard O'Keefe wrote:
>
>>
>> On 8/11/2010, at 4:06 AM, Steve Davis wrote:
>>> It appears to me that this discussion is another expression of the
>>> 'strong vs weak/dynamic vs static type' discussion.
>>>
>>> ...it makes me suspect that an imperative and strongly-typed language
>>> paradigm has been a very strong motivator in the evolution of SQL
>>> databases; and perhaps the popularity of NoSQL/NotSQL is an expression/
>>> outcome of the rise of recent trends in programming language uptake.
>>
>> You *cannot* call the types in classic SQL "strong".
>> Numbers, strings, and byte strings for everything is what Joe is
>> complaining
>> of and he is right to do so. Encoding something as a string or blob
>> basically
>> results in the data base itself having no clue about the structure or
>> meaning
>> of the data.
>>
>> It is important to understand that SQL is not a good example of a
>> relational
>> system. A move away from SQL *could* be a move towards relational!
>> such as
>> One of the core concepts in relational systems is that of
>> enforced integrity constraints, via primary keys and foreign keys.
>> If you don't have any integrity constraints that you care to tell the
>> data base about, you probably don't need a relational system.
>>
>>
>> ________________________________________________________________
>> erlang-questions (at) erlang.org mailing list.
>> See http://www.erlang.org/faq.html
>> To unsubscribe; mailto:erlang-questions-unsubscribe@REDACTED
>>
>
--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
More information about the erlang-questions
mailing list