[erlang-questions] Re: Conceptual questions on key-value databases for RDBMs users

Fri Nov 12 17:29:32 CET 2010

> That's quite a revelation, for someone (me) whose exposure to NoSQL
> (KV stores) is purely theoretical so far. To me that promise of
> "scaling easily" was the single biggest motivation, i.e. scaling out
> in a large distributed cluster environment.

A more accurate description would be "scale easIER" than other camps as  
opposed to "scale easILY". This is not just semantics. Scaling databases  
is an inherently difficult problem. The NoSQL camp haven't just discovered  
some new miracle cure.

Scalability can can be made easier or harder by the design decisions of  
_BOTH_ the implementor and user of the db (and there are always trade  
offs.) It cannot be made easy (with no trade offs.)

> Could you elaborate on
> that point, and share some instances, scenarios, examples of what may
> be the scaling pitfalls ?

My issue with the promise of easy scaling is that it is usually worded to  
imply that you can take an application that is *designed* for local data  
access to a NoSQL db, sprinkle some NoSQL scalability pixie-dust on it,  
and suddenly it will run on google's world-wide network. There's usually  
no consideration given to the fact that scalability needs to be designed  
into the application *itself* from the word go, not just be made available  
by the db and there are trade offs involved.

To elaborate...

----
Caveat: The following is based on *my* experience as an SQL-RDBMS ->  
NoSQL/kv-db migrant who has tried a number of dbs in both camps. I'll  
deliberately not name names here to avoid flames. You can easily  
investigate further and imply which stores fall where (according to lay  
me). I speak for myself as a db user and make the rather large assumption  
that I *may* be speaking for many others like me. I am NOT an expert in  
distrusted systems.
----

My understanding of scaling in kv-stores/dbs/document-stores (I don't even  
know what to call these anymore) is rooted in three common  
requirements/wishes that an RDBMS migrant goes to bed dreaming about and  
decides to make the NoSQL boat-trip to the promised land. Usually, you  
look into scaling out when you want either...

(A) DISTRIBUTION (don't know if this is the correct term, table-splitting  
perhaps)

What you want: The same class of documents/values stored but not  
duplicated across several databases (usually on different machines in the  
same location). e.g I've got a gazillion orders and they won't fit on one  
machine or render the database very ineffective when querying them, so I  
want them distributed round-robin or some other way using a cluster of  
dbs/machines.

What you expect: When you do this, your application doesn't need to know.  
When updating, the db will decide where things should go (say, based on  
some configuration setting). Queries just get executed across the  
different databases and the result is given to you. Your code doesn't have  
to change (I think Mnesia can do this with fragmented tables?)

What you get: Very few give you the full Mnesia style. Some give you none.  
Some give you more. Many CANNOT provide this 'table-splitting' feature AT  
ALL and instead offer clone replication (below) which is different. Worst  
case, you have to do extra work to expressly decide where to save data and  
run the queries across the different databases then manually combine the  
results. Your code normally has to change. Your code normally gets more  
complicated.

(B) REPLICATION

What you want: the same class of documents/values stored and  
duplicated+synced across several databases (usually on different machines  
in different locations) e.g I've got several branches of my business and I  
want a receipt that appears at my Melbourne branch to appear in my Sydney  
head office so I can export it to the central general ledger.

What you don't expect: That you are ignorant on how the CAP Theorem and  
strong vs eventual consistency will impact on how you write your code.

What you get: Code that's written in absence of the informed design  
decisions required to make eventual consistency work for you. This code  
has to be re-thought and re-written when you finally decide to replicate.  
This could be very large chunks of your application especially for someone  
coming from the SQL-RDBMS world who is used to ACID/release consistency  
and awakens to the fact that several assumptions in your application  
suddenly break when one moves to BASE/eventual consistency. ACID(ic) local  
+ BASE(ic) remote = big an impact on your application design+coding. You  
can't just paste this into your code later.

(C) MORE BEEF

What you want: queries take forever, your server is way too stressed.

What you expect: Throw more hardware at the problem (more cores, memory,  
etc). The more you throw the better things get.

What you get: This seems to be the one area where NoSQL dbs consistently  
deliver. Some exhibit almost linear improvement. I haven't scientifically  
benchmarked this myself, but I have noticed with rudimentary tests using  
different machines I have at my disposal. I understand though that you do  
arrive to a saturation point where the law of diminishing returns sets in  
(so I've read), someone else would be better at explaining why. That is  
something people neglect to mention though.

IN THEIR DEFENSE:

Expectations are probably not the fault of the NoSQL database. It's just  
something that one is led into believing by the NoSQL chatter on the  
internet. That things just magically scale with little effort. That all  
the well-documented and well-researched problems with distrusted  
programming/storage are just "somehow taken care of" by the brilliance of  
NoSQL. The interpretation is that the application programmer doesn't need  
to do anything. Doesn't need rethink local vs distributed/replicated data  
for his application. I've found this to be misleading.

It's certainly less effort than the alternatives, and certainly works  
*far* *far* better than alternatives once you get it working, BUT effort  
and planning *is* required. These things are probably obvious for someone  
used to working in that world so are rarely mentioned.

There's no scalability check-box that you just flip. Things have to be  
well thought out and consequences have to be planned for and code written  
in awareness and many times even to implement some aspects. SQL-RDBMS  
migrants are rarely prepared for this. They are used to the db making all  
the decisions for them. "Dig in then just scale up later when you need to"  
is unlikely work.

Light reading up on elementary CAP theorem, eventual consistency and the  
associated trade-offs is essential for an SQL-RDBMS -> NoSQL migrant who  
is unlikely to have even considered/needed to consider these things in the  
past. These will greatly affect how you do your part and harmonise it with  
how the db in question does it's part to get the best (but _never_ ideal)  
situation.

THE DISCLAIMER:

Should read: "EasIER scaling BUT the onus is on you to ensure you  
understand the impact of the CAP theorem and eventual consistency many  
have and investigate any extra work that may be required for the database  
in question BEFORE you design let alone code your application".

For the NoSQL camp it might be be like "Duh!" But IMO, these things are  
not obvious for the RDBMS migrant who has been promised easy scaling from  
a bullet list of features. This renders scalability not easy, just more  
achievable, with trade-offs.

- Edmond -

-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/

On Fri, 12 Nov 2010 14:21:22 +1100, Icarus Alive <icarus.alive@REDACTED>  
wrote:

> On Tue, Nov 9, 2010 at 7:08 PM, Edmond Begumisa
> <ebegumisa@REDACTED> wrote:
>>
>>> - KV stores scale easily
>>
>> Largely true. But sometimes it's not as easy as they promise. Some  
>> things
>> work well locally and break when you replicate. And you only realise  
>> this
>> when you try and scale out. There is some element of false advertising  
>> when
>> NoSQL folk promise simple scaling, there should be a big red disclaimer
>> attached.
>
> That's quite a revelation, for someone (me) whose exposure to NoSQL
> (KV stores) is purely theoretical so far. To me that promise of
> "scaling easily" was the single biggest motivation, i.e. scaling out
> in a large distributed cluster environment. Could you elaborate on
> that point, and share some instances, scenarios, examples of what may
> be the scaling pitfalls ?
>
> ________________________________________________________________
> erlang-questions (at) erlang.org mailing list.
> See http://www.erlang.org/faq.html
> To unsubscribe; mailto:erlang-questions-unsubscribe@REDACTED
>

-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/