[erlang-questions] Re: Conceptual questions on key-value databases for RDBMs users
Edmond Begumisa
ebegumisa@REDACTED
Fri Nov 12 17:29:32 CET 2010
> That's quite a revelation, for someone (me) whose exposure to NoSQL
> (KV stores) is purely theoretical so far. To me that promise of
> "scaling easily" was the single biggest motivation, i.e. scaling out
> in a large distributed cluster environment.
A more accurate description would be "scale easIER" than other camps as
opposed to "scale easILY". This is not just semantics. Scaling databases
is an inherently difficult problem. The NoSQL camp haven't just discovered
some new miracle cure.
Scalability can can be made easier or harder by the design decisions of
_BOTH_ the implementor and user of the db (and there are always trade
offs.) It cannot be made easy (with no trade offs.)
> Could you elaborate on
> that point, and share some instances, scenarios, examples of what may
> be the scaling pitfalls ?
My issue with the promise of easy scaling is that it is usually worded to
imply that you can take an application that is *designed* for local data
access to a NoSQL db, sprinkle some NoSQL scalability pixie-dust on it,
and suddenly it will run on google's world-wide network. There's usually
no consideration given to the fact that scalability needs to be designed
into the application *itself* from the word go, not just be made available
by the db and there are trade offs involved.
To elaborate...
----
Caveat: The following is based on *my* experience as an SQL-RDBMS ->
NoSQL/kv-db migrant who has tried a number of dbs in both camps. I'll
deliberately not name names here to avoid flames. You can easily
investigate further and imply which stores fall where (according to lay
me). I speak for myself as a db user and make the rather large assumption
that I *may* be speaking for many others like me. I am NOT an expert in
distrusted systems.
----
My understanding of scaling in kv-stores/dbs/document-stores (I don't even
know what to call these anymore) is rooted in three common
requirements/wishes that an RDBMS migrant goes to bed dreaming about and
decides to make the NoSQL boat-trip to the promised land. Usually, you
look into scaling out when you want either...
(A) DISTRIBUTION (don't know if this is the correct term, table-splitting
perhaps)
What you want: The same class of documents/values stored but not
duplicated across several databases (usually on different machines in the
same location). e.g I've got a gazillion orders and they won't fit on one
machine or render the database very ineffective when querying them, so I
want them distributed round-robin or some other way using a cluster of
dbs/machines.
What you expect: When you do this, your application doesn't need to know.
When updating, the db will decide where things should go (say, based on
some configuration setting). Queries just get executed across the
different databases and the result is given to you. Your code doesn't have
to change (I think Mnesia can do this with fragmented tables?)
What you get: Very few give you the full Mnesia style. Some give you none.
Some give you more. Many CANNOT provide this 'table-splitting' feature AT
ALL and instead offer clone replication (below) which is different. Worst
case, you have to do extra work to expressly decide where to save data and
run the queries across the different databases then manually combine the
results. Your code normally has to change. Your code normally gets more
complicated.
(B) REPLICATION
What you want: the same class of documents/values stored and
duplicated+synced across several databases (usually on different machines
in different locations) e.g I've got several branches of my business and I
want a receipt that appears at my Melbourne branch to appear in my Sydney
head office so I can export it to the central general ledger.
What you don't expect: That you are ignorant on how the CAP Theorem and
strong vs eventual consistency will impact on how you write your code.
What you get: Code that's written in absence of the informed design
decisions required to make eventual consistency work for you. This code
has to be re-thought and re-written when you finally decide to replicate.
This could be very large chunks of your application especially for someone
coming from the SQL-RDBMS world who is used to ACID/release consistency
and awakens to the fact that several assumptions in your application
suddenly break when one moves to BASE/eventual consistency. ACID(ic) local
+ BASE(ic) remote = big an impact on your application design+coding. You
can't just paste this into your code later.
(C) MORE BEEF
What you want: queries take forever, your server is way too stressed.
What you expect: Throw more hardware at the problem (more cores, memory,
etc). The more you throw the better things get.
What you get: This seems to be the one area where NoSQL dbs consistently
deliver. Some exhibit almost linear improvement. I haven't scientifically
benchmarked this myself, but I have noticed with rudimentary tests using
different machines I have at my disposal. I understand though that you do
arrive to a saturation point where the law of diminishing returns sets in
(so I've read), someone else would be better at explaining why. That is
something people neglect to mention though.
IN THEIR DEFENSE:
Expectations are probably not the fault of the NoSQL database. It's just
something that one is led into believing by the NoSQL chatter on the
internet. That things just magically scale with little effort. That all
the well-documented and well-researched problems with distrusted
programming/storage are just "somehow taken care of" by the brilliance of
NoSQL. The interpretation is that the application programmer doesn't need
to do anything. Doesn't need rethink local vs distributed/replicated data
for his application. I've found this to be misleading.
It's certainly less effort than the alternatives, and certainly works
*far* *far* better than alternatives once you get it working, BUT effort
and planning *is* required. These things are probably obvious for someone
used to working in that world so are rarely mentioned.
There's no scalability check-box that you just flip. Things have to be
well thought out and consequences have to be planned for and code written
in awareness and many times even to implement some aspects. SQL-RDBMS
migrants are rarely prepared for this. They are used to the db making all
the decisions for them. "Dig in then just scale up later when you need to"
is unlikely work.
Light reading up on elementary CAP theorem, eventual consistency and the
associated trade-offs is essential for an SQL-RDBMS -> NoSQL migrant who
is unlikely to have even considered/needed to consider these things in the
past. These will greatly affect how you do your part and harmonise it with
how the db in question does it's part to get the best (but _never_ ideal)
situation.
THE DISCLAIMER:
Should read: "EasIER scaling BUT the onus is on you to ensure you
understand the impact of the CAP theorem and eventual consistency many
have and investigate any extra work that may be required for the database
in question BEFORE you design let alone code your application".
For the NoSQL camp it might be be like "Duh!" But IMO, these things are
not obvious for the RDBMS migrant who has been promised easy scaling from
a bullet list of features. This renders scalability not easy, just more
achievable, with trade-offs.
- Edmond -
--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
On Fri, 12 Nov 2010 14:21:22 +1100, Icarus Alive <icarus.alive@REDACTED>
wrote:
> On Tue, Nov 9, 2010 at 7:08 PM, Edmond Begumisa
> <ebegumisa@REDACTED> wrote:
>>
>>> - KV stores scale easily
>>
>> Largely true. But sometimes it's not as easy as they promise. Some
>> things
>> work well locally and break when you replicate. And you only realise
>> this
>> when you try and scale out. There is some element of false advertising
>> when
>> NoSQL folk promise simple scaling, there should be a big red disclaimer
>> attached.
>
> That's quite a revelation, for someone (me) whose exposure to NoSQL
> (KV stores) is purely theoretical so far. To me that promise of
> "scaling easily" was the single biggest motivation, i.e. scaling out
> in a large distributed cluster environment. Could you elaborate on
> that point, and share some instances, scenarios, examples of what may
> be the scaling pitfalls ?
>
> ________________________________________________________________
> erlang-questions (at) erlang.org mailing list.
> See http://www.erlang.org/faq.html
> To unsubscribe; mailto:erlang-questions-unsubscribe@REDACTED
>
--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
More information about the erlang-questions
mailing list