[erlang-questions] Conceptual questions on key-value databases for RDBMs users

Tue Nov 2 22:52:06 CET 2010

On Tue, Nov 2, 2010 at 9:14 PM, Silas Silva <silasdb@REDACTED> wrote:

> I have used SQL RDBMSs for some time.  I've never used any very advanced
> feature, but I know enough of it to make database applications.
>
> Nowadays, I decided it would be interesting to learn some NoSQL
> databases concepts.  So I decided to pick up some Erlang and Mnesia, its
> native key-value database.  More than scalability itself, the most
> valuable feature for me is the possibility of replication and
> synchronization between nodes.

I'll rant. Beware. SQL is a declarative language, like QLC. It is
quite domain-specific, but it has one distinct advantage: You can
query-optimize on it. Most databases that use SQL rely on a heavily
relational model in which you idealize normalized data. The advantage
is that this supports ad-hoc queries very well. If you can dream up the
appropriate SQL-query, you have the answer, though it may take some
time before it comes to you, depending on the power of optimization
and the complexity of the query.
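
To make "ad-hoc" concrete, here is a minimal sketch using Python's
built-in sqlite3 module. The tables (authors, posts), the sample rows
and the helper posts_per_author are all made up for illustration. The
point is only that nobody had to plan for this particular question
when the schema was designed; the query planner works out how to
answer it.

```python
import sqlite3

# In-memory database with a small normalized schema (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        author_id INTEGER REFERENCES authors(id),
        title TEXT
    );
    INSERT INTO authors VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO posts VALUES (1, 1, 'CAP'), (2, 1, 'QLC'), (3, 2, 'Mnesia');
""")

def posts_per_author(conn):
    """An ad-hoc question: how many posts has each author written?"""
    return conn.execute("""
        SELECT a.name, COUNT(p.id)
        FROM authors a LEFT JOIN posts p ON p.author_id = a.id
        GROUP BY a.name
        ORDER BY a.name
    """).fetchall()

print(posts_per_author(conn))
```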

Enter the web. The basic premise is that we do not want users to do
ad-hoc queries on data, partly for security and partly to keep one
user from (inadvertently) denying service to others. So many systems
backed by an RDBMS artificially lock down the allowed queries to a few
blessed ones. Now enter Google. Google has one main thing they need to
serve, which is inverted indexes for words. This is a very specific
problem with an interesting property: you can shard the keyspace of
words into multiple machines for good distribution and parallelism.
The limitation is that you just locked down ad-hoc query to specific
queries of specific data, but you can get really good query speed on
those. You also got the sharding capability and the system is not hard
to implement. Data mining can be achieved by batch-runs of map/reduce
over all data. It is in some sense slow, but if your query fits the
M/R scheme, it parallelizes easily. In a certain sense, the M/R gave
you some of the ad-hoc query capabilities back. The final key concept
is that triggers easily work on sharding models. Upon insertion, you
run hooks which can in turn update specific query indexes, drive
full-text-search engines and so on.
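
A rough sketch of the three ideas in that paragraph (sharding the
keyspace of words, insertion hooks, and a batch map/reduce pass), all
in one process for brevity. This is not how Google or any real store
does it: the class ShardedIndex, its method names and the choice of
an MD5-based shard function are assumptions of mine, and the "shards"
here are plain dictionaries rather than separate machines.

```python
import hashlib
from collections import defaultdict

class ShardedIndex:
    """Toy inverted index: word -> set of document ids, split over shards."""

    def __init__(self, n_shards=4):
        # Each shard stands in for one machine holding part of the keyspace.
        self.shards = [defaultdict(set) for _ in range(n_shards)]
        self.hooks = []

    def shard_for(self, word):
        # Stable hash so a given word always lands on the same shard.
        digest = hashlib.md5(word.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % len(self.shards)

    def on_insert(self, hook):
        # Register a trigger; it is called as hook(doc_id, text) on insert.
        self.hooks.append(hook)

    def insert(self, doc_id, text):
        for word in set(text.lower().split()):
            self.shards[self.shard_for(word)][word].add(doc_id)
        for hook in self.hooks:
            hook(doc_id, text)

    def lookup(self, word):
        # The one "blessed" query: touches exactly one shard.
        word = word.lower()
        return set(self.shards[self.shard_for(word)].get(word, ()))

    def map_reduce(self, map_fn, reduce_fn, initial):
        # Batch pass over all data: map each shard independently
        # (this is the part that parallelizes), then fold the results.
        result = initial
        for partial in [map_fn(shard) for shard in self.shards]:
            result = reduce_fn(result, partial)
        return result

idx = ShardedIndex()
seen = []
idx.on_insert(lambda doc_id, text: seen.append(doc_id))
idx.insert(1, "erlang mnesia")
idx.insert(2, "Erlang QLC")
distinct_words = idx.map_reduce(len, lambda a, b: a + b, 0)
```

The lookup is fast because it never leaves one shard, while anything
that is not a lookup by word has to go through the slow map_reduce
pass over every shard.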

Now, most web services out there have modest data storage needs. The
number of services using the RDBMS as a glorified file system is
large, to an extent which feels pervertedly sick. Such
systems never really had the need for all the niceties of an SQL
system and their queries are simple. Also, the mapping from SQL into
the language of choice is not easy, especially if said language is
object oriented.

Next, the war drums begin to play. The CAP theorem is proven and this
changes the game. Now whenever you do a database system, there is a
tradeoff which must be made. Much misunderstanding of the CAP theorem
is out there. We don't care about the halting theorem that much either,
even though it is very real and there. But it does pose some
limitations to what one can achieve.

Personally, I think the K/V stores grew out of the Google train.
Google did them because no database system was ready for that kind of
scalability. But that should not be taken to mean that you can never
make SQL, QLC or the like support sharded queries. We have more stores
of the "NoSQL" kind popping up every month, which is great for
innovation. But time will show which of these will actually fly and
which will thud to the ground.

-- 
J.