[erlang-questions] Mnesia vs. timeouts

Thu Sep 2 17:08:57 CEST 2010

2010/9/2 Håkan Mattsson <hm@REDACTED>:
> cupira <igorrs@REDACTED> wrote:
>> Hello.
>>
>> I've recently had a problem with a pool of servers running Mnesia.
>> The clients reach the pool through a load balancer and the tables
>> (most of which are disc_only_copies) are fragmented and replicated
>> throughout the servers. This means that Mnesia, on each server
>> handling a request, will usually need to contact other servers.
>
> This is a rather unfortunate setup that does not scale. To make it
> scale, you would need to ensure that each request accesses only
> local replicas.

By replicas, do you mean Mnesia clones? Since I'm using replication
also for data safety, I can't see a "local" scenario that would make
sense. Even if you have two nodes on the same server, each one writing
a clone to a different external storage, that would still be
non-ideal. Besides, this scenario wouldn't be free of the problems of
cloning to two different machines, since one of the external storages
might still be overloaded some day.
That said... replication was not really the problem in the situation I
described, since the vast majority of the operations were dirty reads
(not writes).

If what you meant was actually fragments (not replicas), you would be
saying that I shouldn't run distributed activities through Mnesia (?).
Actually, I can avoid distributed activities for my simple dirty
reads: just check with Mnesia which nodes hold the data and then
contact one of them via RPC. I'm planning that, indeed. But if you
tell people they can't run distributed activities, Mnesia starts to
become less useful.
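
A minimal sketch of that RPC approach, under stated assumptions: the
module name `remote_read` and the function `dirty_read_at_holder/3` are
hypothetical, and `FragTab` is assumed to be the already-resolved name of
the table (or fragment) that holds the key. Mapping a key to its fragment
is not shown here.

```erlang
-module(remote_read).
-export([dirty_read_at_holder/3]).

%% Hypothetical helper: dirty-read Key from FragTab on a node that holds
%% an active replica, waiting at most TimeoutMs milliseconds for a remote
%% reply. FragTab is assumed to be resolved by the caller.
dirty_read_at_holder(FragTab, Key, TimeoutMs) ->
    %% where_to_read is the node Mnesia itself would read this table from,
    %% or the atom 'nowhere' if no active replica is known.
    case mnesia:table_info(FragTab, where_to_read) of
        nowhere ->
            {error, no_active_replica};
        Node when Node =:= node() ->
            %% A local replica exists, so no RPC is needed.
            {ok, mnesia:dirty_read(FragTab, Key)};
        Node ->
            %% rpc:call/5 returns {badrpc, Reason} on failure, including
            %% {badrpc, timeout} when TimeoutMs expires.
            case rpc:call(Node, mnesia, dirty_read, [FragTab, Key], TimeoutMs) of
                {badrpc, Reason} -> {error, Reason};
                Records -> {ok, Records}
            end
    end.
```

Note that the timeout here only bounds how long the caller waits. As far
as I know, it does not by itself guarantee that the read stops running on
the remote node.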

A third possibility: you meant that the clients should also reach
Mnesia through the exact server that holds all the necessary data. In
this scenario, Mnesia would be almost worthless, so I don't think you
mean that.

I'll post again today, to discuss some solutions.

Thanks.
Igor.

> Mnesia has a concept of foreign keys
> that can be used to co-allocate fragments from different tables.
>
>> The problem happened when the number of requests from the clients was
>> suddenly multiplied by 100 for a few seconds. The kernel (system)
>> CPU time immediately maxed out on every server and the clients
>> began to time out and later retry the failed requests (which worsened
>> the problem, of course).
>> From the logs in the servers, I noticed some dirty reads taking more
>> than 10 minutes to finish (way after the clients had given up on those
>> operations).
>>
>> My question is: how can I set a timeout for every Mnesia activity
>> (which may be distributed) and make sure that, after that time, no
>> operation related to that activity will be left hanging on any node?
>> By just killing the process that called mnesia:activity, am I
>> guaranteed to get that result?
>
> Mnesia has no such timeout. I would not recommend killing the Mnesia
> related processes, even if Mnesia is designed to cope with that. The core
> problem seems to be that your application does not scale. Killing processes
> in panic does not solve the real problem.
>
> /Håkan

-- 
"The secret of joy in work is contained in one word - excellence. To
know how to do something well is to enjoy it." - Pearl S. Buck.