[erlang-bugs] Re: Mnesia deadlocks (?) while loading table

Mon Jan 25 14:00:55 CET 2010

Here is a patch:

diff --git a/lib/mnesia/src/mnesia_tm.erl b/lib/mnesia/src/mnesia_tm.erl
index 3f3a10a..5a2407d 100644
--- a/lib/mnesia/src/mnesia_tm.erl
+++ b/lib/mnesia/src/mnesia_tm.erl
@@ -1388,7 +1388,9 @@ multi_commit(sync_sym_trans, Tid, CR, Store) ->
     {WaitFor, Local} = ask_commit(sync_sym_trans, Tid, CR, DiscNs, RamNs),
     {Outcome, []} = rec_all(WaitFor, Tid, do_commit, []),
     ?eval_debug_fun({?MODULE, multi_commit_sym_sync},
-		    [{tid, Tid}, {outcome, Outcome}]),
+		    [{tid, Tid}, {outcome, Outcome}]),
+    [?ets_insert(Store, {waiting_for_commit_ack, Node}) ||
+	Node <- WaitFor],
     rpc:abcast(DiscNs -- [node()], ?MODULE, {Tid, Outcome}),
     rpc:abcast(RamNs -- [node()], ?MODULE, {Tid, Outcome}),
     case Outcome of
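
For anyone reading the hunk without the source at hand: the two added lines record one {waiting_for_commit_ack, Node} entry per node in WaitFor, before the outcome is broadcast. Below is a minimal standalone sketch of that bookkeeping. It assumes that Store is an ETS table of type bag and that ?ets_insert is a thin wrapper around ets:insert/2; the module name, table name and node names are made up for illustration and are not part of mnesia_tm.

```erlang
-module(commit_ack_sketch).
-export([mark_waiting/2, demo/0]).

%% Record one {waiting_for_commit_ack, Node} entry per node we still
%% expect an answer from. Store is assumed to be an ETS bag, so several
%% entries with the same key can coexist.
mark_waiting(Store, WaitFor) ->
    [ets:insert(Store, {waiting_for_commit_ack, Node}) || Node <- WaitFor],
    ok.

%% Hypothetical usage: create a throwaway bag table, mark two nodes as
%% pending, and return the sorted list of pending entries.
demo() ->
    Store = ets:new(trans_store_sketch, [bag, public]),
    ok = mark_waiting(Store, ['test1@host', 'test2@host']),
    Pending = lists:sort(ets:lookup(Store, waiting_for_commit_ack)),
    true = ets:delete(Store),
    Pending.
```

The list comprehension is used only for its side effects, as in the patch; the resulting list of 'true' values is discarded.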

/Dan

On Mon, Jan 25, 2010 at 8:39 AM, Dan Gudmundsson <dangud@REDACTED> wrote:
> Thanks, I'll have a look at it.
> I'm on a hunt for a deadlock I can't reproduce...
>
> /Dan
>
> On Sun, Jan 24, 2010 at 5:48 PM, Igor Ribeiro Sucupira <igorrs@REDACTED> wrote:
>> It's not as easy as I thought to reproduce the problem with only one table.
>> I am attaching code that creates more tables and executes transactions
>> with two of them. This is not a necessary condition to reproduce the
>> problem, but it helps a lot.
>>
>> Open 2 terminals and start one node on each:
>> erl -sname test1
>> erl -sname test2
>> Create everything from the first node (substitute igorrs with your
>> server's name):
>> (test1@REDACTED)1> load_dl:start(test2@REDACTED).
>> Run this restart loop on the second node:
>> (test2@REDACTED)1> load_dl:restart_forever().
>>
>> At some point, the second node will stop printing and the last message
>> will be "Started. Waiting for tables." It will be hung forever, not
>> being able to load the tables. At the same time, the current
>> transaction on the first node will also be hung forever (I got some
>> info - see below - by running some RPCs from a third node).
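The attached load_dl module is not reproduced in this message. As a rough illustration of the restart loop described above, here is a hypothetical sketch (not the attached code) that restarts Mnesia repeatedly and uses a finite timeout in mnesia:wait_for_tables/2, so that a hang is reported instead of blocking forever. The module name, function name and return tuples are assumptions of mine.

```erlang
-module(restart_sketch).
-export([restart_until_stuck/2]).

%% Restart Mnesia over and over. Return as soon as the given tables fail
%% to load within TimeoutMs milliseconds, reporting the round number.
restart_until_stuck(Tabs, TimeoutMs) ->
    restart_until_stuck(Tabs, TimeoutMs, 1).

restart_until_stuck(Tabs, TimeoutMs, Round) ->
    _ = mnesia:stop(),
    ok = mnesia:start(),
    io:format("Started. Waiting for tables.~n"),
    case mnesia:wait_for_tables(Tabs, TimeoutMs) of
        ok ->
            io:format("Round ~p: tables loaded.~n", [Round]),
            restart_until_stuck(Tabs, TimeoutMs, Round + 1);
        {timeout, Missing} ->
            %% These tables did not become available in time.
            {stuck, Round, Missing};
        {error, Reason} ->
            {error, Round, Reason}
    end.
```

Calling restart_sketch:restart_until_stuck([test], 60000) on the second node would then return a {stuck, Round, Tables} tuple once a load takes longer than a minute, which leaves the shell free for inspecting locks and transactions.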
>>
>> Igor.
>>
>> On Sun, Jan 24, 2010 at 5:51 AM, Igor Ribeiro Sucupira <igorrs@REDACTED> wrote:
>>> I have been able to reproduce this a few times on a 64-bit Ubuntu, a
>>> 64-bit CentOS and a 32-bit Ubuntu, running R13B02.
>>>
>>> Open 2 terminals and start one node on each:
>>> erl -sname test1
>>> erl -sname test2
>>> Create the schema from the first node (substitute ijaba2 with your
>>> server's name):
>>> (test1@REDACTED)1> ok = mnesia:create_schema([node(), test2@REDACTED]),
>>> mnesia:start().
>>> Start Mnesia also on the second node:
>>> (test2@REDACTED)1> mnesia:start().
>>> Create a test table from the first node and start writing to it:
>>> (test1@REDACTED)2> mnesia:create_table(test, [{disc_only_copies,
>>> mnesia:system_info(running_db_nodes)}]).
>>> (test1@REDACTED)3> W = fun(F, N) -> mnesia:sync_transaction(fun
>>> mnesia:write/1, [{test, N, N}]), F(F, N + 1) end, W(W, 1).
>>> Restart Mnesia on the second node and wait for the table to be loaded:
>>> (test2@REDACTED)2> mnesia:stop(), ok = mnesia:start().
>>> (test2@REDACTED)3> mnesia:wait_for_tables([test], infinity).
>>>
>>>
>>> For some runs of this experiment, the table will never load.
>>> It seems the current writer transaction on the first node is waiting
>>> for a commit, while holding a write lock:
>>> (test3@REDACTED)1> rpc:call(test1@REDACTED, mnesia, system_info, [held_locks]).
>>> [{{test,14691},write,{tid,14696,<6217.41.0>}}]
>>> (test3@REDACTED)3> rpc:call(test1@REDACTED, erlang, process_info,
>>> [list_to_pid("<6217.41.0>"), current_function]).
>>> {current_function,{mnesia_tm,rec_all,4}}
>>>
>>> The second node is waiting for the table to be received:
>>> (test3@REDACTED)4> rpc:call(test2@REDACTED, mnesia, system_info,
>>> [held_locks]).
>>> [{{schema,test},read,{tid,14009,<6358.199.0>}}]
>>> (test3@REDACTED)5> rpc:call(test2@REDACTED, erlang, process_info,
>>> [list_to_pid("<6358.199.0>"), current_function]).
>>> {current_function,{mnesia_loader,wait_on_load_complete,1}}
>>>
>>> And there's a third transaction going on (the table sender?):
>>> (test3@REDACTED)7> rpc:call(test1@REDACTED, mnesia, system_info, [transactions]).
>>> [{14696,<6217.41.0>,coordinator},
>>>  {14697,<6217.176.0>,coordinator}]
>>> (test3@REDACTED)8> rpc:call(test1@REDACTED, erlang, process_info,
>>> [list_to_pid("<6217.176.0>"), current_function]).
>>> {current_function,{timer,sleep,1}}
>>>
>>> None of the 3 transactions makes any progress, so maybe there's a
>>> circular wait here (deadlock).
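The manual RPCs above can be wrapped in a small helper to run from the third node. This is only a sketch: it assumes the held_locks entries have the shape {Oid, LockKind, {tid, Serial, Pid}} seen in the output above, and the module and function names are hypothetical. Because rpc:call/4 returns real pids, no list_to_pid/1 round trip is needed.

```erlang
-module(lock_probe).
-export([holders/1, holder_functions/1]).

%% Fetch the locks currently held on Node.
holders(Node) ->
    rpc:call(Node, mnesia, system_info, [held_locks]).

%% For every lock held on Node, report what the owning process is
%% currently executing. Entries that do not match the assumed shape
%% are skipped by the generator pattern.
holder_functions(Node) ->
    case holders(Node) of
        Locks when is_list(Locks) ->
            [{Oid, Kind, Pid, current_function(Pid)}
             || {Oid, Kind, {tid, _Serial, Pid}} <- Locks];
        Other ->
            {error, Other}
    end.

%% Ask the node that owns Pid for the function it is executing.
current_function(Pid) ->
    rpc:call(node(Pid), erlang, process_info, [Pid, current_function]).
```

Running lock_probe:holder_functions/1 against each of the two nodes gathers in one call per node the same information as the held_locks and process_info calls shown above.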
>>>
>>> I hope you can reproduce it easily. It seems to depend on whether the
>>> first node notices test2's restart before or after test2 starts
>>> loading the table. But maybe I'm wrong and there's another race
>>> condition.
>>>
>>> Igor.
>>>
>>> --
>>> "The secret of joy in work is contained in one word - excellence. To
>>> know how to do something well is to enjoy it." - Pearl S. Buck.
>>>