[erlang-questions] gproc : 2 nodes out of sync behaviour.

Morgan Segalis msegalis@REDACTED
Sun Jul 1 17:01:46 CEST 2012


Gproc may make life more interesting, but right now, I certainly know that gproc made my life easier, thanks to you :-)

(Sorry for my late answer, but I wanted to think about the solution before posting it)

When you say: "since gen_leader didn't use to have a way to handle netsplits."

1- Does this mean that gen_leader handles netsplits now?
2- If so, gproc_dist would only need a way to know when a netsplit has happened, right?

What about this solution? (assuming 1 & 2 above are true)

=============================== nodemonitor.erl ===================================
-module(nodemonitor).
-behaviour(gen_server).

-record(nodemonitor, {nodes=none}).

-export([start_link/0, init/1, handle_call/3, handle_cast/2, terminate/2, code_change/3, handle_info/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

init([]) ->
    net_kernel:monitor_nodes(true, [{node_type, visible}]),
    {ok, #nodemonitor{nodes=dict:new()}}.

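handle_call(_Request, _From, NM) ->
    %% Reply instead of returning noreply, so a caller never blocks until timeout.
    {reply, {error, unknown_call}, NM}.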

handle_cast(_, NM) ->
    {noreply, NM}.

handle_info({nodeup, Node, _Info}, NM) ->
    case dict:find(Node, NM#nodemonitor.nodes) of
	{ok, disconnected} ->
	    %% Let gproc_dist know about this netsplit.
	    %% (A node that was restarted also shows up as nodedown then nodeup.)
	    io:fwrite("netsplit occurred: ~p~n", [Node]);
	{ok, connected} ->
	    %% nodeup for a node we already consider connected
	    io:fwrite("Error occurred: ~p~n", [Node]);
	error ->
	    io:fwrite("New node detected: ~p~n", [Node])
    end,
    Dict = dict:store(Node, connected, NM#nodemonitor.nodes),
    {noreply, NM#nodemonitor{nodes=Dict}};
handle_info({nodedown, Node, _Info}, NM) ->
    Dict = dict:store(Node, disconnected, NM#nodemonitor.nodes),
    {noreply, NM#nodemonitor{nodes=Dict}};
handle_info(_Other, NM) ->
    %% Ignore unrelated messages instead of crashing on them.
    {noreply, NM}.

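code_change(_OldVsn, NM, _Extra) ->
    %% gen_server expects {ok, NewState} here, not {noreply, State}.
    {ok, NM}.

terminate(_Reason, _NM) ->
    %% Accept any shutdown reason, not only normal.
    ok.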
-----------------------------------------------------------------------------------------------------------------------------------------

I'm certainly not claiming that you haven't already thought of this solution, so there must be something I'm missing.
Do you think it would be possible to do something about it?

3 - If 1 & 2 are not true, would it be possible, in your opinion, to stop & start gproc and re-register every value so that both nodes are in sync again?

On 1 Jul 2012, at 13:49, Ulf Wiger wrote:

> It's a feature of gproc, carefully crafted to make life more interesting. ;-)
> 
> There is no resynch after netsplit in gproc, since gen_leader didn't use to have a way to handle netsplits. Still, there is no hook to inform the callback (gproc_dist) about what's happened.
> 
> One way to deal with this is to set -kernel dist_auto_connect false, and add a "backdoor ping" (e.g. over UDP). If you get a ping from a known node that's not in the nodes() list, you have a netsplit situation. You can then select which node(s) to restart. After restart, normal synch will ensue, and since the nodes never auto-connected, you will have consistency (but quite possibly data loss, of course).
> 
> BR,
> Ulf W
> 
> Ulf Wiger, Feuerlabs, Inc.
> http://www.feuerlabs.com
> 
> On 1 Jul 2012, at 13:36, Morgan Segalis <msegalis@REDACTED> wrote:
> 
>> Hello everyone,
>> 
>> I have 2 nodes which use gproc.
>> Both are well connected to each other…
>> But sometimes (it doesn't happen often, but it does) both servers get disconnected from each other; once they are connected again, gproc is out of sync.
>> 
>> Here's what happens:
>> 1- A is connected to B.
>> 2- a new value X set by A is seen by B
>> 3- a new value Y set by B is seen by A
>> -------- they get disconnected for a second or two --------
>> 4- Clusters lose connection
>> -------- they reconnect ----------
>> 5- Clusters regain connection
>> 6- the old value X set by A is no longer seen by B
>> 7- the old value Y set by B is no longer seen by A
>> 8- a new value Z set by A is seen by B
>> 9- a new value V set by B is not seen by A
>> 
>> How come in "8" the new value Z set by A is seen by B, while in "9" the new value V set by B is not seen by A?
>> I know that there is a leader, which is probably B, but I can't explain why new values are not seen symmetrically.
>> What should I do to reconnect both clusters correctly, so that old and new values are seen in both clusters again?
>> 
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-questions
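
The "backdoor ping" idea from the quoted reply (a ping from a known node that is not in nodes() means a netsplit) could be sketched roughly as below. This is only a minimal sketch under assumptions: the module name netsplit_ping, the UDP port, the packet format (the sender's node name as a binary) and the OnSplit callback are all made up for illustration and are not part of gproc or gen_leader. It also assumes that the nodes are started with -kernel dist_auto_connect false and have been connected explicitly once at startup; otherwise the first ping would already look like a split.

```erlang
%% Sketch only: module name, UDP port and packet format are assumptions
%% made for this example; none of this is provided by gproc or gen_leader.
-module(netsplit_ping).
-export([start/2, is_netsplit/3]).

-define(PORT, 47110).      % arbitrary port picked for the example
-define(INTERVAL, 1000).   % ping period in milliseconds

%% Peers is a list of {Node, Host} pairs taken from static configuration.
%% OnSplit is a fun that is called with the node name when a ping arrives
%% from a known node that is not in nodes().
start(Peers, OnSplit) ->
    spawn(fun() ->
        {ok, Sock} = gen_udp:open(?PORT, [binary, {active, true}]),
        self() ! tick,
        loop(Sock, Peers, OnSplit)
    end).

loop(Sock, Peers, OnSplit) ->
    receive
        tick ->
            %% Send our node name to every peer host over plain UDP,
            %% i.e. outside of Erlang distribution.
            Me = atom_to_binary(node(), utf8),
            [gen_udp:send(Sock, Host, ?PORT, Me) || {_Node, Host} <- Peers],
            erlang:send_after(?INTERVAL, self(), tick),
            loop(Sock, Peers, OnSplit);
        {udp, Sock, _Ip, _Port, Packet} ->
            Known = [Node || {Node, _Host} <- Peers],
            case is_netsplit(Packet, Known, nodes()) of
                {true, Node} -> OnSplit(Node);
                false -> ok
            end,
            loop(Sock, Peers, OnSplit)
    end.

%% Pure check: a ping from a known node that is not currently connected
%% indicates a netsplit. Packets from unknown senders are ignored.
is_netsplit(Packet, Known, Connected) ->
    case [N || N <- Known, atom_to_binary(N, utf8) =:= Packet] of
        [Node | _] ->
            case lists:member(Node, Connected) of
                true -> false;
                false -> {true, Node}
            end;
        [] ->
            false
    end.
```

It could be started on each node with something like netsplit_ping:start([{'b@host2', "host2"}], fun(Node) -> error_logger:warning_msg("netsplit with ~p~n", [Node]) end), where the node and host names are placeholders. Note that OnSplit fires for every ping received while the split lasts, so the callback would have to decide which node(s) to restart and avoid acting twice.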