[erlang-questions] [ANN] Syn: a global process registry
Michael Truog
mjtruog@REDACTED
Tue Jul 7 22:04:01 CEST 2015
On 07/07/2015 07:25 AM, Roberto Ostinelli wrote:
> Hi Fred,
> Thank you for your input. Comments below.
>
> One of the things mentioned in your article was that because you used mostly unique device names, you didn't have to worry much about conflicts in names, and could consequently relax the consistency properties to go for eventual consistency.
>
> There are, however, no details about how this takes place. Attributes that are fun to know are:
>
> - What's the conflict resolution mechanism
> - how long does it take to detect a conflict
> - how long does it take to resolve a conflict
>
> For example, I looked at the following code: https://github.com/ostinelli/syn/blob/master/src/syn_consistency.erl#L255-L262
>
> case CallbackModule of
>     undefined ->
>         error_logger:warning_msg("Found a double process for ~s, killing it on local node ~p", [Key, node()]),
>         exit(LocalProcessPid, kill);
>     _ ->
>         spawn(fun() ->
>             error_logger:warning_msg("Found a double process for ~s, about to trigger callback on local node ~p", [Key, node()]),
>             CallbackModule:CallbackFunction(Key, LocalProcessPid)
>         end)
> end
>
> And this makes it look like it is possible for two nodes to find conflicting pids, and if they find them at the same time, both processes are killed at once. This can be worked around by setting up a function that always picks the same pid no matter who executes it (exit(max(P1,P2), kill), for example), but always killing the local pid risks having every node involved make that same decision, leaving nobody alive as soon as there's a conflict.
>
>
>
> When a node is disconnected from the cluster, the other nodes will remove from their mnesia tables all the pids (and hence the keys) that run on the disconnected node, and vice versa:
> https://github.com/ostinelli/syn/blob/master/src/syn_consistency.erl#L134
>
> This means that the disconnected node *does not* have in its mnesia replica the keys of all the other nodes, and the other nodes *do not* have in their mnesia replicas the keys of the disconnected node.
>
> If the disconnected node were to merge back in right away (i.e. with no new registrations happening), there simply wouldn't be any conflicts and everything would be merged in.
>
> In a more realistic scenario, the nodes of the cluster and the disconnected node keep registering new pids.
> If, during the net split, there's no unique key that has been used both on the disconnected node and on the rest of the cluster, then we're back to the previous scenario: everything gets merged in.
> If the same unique key has been registered both on the disconnected node and on the cluster, then we have a conflict.
>
> In this case, if you scroll a little above in the code, you'll see that all of the merge code runs inside a global lock:
> https://github.com/ostinelli/syn/blob/master/src/syn_consistency.erl#L180
>
> When one node starts the merge, the other nodes are basically waiting. The risk of having both killed is therefore non-existent. Or, I might have forgotten something (it happens!), in which case I'd be delighted to know and improve the code :)
>
> Just to give you an example of what I've been observing: 2 nodes, 1 million connected (and registered) devices, a net split of 5 minutes, fewer than 10 conflicts, resolved in less than 500ms (from the moment mnesia signalled an inconsistent database to the moment the global lock is released).
The handling of conflicts is important when classifying the system. I have seen in the code that "doubles" are purged; these are likely whenever the same name exists in two separate network partitions that are attempting to merge back together. You have used the term "eventually consistent" to basically mean "consistent until a netsplit occurs", because data is lost when separate network partitions are merged. Because a global lock is used to resolve any conflicts that exist during the merge, you are losing availability during that time period, even if it is only 500ms for 2 nodes with a decent number of processes. So your system is partition tolerant all the time, while losing both consistency and availability when a netsplit occurs.
I understand this type of system matches your use case, but I think it is important to be clear about the impact of netsplits.
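To make the conflict-handling point concrete, here is a minimal sketch of the deterministic choice suggested in the quoted text above (exit(max(P1,P2), kill)). It is not Syn's code: the module name conflict_sketch and the functions loser/2 and resolve/2 are hypothetical, and the sketch assumes that comparing node names is an acceptable tie-break for deciding which duplicate to drop.

```erlang
-module(conflict_sketch).
-export([loser/2, resolve/2]).

%% Hypothetical helper, not part of Syn: decide which of two conflicting
%% pids should be terminated. The result depends only on the two pids,
%% not on the node that evaluates it or on the argument order.
loser(PidA, PidB) when is_pid(PidA), is_pid(PidB) ->
    case {node(PidA), node(PidB)} of
        {N, N} ->
            %% Both pids live on the same node: fall back to term ordering.
            max(PidA, PidB);
        {NodeA, NodeB} when NodeA > NodeB ->
            %% Node names are atoms, so every node orders them the same way.
            PidA;
        _ ->
            PidB
    end.

%% Kill the losing pid and return it, so the caller can drop its registration.
resolve(PidA, PidB) ->
    Loser = loser(PidA, PidB),
    exit(Loser, kill),
    Loser.
```

With a rule like this, two nodes that detect the same conflict at the same time would pick the same pid, so exactly one registration survives. It does not change the classification above: the name is still registered twice until the merge runs.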
Best Regards,
Michael
>
>
> So what could be the impact of this on a cluster where the conflict rate is higher, say 80%? Would an app like Syn mostly kill my entire cluster if I don't configure it properly? Or maybe I misunderstood something from my very brief reading of the code.
>
>
> Please consider that as per the use-case defined (IoT applications), conflicts are extremely minor.
> Your example would mean that 80% of the devices, during a net split, connected both to the disconnected node and the rest of the cluster. It is weird to say the least.
>
> That being said, I have not benchmarked this scenario, but here again we are talking about finding the conflicting keys and sending an exit signal to 1 of the 2 conflicting pids:
> https://github.com/ostinelli/syn/blob/master/src/syn_consistency.erl#L238
>
> These things are rather quick in the 7 digit numbers.
>
> The speed boost is interesting, but without more details about the app's handling of conflicts when the uniqueness of names isn't guaranteed, it's hard to form a solid idea of how it would go in the wild.
>
>
> If you mean uniqueness of names at a precise given time, indeed. Syn is eventually consistent.
>
>
> Best,
> r.