[erlang-questions] unsplit - resolving mnesia inconsistencies

Thu Feb 4 22:49:57 CET 2010

On Thu, Feb 04, 2010 at 10:39:02PM +0100, Ulf Wiger wrote:
> Andrew Thompson wrote:
> >This is great. We've been doing this manually in not nearly so nice a
> >fashion. How will this work with mnesia clusters of more than 2, like
> >let's say a 3 node cluster where one node gets split off by a netsplit
> >for a while - how do you avoid both of the other nodes trying to
> >reconcile the split?
> 
> These are good questions. I guess the big question is how many
> islands you expect to end up with in the worst case. In the
> case you mention, there are still two islands. One of the
> instances will enter the critical section (I guess the call
> to global:trans/3 has no reason not to use all available
> nodes) and address the split. The others should notice that
> it's fixed once they enter the critical section.
> 
> But I appreciate all attempts to poke holes in the approach.
> If we find a scenario that is not fixable, it is certainly
> better to find out this way, than having your mission-critical
> system go belly-up at the worst possible time. :)
> 
> There will always be pathological cases, of course. I've
> seen dual-ethernet backbones become so fragmented that
> the full mesh in an Erlang network started looking like
> Swiss cheese. If we can handle at least the sane error
> situations in a reliable way, I'll be fairly happy.
>

Well, I think since you're only handling one 'unsplit' at a time and
locking while doing it, any other 'unsplits' that happen while you're
handling the first one will patiently wait their turn in the mailbox, so
you should only ever have 1 unsplit to deal with at any one time.
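
A minimal sketch of that shape, not the actual unsplit code. The module
name unsplit_sketch, the MergeFun callback and the decide/2 helper are
made up for illustration, and treating "Node is already in
running_db_nodes" as "someone else already fixed it" is an assumption.
The only library calls used are mnesia:subscribe/1,
mnesia:system_info/1 and global:trans/3.

```erlang
-module(unsplit_sketch).
-behaviour(gen_server).

-export([start_link/1, decide/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% MergeFun is a hypothetical callback, fun(RemoteNode) -> term(),
%% that does the real work of reconciling the tables.
start_link(MergeFun) when is_function(MergeFun, 1) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, MergeFun, []).

init(MergeFun) ->
    %% Ask mnesia (which must already be running) to send its system
    %% events to this process.
    {ok, _} = mnesia:subscribe(system),
    {ok, MergeFun}.

handle_info({mnesia_system_event,
             {inconsistent_database, _Context, Node}}, MergeFun) ->
    LockId = {?MODULE, self()},
    Resolve =
        fun() ->
                %% Inside the critical section: re-check before acting,
                %% since another node may have merged in the meantime.
                case decide(Node, mnesia:system_info(running_db_nodes)) of
                    already_merged -> ok;
                    merge          -> MergeFun(Node)
                end
        end,
    %% Blocks until the cluster-wide lock is taken and Resolve is done.
    _ = global:trans(LockId, Resolve, [node() | nodes()]),
    {noreply, MergeFun};
handle_info(_Other, MergeFun) ->
    {noreply, MergeFun}.

%% Pure helper: skip the merge if the node is already back
%% (assumption, see above).
decide(Node, RunningDbNodes) ->
    case lists:member(Node, RunningDbNodes) of
        true  -> already_merged;
        false -> merge
    end.

handle_call(_Request, _From, MergeFun) ->
    {reply, {error, unsupported}, MergeFun}.

handle_cast(_Msg, MergeFun) ->
    {noreply, MergeFun}.
```

Because handle_info/2 does not return until global:trans/3 does, a
second inconsistent_database message just sits in the mailbox until the
first one has been dealt with.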

I agree that so long as we handle the typical netsplit cases we should
be good - it might be nice to have some tools that could be used to
manually fix things if something goes terribly wrong - it's often
annoying to fix mnesia issues like this by hand.

Andrew