Correct ordering of messages in a distributed system (and partition tolerance)

Tue Feb 22 22:47:36 CET 2011

Hey everyone,

I'll start with an example:

We are creating a blackjack game to run on a single node.  A single game is
represented as a gen_server and users connected to the server via TCP can
join the table and receive table events ('Player 1 bets 200 chips, etc.).
 In this scenario it's important that events are sent in the order that they
occur to ensure a consistent representation of the game for all users.  In
this particular case we can fairly easily ensure that the users will receive
the events in the correct order - events/requests are serialized through the
gen_server, so the gen_server instance can synchronously send out events to
the players at the table.

Now we decide that we want to add two more nodes that can be used for load
balancing and also for fault tolerance in case a node dies.  This works
great, but the only problem is that any game instances will die if the node
they are on dies.  We decide that we will instead store game data in an
external db instead of in a gen_server instance, and so if a node dies, it's
not a big deal because any game interaction will be done via the db anyway
(which has fail-safe mechanisms built in).  This works well, but we realize
that the messages to players could be sent in the wrong order if they are
connected to different servers.  If each update contains the full state of
the game this could still be ok if server-side or client-side we filter out
old updates via some auto-incrementing version id.

Is there a general approach to solving this sort of problem?  Using
something like gen_leader for the game in the first place sounds like a good
approach, but what about split brain scenarios?  Will gen_leader allow this
if the network is partitioned?  I've also thought about using mnesia and
subscribing event handlers to send out messages, but mnesia is still
susceptible split-brain scenarios.  I know Ulf Wiger has been doing work on
requiring a quorum for writes (
https://github.com/uwiger/otp/tree/mnesia-majority), which is awesome (!!),
but I'm not sure how ready this is for a production environment.

Despite the title of this message, I'm mostly curious about partition
tolerance.  When it comes to distributing data for fault tolerance I
continually run into an issue with how to handle network partition scenarios
- is it better to just look for solutions outside of erlang?

Ryan

PS I'm not making a blackjack game, just an example ;)