[erlang-questions] What happens if two gen_server call each other at the exact same time?

Fri Mar 25 02:13:32 CET 2016

On Thursday, 24 March 2016 23:49:44 Caiyun Deng wrote:
> Hi!
> What happens if two gen_server call each other at the exact same time?
> Thanks.

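They deadlock until the timeout fires (A waits on B, B waits on A). The
default gen_* call timeout is 5 seconds (5000 ms). When it fires, the call
exits in the caller with a timeout reason, which crashes the caller unless
it catches the exit.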
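
A minimal sketch of that situation, not taken from any real system: the
module name, the message shapes ({ask, ...}, ping, {result, ...}) and the
shortened 500 ms timeout are all assumptions made for illustration. Each
call can only be answered once the callee's own call has returned, so the
first call to return has to be a timeout. The other side usually times out
too, though it can occasionally get a late pong once its peer is free again.

```erlang
-module(deadlock_demo).
-behaviour(gen_server).

-export([run/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% Start two identical servers and tell each one to call the other.
run() ->
    {ok, A} = gen_server:start(?MODULE, [], []),
    {ok, B} = gen_server:start(?MODULE, [], []),
    Parent = self(),
    gen_server:cast(A, {ask, B, Parent}),
    gen_server:cast(B, {ask, A, Parent}),
    Results = [receive {result, P, R} -> R after 5000 -> no_report end
               || P <- [A, B]],
    ok = gen_server:stop(A),
    ok = gen_server:stop(B),
    Results.

init([]) ->
    {ok, nostate}.

handle_call(ping, _From, State) ->
    {reply, pong, State}.

handle_cast({ask, Peer, ReportTo}, State) ->
    %% Give both servers time to enter this callback before either calls,
    %% so neither can serve the other's ping first.
    timer:sleep(100),
    %% Hypothetical 500 ms timeout so the demo finishes quickly;
    %% gen_server:call/2 would wait the default 5000 ms.
    Result = try gen_server:call(Peer, ping, 500)
             catch exit:{timeout, _} -> timeout
             end,
    ReportTo ! {result, self(), Result},
    {noreply, State}.

%% On older OTP releases a reply that arrives after the timeout shows up
%% as a plain message; drop it.
handle_info(_Msg, State) ->
    {noreply, State}.
```

Calling deadlock_demo:run() from the shell should return a two-element list
containing at least one 'timeout'.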

Any time you write a protocol between two *identical* processes you
always wind up creating the potential for deadlock if you use synchronous
calls.

So the first rule of No Deadlock Club is you don't synchronously talk about
No Deadlock Club.

The second rule of No Deadlock Club is to make everything async unless
you have a definite reason that you can't.
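
As a rough illustration of the async version, here is a sketch in which the
same two peers ask each other the same question using only casts. The
module name, the message shapes and the Ref-keyed map of pending requests
are assumptions for the example, not a prescribed protocol. Nobody blocks,
so nobody deadlocks.

```erlang
-module(async_peer).
-behaviour(gen_server).

-export([run/0]).
-export([init/1, handle_call/3, handle_cast/2]).

run() ->
    {ok, A} = gen_server:start(?MODULE, [], []),
    {ok, B} = gen_server:start(?MODULE, [], []),
    Parent = self(),
    gen_server:cast(A, {ask, B, Parent}),
    gen_server:cast(B, {ask, A, Parent}),
    Results = [receive {result, P, R} -> R after 2000 -> no_report end
               || P <- [A, B]],
    ok = gen_server:stop(A),
    ok = gen_server:stop(B),
    Results.

%% State is a map of pending request Ref => process to report to.
init([]) ->
    {ok, #{}}.

handle_call(_Request, _From, Pending) ->
    {reply, {error, unsupported}, Pending}.

%% Send the question and return immediately; remember who wants the answer.
handle_cast({ask, Peer, ReportTo}, Pending) ->
    Ref = make_ref(),
    gen_server:cast(Peer, {ping, self(), Ref}),
    {noreply, Pending#{Ref => ReportTo}};
%% Answer a peer's question with another cast.
handle_cast({ping, From, Ref}, Pending) ->
    gen_server:cast(From, {pong, Ref}),
    {noreply, Pending};
%% The answer arrives later as an ordinary message.
handle_cast({pong, Ref}, Pending) ->
    case maps:find(Ref, Pending) of
        {ok, ReportTo} ->
            ReportTo ! {result, self(), pong},
            {noreply, maps:remove(Ref, Pending)};
        error ->
            %% Unknown or stale reply: ignore it.
            {noreply, Pending}
    end.
```

Here async_peer:run() should return [pong, pong]. The cost is visible in the
state: the server now has to track outstanding requests itself, which is
the bookkeeping a synchronous call would otherwise do for you.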

It's surprising, actually, to find just how much stuff in a system you can
do async, especially once you separate the "I'm a data processor/transformer"
sort of tasks from the "I hold state and serve answers to questions" type
tasks within a system. This is an oversimplification, of course, but
often you'll have things that are pretty clearly state managers and other
things that are pretty clearly workers that are tasked with something to
perform in relative isolation of the rest of the system.

This requires identifying both kinds of tasks, though, and at the outset
of a project that is not always obvious. Rubber ducking, crayons and
construction paper, and mocking up a system (which can quickly turn into
a full-blown system) can all be effective tools to help identify the
pieces and what their responsibilities are.

This brings me back to identical peers communicating, though. Occasionally
you really will require peers to contact one another directly, and you
won't be able to do everything asynchronously (or if you do, you'll wind
up re-implementing call/2 without realizing it, and then phased transactions
along with it).

In this case I cheat and write another process definition that exists only
to represent the transaction between the two peers. When Worker.A needs to
contact Worker.B it can now spawn_link Hand.A to stand between them.
Worker.A -> Hand.A is async, Hand.A <=> Worker.B is sync.

If Worker.B spawn_links a Hand.B at the exact same time, the communication
between Hand.B <=> Worker.A is sync, but Worker.B is free again because
Worker.B -> Hand.B was async. Along with creating "hands" to handle the
peer sync communication, Worker.A and Worker.B can mark themselves as
being in a transaction with the other, and decide to kill one of the
hands (I resolve this by sorting on the Worker.A and Worker.B Pids and
always killing the higher-value one -- arbitrary, yes, but it never hangs).
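
The hand arrangement could be sketched roughly as follows. This is only an
illustration under assumed names (hand_demo, {ask, ...}, {hand_done, ...})
and an assumed 1000 ms timeout. It leaves out the transaction marking and
the Pid-sorting tie-break described above, showing only why the workers
stay responsive.

```erlang
-module(hand_demo).
-behaviour(gen_server).

-export([run/0]).
-export([init/1, handle_call/3, handle_cast/2]).

run() ->
    {ok, A} = gen_server:start(?MODULE, [], []),
    {ok, B} = gen_server:start(?MODULE, [], []),
    Parent = self(),
    gen_server:cast(A, {ask, B, Parent}),
    gen_server:cast(B, {ask, A, Parent}),
    Results = [receive {result, P, R} -> R after 5000 -> no_report end
               || P <- [A, B]],
    ok = gen_server:stop(A),
    ok = gen_server:stop(B),
    Results.

%% State is a map of hand Pid => process to report to.
init([]) ->
    {ok, #{}}.

%% The worker itself is never blocked, so it can always answer a peer's hand.
handle_call(ping, _From, Hands) ->
    {reply, pong, Hands}.

%% Worker -> Hand is async: spawn the hand and return at once.
handle_cast({ask, Peer, ReportTo}, Hands) ->
    Owner = self(),
    Hand = spawn_link(fun() -> hand(Owner, Peer) end),
    {noreply, Hands#{Hand => ReportTo}};
%% The hand hands its result back with a cast.
handle_cast({hand_done, Hand, Result}, Hands) ->
    case maps:find(Hand, Hands) of
        {ok, ReportTo} ->
            ReportTo ! {result, self(), Result},
            {noreply, maps:remove(Hand, Hands)};
        error ->
            {noreply, Hands}
    end.

%% Hand <=> peer worker is sync: the hand blocks in place of its owner.
%% Failures are caught so the linked owner is not taken down with it.
hand(Owner, Peer) ->
    Result = try gen_server:call(Peer, ping, 1000)
             catch exit:Reason -> {error, Reason}
             end,
    gen_server:cast(Owner, {hand_done, self(), Result}).
```

With both workers asking at once, hand_demo:run() should return [pong, pong]:
each hand waits on the other worker, but neither worker waits on anything.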

Making sure that it is not possible for two processes to synchronously
contact each other at exactly the same time is a big task. When you are
writing a totally new project you can catalog your sync calls, but in
a pre-existing project things can be far more mysterious, and simply
grepping for things that look like sync calls may not reveal them all. So
always have timeouts. A timeout of 'infinity' is cute or some ideal or
whatever to some people, but it can leave peers deadlocked literally for
months without you noticing.

Unless you are loading billions of objects from a text file or waiting on
an external network resource, 5 seconds is a *very* long time within an
Erlang system. Think of it this way: we don't usually wait 5 seconds for
a network file system to mount or to acquire an auth key from a remote
resource before deciding that we should time out and try again. Why would
writing infinite timeouts into a massively concurrent system make sense?
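
One way to keep timeouts explicit is a small wrapper around
gen_server:call/3 that turns the timeout exit into a return value the
caller has to deal with. The module name, the {ok, _} | {error, timeout}
return shape and the 200 ms figure in the demo are assumptions for this
sketch; a sensible bound depends on the system.

```erlang
-module(bounded_call).

-export([call/3, demo/0]).

%% Call Server with an explicit, finite timeout in milliseconds.
%% Only the timeout is translated; other exits still propagate.
call(Server, Request, TimeoutMs) when is_integer(TimeoutMs) ->
    try gen_server:call(Server, Request, TimeoutMs) of
        Reply -> {ok, Reply}
    catch
        exit:{timeout, _} -> {error, timeout}
    end.

demo() ->
    %% A stand-in for a stuck peer: a process that never answers calls.
    Stuck = spawn(fun() -> receive stop -> ok end end),
    Result = call(Stuck, ping, 200),
    Stuck ! stop,
    Result.
```

bounded_call:demo() should return {error, timeout} after roughly 200 ms.
The is_integer guard rejects 'infinity', so an unbounded wait has to be
written out deliberately rather than slipping in by default.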

I have simplified some things above, but hopefully expressed the main ideas
surrounding this. This is actually a pretty big deal in certain kinds of
systems -- not so much in others.

-Craig