Detect pid reuse

Dinislam Salikhov Dinislam.Salikhov@REDACTED
Thu Jul 9 13:14:13 CEST 2020


Hi Jan,

> How many processes are you spawning per second?

~2000 per second per node. Most of them live for a few seconds or less.

According to the docs, the maximum is 268435456 processes. At our spawn rate it takes ~37 hours to use each pid at least once (if pids are handed out in a round-robin manner). We set the limit lower than the maximum, and with our settings it takes ~4 hours to use each pid at least once.
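
The arithmetic behind those two figures can be sketched as follows. This is only an illustration: the module and function names are made up for this note, and it assumes strictly round-robin pid allocation and a constant spawn rate, which the real runtime does not promise.

```erlang
-module(pid_wrap).
-export([hours_to_cycle/2, limit_for_hours/2]).

%% Hours needed to hand out every slot of the process table once,
%% assuming round-robin allocation and a constant spawn rate.
-spec hours_to_cycle(pos_integer(), pos_integer()) -> float().
hours_to_cycle(ProcessLimit, SpawnsPerSecond) ->
    ProcessLimit / SpawnsPerSecond / 3600.

%% Inverse: the table size that yields the requested cycle time.
-spec limit_for_hours(number(), pos_integer()) -> integer().
limit_for_hours(Hours, SpawnsPerSecond) ->
    round(Hours * 3600 * SpawnsPerSecond).
```

pid_wrap:hours_to_cycle(268435456, 2000) gives roughly 37.3, and pid_wrap:limit_for_hours(4, 2000) gives 28800000, i.e. a ~4 hour cycle corresponds to a table of about 29 million entries. The effective limit of a running node can be read with erlang:system_info(process_limit).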

> Erlang is trying quite hard to make PID unique (there is a very small probability to hit PID reuse) - isn't it possible, that there is a problem somewhere else?

It took me quite a long time to find the culprit, and it was pure luck that I did.
We have gen_fsm processes (along with other gen_* ones), and they log any unexpected message they receive in handle_info. Right after the log lines preceding gen_server:call(DB, CallArgs, infinity), there was a log line from a gen_fsm saying that {'$gen_call', CallArgs} was unexpected. The pid of the gen_server (DB) and the pid of the gen_fsm are the same (both are logged). gen_server:call/3 hung (as the gen_fsm couldn't answer it), so I had time to investigate the hanging process. So everything points to pid reuse.
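
To illustrate the symptom, here is a small sketch of what a process that is not a gen_server sees when it is the target of gen_server:call/3. It does not reproduce pid reuse itself; the "impostor" is a plain process standing in for the gen_fsm, and the module and function names are hypothetical. A finite timeout replaces the infinity from our code so that the demo returns.

```erlang
-module(stray_call).
-export([demo/0]).

%% Send a gen_server:call/3 to a process that is NOT a gen_server and
%% report both the caller's result and what the target received.
demo() ->
    Parent = self(),
    Impostor = spawn(fun() -> impostor(Parent) end),
    %% With 'infinity' this call would block forever, as described above.
    Result = (catch gen_server:call(Impostor, ping, 200)),
    Seen = receive {unexpected, M} -> M after 1000 -> none end,
    exit(Impostor, kill),
    {Result, Seen}.

%% Stand-in for the gen_fsm: forwards every message to Parent,
%% the way handle_info would log it, and never replies to the caller.
impostor(Parent) ->
    receive
        Msg ->
            Parent ! {unexpected, Msg},
            impostor(Parent)
    end.
```

The caller exits with a timeout reason (caught here), while the target receives a tuple of the form {'$gen_call', From, ping}, where From carries the caller's pid and a reply tag. That is the full shape of the message abbreviated as {'$gen_call', CallArgs} above.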

Dinislam Salikhov
________________________________________
From: Jan Chochol <jan.chochol@REDACTED>
Sent: Thursday, July 9, 2020 1:35 PM
To: Dinislam Salikhov
Cc: erlang-questions@REDACTED
Subject: Re: Detect pid reuse

Hi Dinislam,

How many processes are you spawning per second?
We (in our biggest production cluster) are spawning more than 10000
processes per second with very variable lifetime (from milliseconds to
hours) on the system running for more than a year and never faced a
problem with PID reuse (our logic also depends on PID uniqueness).
Erlang is trying quite hard to make PID unique (there is a very small
probability to hit PID reuse) - isn't it possible, that there is a
problem somewhere else?

If you need more uniqueness you can use e.g. "erlang:make_ref/0" (the same
thing is used by "gen:call" to connect requests with responses) as
your suggested token - I am not aware of any other workaround.
You can also try experimenting with the size of the Erlang process
table - it can affect the probability of PID reuse.
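
A minimal sketch of the token idea, assuming a plain process as the connection and a message protocol invented for this note (the module token_conn, the request tuple, and the stale_token error are all hypothetical, not part of OTP). Each connection gets a reference from make_ref/0 at start, callers keep {Pid, Token}, and the process answers only requests that carry its own token.

```erlang
-module(token_conn).
-export([start/0, call/3]).

%% Start a connection process; return its pid together with a unique token.
start() ->
    Token = make_ref(),
    Pid = spawn(fun() -> loop(Token) end),
    {Pid, Token}.

%% Send Request to the process identified by {Pid, Token}.
%% A fresh reference tags the reply so it cannot be confused with others.
call({Pid, Token}, Request, Timeout) ->
    Tag = make_ref(),
    Pid ! {request, Token, {self(), Tag}, Request},
    receive
        {Tag, Reply} -> Reply
    after Timeout ->
        {error, timeout}
    end.

%% Token is already bound, so the first clause matches only requests
%% carrying this process's own token; anything else is rejected.
loop(Token) ->
    receive
        {request, Token, {From, Tag}, Request} ->
            From ! {Tag, {ok, Request}},
            loop(Token);
        {request, _OtherToken, {From, Tag}, _Request} ->
            From ! {Tag, {error, stale_token}},
            loop(Token)
    end.
```

For example, after {Pid, Token} = token_conn:start(), the call token_conn:call({Pid, Token}, hello, 1000) returns {ok, hello}, while the same call with a different reference returns {error, stale_token}. Note the limitation: the rejection only works if the process that inherits the pid speaks the same protocol. An unrelated process would ignore the request, which is why the sketch uses a finite timeout.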

Jan Chochol

On Thu, Jul 9, 2020 at 11:03 AM Dinislam Salikhov
<Dinislam.Salikhov@REDACTED> wrote:
>
> Unfortunately, registering a process with a name doesn't help much, though it does reduce the time window in which the race may occur.
> For instance, when gen_server:call/3 is invoked, the library code calls whereis(Name) to get the pid and then sends it a message {'$gen_call', ...}. So the pid may be reused between erlang:whereis/1 and erlang:send/2 (actually, between erlang:whereis/1 and erlang:monitor/2 followed by erlang:send/2, so we would monitor the wrong process).
> See lib/stdlib/src/gen.erl, which is used by lib/stdlib/src/gen_server.erl.
>
> > If you have multiple connections to any given db (a pool of pools, if
> > you will), using a process group module like pg makes this easy.
>
> Never used it before. I'll have a look. Thanks for the reference.
>
> Dinislam Salikhov
> ________________________________________
> From: Aaron Seigo <aseigo@REDACTED>
> Sent: Thursday, July 9, 2020 10:26 AM
> To: Dinislam Salikhov
> Cc: erlang-questions@REDACTED
> Subject: Re: Detect pid reuse
>
> On 2020-07-06 14:09, Dinislam Salikhov wrote:
> > If I want to send a command to the database, I search for the pid of
> > the corresponding connection (in supervisor's children list). And
>
> Perhaps register the processes with a name so that instead of searching
> for a literal pid, which may indeed change and requires more bookkeeping
> in your application code, you look up the relevant connection by a name in a
> process registry. Should the old connection go away, the new one takes
> over the same name.
>
> If you have multiple connections to any given db (a pool of pools, if
> you will), using a process group module like pg makes this easy.
>
> Even then, you'll obviously need to handle the failure case of the
> process exiting between the message being sent and the response being
> received, but at least the lookup will be consistent.
>
> --
> Aaron Seigo

