Detect pid reuse

Thu Jul 9 14:57:43 CEST 2020

> According to the docs, the maximum is 268435456 processes. With our spawn rate it takes ~37 hours to use each pid at least once (if pids are used in a round-robin manner). We set the limit lower than the maximum, and with our settings it takes ~4 hours to use each pid at least once.

Even if you use a smaller process table, you should still have 2^28 PIDs
available (the table size only limits how many can be alive at the same
time). If you also have processes running longer than ~37 hours (and they
terminate from time to time), then your probability of PID reuse is much
higher.
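
As a rough check of the ~37 hour figure, the sketch below divides the
2^28 PID space by the quoted spawn rate of ~2000 per second. The module
and function names are hypothetical, and strict round-robin allocation
is the same simplifying assumption made in the quoted text.

```erlang
-module(pid_wrap).
-export([hours_to_cycle/2]).

%% Hours needed to hand out every pid once, assuming pids are
%% allocated strictly round-robin and the spawn rate is constant.
hours_to_cycle(PidSpace, SpawnsPerSecond) ->
    PidSpace / SpawnsPerSecond / 3600.
```

With these inputs, pid_wrap:hours_to_cycle(1 bsl 28, 2000) evaluates to
roughly 37.3.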

Back to your problem - PID reuse is painful to work around. I would
suggest storing, in each connection process, which database it is
connected to (some kind of validation data).
You will also need to extend the "gen_server:call" request with the same
validation data, and the connection process will then check whether the
validation data is correct.

Connection (called) process:
...
handle_call({_CallArgs, CallDatabase}, _From,
            #state{database = StateDatabase} = State)
    when CallDatabase =/= StateDatabase ->
  {reply, wrong_database, State};
handle_call({CallArgs, CallDatabase}, _From,
            #state{database = StateDatabase} = State)
    when CallDatabase =:= StateDatabase ->
  % do_the_work/2 must return a handle_call result, e.g. {reply, Reply, NewState}
  do_the_work(CallArgs, State);
...

Calling process:
...
call(CallArgs, Database) ->
  DB = find_pid_in_supervisor(Database),
  case gen_server:call(DB, {CallArgs, Database}) of
    wrong_database ->
      % PID reuse race condition - try again
      call(CallArgs, Database);
    Result ->
      Result
  end.
...
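
The two fragments above can be combined into a minimal, self-contained
gen_server sketch. The module name db_conn, the call/3 helper and the
echoed {ok, CallArgs} reply are placeholders for illustration; the pid
lookup and the retry loop are left out, so the caller passes the pid
directly.

```erlang
-module(db_conn).
-behaviour(gen_server).

-export([start_link/1, call/3]).
-export([init/1, handle_call/3, handle_cast/2]).

-record(state, {database}).

%% Start a connection process that remembers which database it serves.
start_link(Database) ->
    gen_server:start_link(?MODULE, Database, []).

%% Caller side: tag the request with the database we believe Pid serves.
call(Pid, Database, CallArgs) ->
    gen_server:call(Pid, {CallArgs, Database}).

init(Database) ->
    {ok, #state{database = Database}}.

%% Database appears twice in the clause head, so this clause matches
%% only when the caller's tag equals the stored database.
handle_call({CallArgs, Database}, _From, #state{database = Database} = State) ->
    {reply, {ok, CallArgs}, State};   % placeholder for the real work
handle_call({_CallArgs, _OtherDatabase}, _From, State) ->
    {reply, wrong_database, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

After {ok, Pid} = db_conn:start_link(db_a), the call
db_conn:call(Pid, db_a, ping) should return {ok, ping}, while
db_conn:call(Pid, db_b, ping) should return wrong_database. The check
only helps when the reused PID belongs to a process that understands the
tagged request; a process that ignores the message, like the gen_fsm in
the report, leaves gen_server:call waiting until its timeout.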

All other approaches look like they have some sort of race condition
regarding PID reuse.

Jan Chochol

On Thu, Jul 9, 2020 at 1:14 PM Dinislam Salikhov
<Dinislam.Salikhov@REDACTED> wrote:
>
> Hi Jan,
>
> > How many processes are you spawning per second?
>
> ~2000 per second per node. Most of them live a few seconds or less.
>
> According to the docs, the maximum is 268435456 processes. With our spawn rate it takes ~37 hours to use each pid at least once (if pids are used in a round-robin manner). We set the limit lower than the maximum, and with our settings it takes ~4 hours to use each pid at least once.
>
> > Erlang is trying quite hard to make PID unique (there is a very small probability to hit PID reuse) - isn't it possible, that there is a problem somewhere else?
>
> It took me quite a long time to find the culprit, and it was pure luck that I could.
> We have gen_fsm processes (along with other gen_* processes), and they log any unexpected message they receive in handle_info. Right after the log entries preceding gen_server:call(DB, CallArgs, infinity), there was a log entry from a gen_fsm saying that {'$gen_call', CallArgs} was unexpected. The pid of the gen_server (DB) and the pid of the gen_fsm are the same (both are logged). gen_server:call/3 hung (as the gen_fsm couldn't answer it), so I had time to investigate the hanging process. So everything points to pid reuse.
>
> Dinislam Salikhov
> ________________________________________
> From: Jan Chochol <jan.chochol@REDACTED>
> Sent: Thursday, July 9, 2020 1:35 PM
> To: Dinislam Salikhov
> Cc: erlang-questions@REDACTED
> Subject: Re: Detect pid reuse
>
> Hi Dinislam,
>
> How many processes are you spawning per second?
> We (in our biggest production cluster) are spawning more than 10000
> processes per second with very variable lifetimes (from milliseconds to
> hours) on a system that has been running for more than a year, and we
> have never faced a problem with PID reuse (our logic also depends on
> PID uniqueness).
> Erlang is trying quite hard to make PIDs unique (there is a very small
> probability of hitting PID reuse) - isn't it possible that there is a
> problem somewhere else?
>
> If you need more uniqueness you can use e.g. "erlang:make_ref/0" (same
> thing is used by "gen:call" to connect requests with responses) as
> your suggested token - I am not aware of any other workaround.
> You can also try experimenting with the size of the Erlang process
> table - it can affect the probability of PID reuse.
>
> Jan Chochol
>
> On Thu, Jul 9, 2020 at 11:03 AM Dinislam Salikhov
> <Dinislam.Salikhov@REDACTED> wrote:
> >
> > Unfortunately, registering a process with a name doesn't help much, although it does narrow the time window in which the race may occur.
> > For instance, when gen_server:call/3 is invoked, the library code calls whereis(Name) to get the pid and then sends it a {'$gen_call', ...} message. So the pid may be reused between erlang:whereis/1 and erlang:send/2 (actually, between erlang:whereis/1 and erlang:monitor/2 followed by erlang:send/2, so we would monitor the wrong process).
> > See lib/stdlib/src/gen.erl, which is used by lib/stdlib/src/gen_server.erl.
> >
> > > If you have multiple connections to any given db (a pool of pools, if
> > you will), using a process group module like pg makes this easy.
> >
> > Never used it before. I'll have a look. Thanks for the reference.
> >
> > Dinislam Salikhov
> > ________________________________________
> > From: Aaron Seigo <aseigo@REDACTED>
> > Sent: Thursday, July 9, 2020 10:26 AM
> > To: Dinislam Salikhov
> > Cc: erlang-questions@REDACTED
> > Subject: Re: Detect pid reuse
> >
> > On 2020-07-06 14:09, Dinislam Salikhov wrote:
> > > If I want to send a command to the database, I search for the pid of
> > > the corresponding connection (in supervisor's children list). And
> >
> > Perhaps register the processes with a name so that instead of searching
> > for a literal pid, which may indeed change and requires more bookkeeping
> > in your application code, you look up the relevant connection by name in
> > a process registry. Should the old connection go away, the new one takes
> > over the same name.
> >
> > If you have multiple connections to any given db (a pool of pools, if
> > you will), using a process group module like pg makes this easy.
> >
> > Even then, you'll obviously need to handle the failure case of the
> > process exiting between the message being sent and the response being
> > received, but at least the lookup will be consistent.
> >
> > --
> > Aaron Seigo