Detect pid reuse

Thu Jul 9 13:36:22 CEST 2020

> On Jul 9, 2020, at 19:14, Dinislam Salikhov <Dinislam.Salikhov@REDACTED> wrote:
> 
> Hi Jan,
> 
>> How many processes are you spawning per second?
> 
> ~2000 per second per node. Most of them live a few seconds or less.

Do you mean that ~2000 database connection processes crash per second per node?

That would mean each database connection process does some work and then crashes a few seconds later.

> 
> According to the docs, the maximum is 268435456 processes. With our spawn rate it takes ~37 hours to use each pid at least once (if pids are used in a round-robin manner). We set the limit lower than the maximum, and with our settings it takes ~4 hours to use each pid at least once.
> 
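
The ~37 hour figure follows from dividing the process limit by the spawn rate. A minimal sketch of that arithmetic, assuming strict round-robin allocation as the quoted text does (the module and function names are made up for illustration):

```erlang
-module(pid_wrap).
-export([hours_to_wrap/2]).

%% Hours needed to hand out every slot of the process table once,
%% assuming slots are consumed strictly in round-robin order.
%% TableSize: process limit; SpawnsPerSecond: spawn rate on one node.
hours_to_wrap(TableSize, SpawnsPerSecond) ->
    TableSize / SpawnsPerSecond / 3600.
```

With the numbers quoted above, pid_wrap:hours_to_wrap(268435456, 2000) comes out at roughly 37.3 hours.
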
>> Erlang tries quite hard to keep PIDs unique (there is only a very small probability of hitting PID reuse). Isn't it possible that the problem is somewhere else?
> 
> It took me quite a long time to find the culprit, and it was pure luck that I did.
> We have gen_fsm processes (along with other gen_* processes), and they log any unexpected message they receive in handle_info. Right after the log lines preceding gen_server:call(DB, CallArgs, infinity), a gen_fsm logged that {'$gen_call', CallArgs} was unexpected. The pid of the gen_server (DB) and the pid of the gen_fsm are the same (both are logged). gen_server:call/3 hung (as the gen_fsm couldn't answer it), so I had time to investigate the hanging process. So everything points to pid reuse.
> 
> Dinislam Salikhov
> ________________________________________
> From: Jan Chochol <jan.chochol@REDACTED>
> Sent: Thursday, July 9, 2020 1:35 PM
> To: Dinislam Salikhov
> Cc: erlang-questions@REDACTED
> Subject: Re: Detect pid reuse
> 
> Hi Dinislam,
> 
> How many processes are you spawning per second?
> We (in our biggest production cluster) are spawning more than 10000
> processes per second with very variable lifetime (from milliseconds to
> hours) on the system running for more than a year and never faced a
> problem with PID reuse (our logic also depends on PID uniqueness).
> Erlang tries quite hard to keep PIDs unique (there is only a very small
> probability of hitting PID reuse). Isn't it possible that the
> problem is somewhere else?
> 
> If you need more uniqueness you can use e.g. "erlang:make_ref/0" (the same
> thing is used by "gen:call" to connect requests with responses) as
> your suggested token. I am not aware of any other workaround.
> You can also try experimenting with the size of the Erlang process
> table; it can affect the probability of PID reuse.
> 
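
One possible reading of the token suggestion, sketched under assumptions: each connection process creates a reference with make_ref() at startup, callers keep {Pid, Token} instead of a bare pid, and every request carries the token. The module name, message shapes and the stale_pid atom below are all hypothetical and do not come from the thread or from OTP.

```erlang
-module(token_conn).
-export([start/0, call/2]).

%% Start a hypothetical connection process and return its identity:
%% the pid plus a reference that no later process will share.
start() ->
    Token = make_ref(),
    Pid = spawn(fun() -> loop(Token) end),
    {Pid, Token}.

%% Send Request to the process identified by {Pid, Token}.
call({Pid, Token}, Request) ->
    MRef = erlang:monitor(process, Pid),
    Pid ! {request, self(), MRef, Token, Request},
    receive
        {reply, MRef, Reply} ->
            erlang:demonitor(MRef, [flush]),
            Reply;
        {'DOWN', MRef, process, Pid, Reason} ->
            {error, {down, Reason}}
    end.

loop(Token) ->
    receive
        %% Token is already bound, so this clause matches only our own token.
        {request, From, MRef, Token, Request} ->
            From ! {reply, MRef, {ok, Request}},
            loop(Token);
        %% Any other token means the caller holds a stale identity.
        {request, From, MRef, _OtherToken, _Request} ->
            From ! {reply, MRef, {error, stale_pid}},
            loop(Token)
    end.
```

This only helps when the process that now owns the pid speaks the same protocol. An unrelated process, such as the gen_fsm described above, would still ignore the message, so the caller would also need a timeout instead of waiting forever.
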
> Jan Chochol
> 
>> On Thu, Jul 9, 2020 at 11:03 AM Dinislam Salikhov
>> <Dinislam.Salikhov@REDACTED> wrote:
>> 
>> Unfortunately, registering a process with a name doesn't help much, although it does reduce the time window in which the race may occur.
>> For instance, when gen_server:call/3 is invoked, the library code calls whereis(Name) to get the pid and then sends it a message {'$gen_call', ...}. So between erlang:whereis/1 and erlang:send/2, the pid may be reused (actually, it is between erlang:whereis/1 and erlang:monitor/2 followed by erlang:send/2, so we will monitor the wrong process).
>> See lib/stdlib/src/gen.erl which is used by lib/stdlib/src/gen_server.erl
>> 
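
The window described above can be illustrated with a simplified sketch. This is not the code in gen.erl; the module name and the error tuples are made up, and only the order of whereis, monitor and send matters here.

```erlang
-module(name_call).
-export([call/2]).

%% Simplified illustration of calling a registered process by name.
%% This is NOT the stdlib implementation.
call(Name, Request) ->
    case whereis(Name) of
        undefined ->
            {error, noproc};
        Pid ->
            %% Window: if Pid exits here and its identifier is handed to a
            %% new process, the monitor and the send below both target
            %% that new process instead of the intended one.
            MRef = erlang:monitor(process, Pid),
            Pid ! {'$gen_call', {self(), MRef}, Request},
            receive
                {MRef, Reply} ->
                    erlang:demonitor(MRef, [flush]),
                    {ok, Reply};
                {'DOWN', MRef, process, Pid, Reason} ->
                    {error, Reason}
            end
    end.
```

If the new owner of the pid never replies and never exits, neither receive clause fires and the caller blocks, which matches the hang with an infinity timeout reported elsewhere in this thread.
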
>>> If you have multiple connections to any given db (a pool of pools, if
>>> you will), using a process group module like pg makes this easy.
>> 
>> Never used it before. I'll have a look. Thanks for the reference.
>> 
>> Dinislam Salikhov
>> ________________________________________
>> From: Aaron Seigo <aseigo@REDACTED>
>> Sent: Thursday, July 9, 2020 10:26 AM
>> To: Dinislam Salikhov
>> Cc: erlang-questions@REDACTED
>> Subject: Re: Detect pid reuse
>> 
>>> On 2020-07-06 14:09, Dinislam Salikhov wrote:
>>> If I want to send a command to the database, I search for the pid of
>>> the corresponding connection (in supervisor's children list). And
>> 
>> Perhaps register the processes with a name so that instead of searching
>> for a literal pid, which may indeed change and requires more bookkeeping
>> in your application code, you look up the relevant connection by name in a
>> process registry. Should the old connection go away, the new one takes
>> over the same name.
>> 
>> If you have multiple connections to any given db (a pool of pools, if
>> you will), using a process group module like pg makes this easy.
>> 
>> Even then, you'll obviously need to handle the failure case of the
>> process exiting between the message being sent and the response being
>> received, but at least the lookup will be consistent.
>> 
>> --
>> Aaron Seigo
>