Detect pid reuse

Thu Jul 9 14:37:19 CEST 2020

> How about spawning the processes related to database connections on independent nodes?

Well, it may solve the issue (sending a message to the wrong pid) at the cost of architectural complexity. I can't say for sure right now; it requires experimenting.

Though the gotcha with pid reuse will wait for the next developer to deal with it :)
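
For reference, the token idea discussed further down this thread (carrying a reference from make_ref/0 alongside the pid) could be sketched roughly as below. This is only an illustration, not our production code: the module name pid_token, the message shapes and the stale_pid atom are all made up for the example, and it uses plain processes rather than gen_server.

```erlang
-module(pid_token).
-export([start/0, call/3]).

%% Hypothetical sketch: a worker creates a unique token with make_ref/0
%% and hands it to its parent. Callers keep {Pid, Token} together.
start() ->
    Parent = self(),
    Pid = spawn(fun() ->
                    Tok = make_ref(),
                    Parent ! {started, self(), Tok},
                    loop(Tok)
                end),
    receive
        {started, Pid, Token} -> {Pid, Token}
    end.

%% The worker serves a request only if it carries the worker's own token.
loop(Token) ->
    receive
        {request, From, ReqRef, Token, Msg} ->
            %% Token matches: the caller really meant this process.
            From ! {reply, ReqRef, {ok, Msg}},
            loop(Token);
        {request, From, ReqRef, _OtherToken, _Msg} ->
            %% Token differs: the caller holds a pid from an earlier process.
            From ! {reply, ReqRef, {error, stale_pid}},
            loop(Token)
    end.

%% Send a request tagged with the token and wait for the matching reply.
%% A finite Timeout keeps the caller from hanging if the pid now belongs
%% to a process that does not speak this protocol at all.
call({Pid, Token}, Msg, Timeout) ->
    ReqRef = make_ref(),
    Pid ! {request, self(), ReqRef, Token, Msg},
    receive
        {reply, ReqRef, Reply} -> Reply
    after Timeout ->
        {error, timeout}
    end.
```

With this sketch, pid_token:call({Pid, Token}, ping, 1000) returns {ok, ping} for the pair returned by pid_token:start(), and {error, stale_pid} if the token is replaced by a fresh make_ref(). It only detects reuse among processes that check the token; an unrelated process that inherited the pid would simply not reply, which is why the timeout is finite here.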
________________________________________
From: Yao Bao <by@REDACTED>
Sent: Thursday, July 9, 2020 3:26 PM
To: Dinislam Salikhov
Cc: Jan Chochol; erlang-questions@REDACTED
Subject: Re: Detect pid reuse

> On July 9, 2020, at 20:20, Dinislam Salikhov <Dinislam.Salikhov@REDACTED> wrote:
>
>>>> How many processes are you spawning per second?
>>> ~2000 per second per node. Most of them live a few seconds or less.
>> Do you mean ~2000 database connection processes crash per second per node?
>
> No. I mean that 2000 processes are spawned overall. A fraction of them are connection processes, indeed, but most are processes not directly related to the database (a process per client connection, for example).

How about spawning the processes related to database connections on independent nodes?
Then this problem might disappear (if those database connection processes do not crash as frequently as the others).

> ________________________________________
> From: Yao Bao <by@REDACTED>
> Sent: Thursday, July 9, 2020 2:36 PM
> To: Dinislam Salikhov
> Cc: Jan Chochol; erlang-questions@REDACTED
> Subject: Re: Detect pid reuse
>
>> On July 9, 2020, at 19:14, Dinislam Salikhov <Dinislam.Salikhov@REDACTED> wrote:
>>
>> Hi Jan,
>>
>>> How many processes are you spawning per second?
>>
>> ~2000 per second per node. Most of them live a few seconds or less.
>
> Do you mean ~2000 database connection processes crash per second per node?
>
> That would mean each database connection process does something and crashes a few seconds later.
>
>>
>> According to the docs, the maximum is 268435456 processes. With our spawn rate it takes ~37 hours to use each pid at least once (if pids are used in a round-robin manner). We set the limit lower than the maximum, and with our settings it takes ~4 hours to use each pid at least once.
>>
>>> Erlang is trying quite hard to make PID unique (there is a very small probability to hit PID reuse) - isn't it possible, that there is a problem somewhere else?
>>
>> It took me quite a long time to find the culprit, and it was pure luck that I could.
>> We have gen_fsm processes (along with other gen_*) and they log any unexpected message they receive in handle_info. Right after the logs preceding the gen_server:call(DB, CallArgs, infinity), there was a log from a gen_fsm saying that {'$gen_call', CallArgs} is unexpected. The pid of the gen_server (DB) and the pid of the gen_fsm are the same (they are logged). gen_server:call/3 hung (as the gen_fsm couldn't answer it), so I had time to investigate the hanging process. So everything points to pid reuse.
>>
>> Dinislam Salikhov
>> ________________________________________
>> From: Jan Chochol <jan.chochol@REDACTED>
>> Sent: Thursday, July 9, 2020 1:35 PM
>> To: Dinislam Salikhov
>> Cc: erlang-questions@REDACTED
>> Subject: Re: Detect pid reuse
>>
>> Hi Dinislam,
>>
>> How many processes are you spawning per second?
>> We (in our biggest production cluster) are spawning more than 10000
>> processes per second with very variable lifetime (from milliseconds to
>> hours) on the system running for more than a year and never faced a
>> problem with PID reuse (our logic also depends on PID uniqueness).
>> Erlang is trying quite hard to make PID unique (there is a very small
>> probability to hit PID reuse) - isn't it possible, that there is a
>> problem somewhere else?
>>
>> If you need more uniqueness you can use e.g. "erlang:make_ref/0" (the
>> same thing is used by "gen:call" to connect requests with responses) as
>> your suggested token - I am not aware of any other workaround.
>> You can also try experimenting with the size of the Erlang process
>> table - it can affect the probability of PID reuse.
>>
>> Jan Chochol
>>
>>> On Thu, Jul 9, 2020 at 11:03 AM Dinislam Salikhov
>>> <Dinislam.Salikhov@REDACTED> wrote:
>>>
>>> Unfortunately, registering a process with a name doesn't help much. It does reduce the time window in which the race may occur, though.
>>> For instance, when gen_server:call/3 is invoked, the library code calls whereis(Name) to get the pid and then sends it a message {'$gen_call', ...}. So between erlang:whereis/1 and erlang:send/2, the pid may be reused (actually, it is between erlang:whereis/1 and erlang:monitor/2 followed by erlang:send/2, so we will monitor the wrong process).
>>> See lib/stdlib/src/gen.erl which is used by lib/stdlib/src/gen_server.erl
>>>
>>>> If you have multiple connections to any given db (a pool of pools, if
>>>> you will), using a process group module like pg makes this easy.
>>>
>>> Never used it before. I'll have a look. Thanks for the reference.
>>>
>>> Dinislam Salikhov
>>> ________________________________________
>>> From: Aaron Seigo <aseigo@REDACTED>
>>> Sent: Thursday, July 9, 2020 10:26 AM
>>> To: Dinislam Salikhov
>>> Cc: erlang-questions@REDACTED
>>> Subject: Re: Detect pid reuse
>>>
>>>> On 2020-07-06 14:09, Dinislam Salikhov wrote:
>>>> If I want to send a command to the database, I search for the pid of
>>>> the corresponding connection (in supervisor's children list). And
>>>
>>> Perhaps register the processes with a name so that instead of searching
>>> for a literal pid, which may indeed change and requires more bookkeeping
>>> in your application code, you look up the relevant connection by name in
>>> a process registry. Should the old connection go away, the new one takes
>>> over the same name.
>>>
>>> If you have multiple connections to any given db (a pool of pools, if
>>> you will), using a process group module like pg makes this easy.
>>>
>>> Even then, you'll obviously need to handle the failure case of the
>>> process exiting between the message being sent and the response being
>>> received, but at least the lookup will be consistent.
>>>
>>> --
>>> Aaron Seigo
>>
>
>