[erlang-questions] gen_statem - state_timeout sometimes reaching callback module as {info, {timeout, Ref, Name}}

Peter Morgan peter.james.morgan@REDACTED
Fri Apr 12 17:13:27 CEST 2019


Hi Raimo,

Thanks for looking at this, and for the background on the implementation, which is very useful. I am in the process of putting together a simple test case, but I am having trouble reproducing the problem. It appears to happen very infrequently - possibly it is load related? The original output is from the Crash Report, which I’ve attached - there you can see that the info timeout was received by our gen_statem behaviour callback module. Our OTP is 21.2 (without any 21.2.X patches):

{stdlib,"ERTS  CXC 138 10","3.7”}
{kernel,"ERTS  CXC 138 10","6.2”}


As to your question - "why so many?" - it could be that a failure of a downstream service is ultimately causing this, and we are not backing off in that case - something I'm checking too. However, I still think it is interesting that the info timeout reaches the behaviour module at all.
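
For context, the pattern in question looks roughly like this - a simplified sketch with placeholder names (fetcher_sketch, fetching) rather than our actual module:

-module(fetcher_sketch).
-behaviour(gen_statem).

-export([start_link/0, callback_mode/0, init/1, fetching/3]).

start_link() ->
    gen_statem:start_link(?MODULE, [], []).

callback_mode() ->
    state_functions.

init([]) ->
    {ok, fetching, #{}, [{state_timeout, 500, fetch}]}.

fetching(state_timeout, fetch, _Data) ->
    %% do the fetch work here, then restart the state timeout
    {repeat_state_and_data, [{state_timeout, 500, fetch}]};
fetching({call, From}, Req, _Data) ->
    {keep_state_and_data, [{reply, From, {ok, Req}}]}.
%% There is no info clause, so an unexpected {info, {timeout, Ref, fetch}}
%% event terminates the state machine - exactly the crash in the quoted
%% report below.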

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: info.timeout.txt
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20190412/9575dfd7/attachment.txt>
-------------- next part --------------


I’ll try and get a simple test case for you.
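
Roughly what I have in mind for the test case - a rough sketch only, with placeholder names, and nothing yet confirmed to actually trigger the bug: restart a very short state_timeout via repeat_state as fast as possible, add some call traffic so the running timer also gets cancelled and restarted, and log if the timeout ever arrives as an info event:

-module(state_timeout_repro).
-behaviour(gen_statem).

-export([start_link/0, ping/1, callback_mode/0, init/1, busy/3]).

start_link() ->
    gen_statem:start_link(?MODULE, [], []).

ping(Pid) ->
    gen_statem:call(Pid, ping).

callback_mode() ->
    state_functions.

init([]) ->
    {ok, busy, 0, [{state_timeout, 1, tick}]}.

busy(state_timeout, tick, Count) ->
    %% re-arm the state timeout on every expiry, as our fetcher does
    {repeat_state, Count + 1, [{state_timeout, 1, tick}]};
busy({call, From}, ping, Count) ->
    %% extra traffic: every call also restarts the state timeout,
    %% forcing gen_statem to cancel the running timer
    {keep_state_and_data, [{reply, From, Count}, {state_timeout, 1, tick}]};
busy(info, {timeout, _Ref, tick} = Msg, Count) ->
    %% should never match; if it does, the state_timeout has leaked
    %% through as an info event and we have reproduced the problem
    error_logger:error_msg("state_timeout leaked as info: ~p (count ~p)~n",
                           [Msg, Count]),
    keep_state_and_data.

Something like [state_timeout_repro:ping(P) || _ <- lists:seq(1, 100000)] from the shell would then provide the extra load.
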
Regards,
Peter.



> On 12 Apr 2019, at 13:38, Raimo Niskanen <raimo+erlang-questions@REDACTED> wrote:
> 
> On Fri, Apr 12, 2019 at 10:38:42AM +0200, Raimo Niskanen wrote:
>> On Wed, Apr 10, 2019 at 01:41:06PM +0100, Peter Morgan wrote:
>>> Hello -
>>> 
>>> We are _sometimes_ seeing cases where a state_timeout in a gen_statem results in the timeout reaching the callback module as info {timeout, Ref, Name} with OTP 21.2.
>>> 
>>> The crash looks like:
>>> 
>>> =ERROR REPORT==== 8-Apr-2019::23:37:19.035346 === <0.969.0> gen_statem:error_info/5:1895
>>> ** State machine {kafire_fetcher,<<"abc">>,<<"def">>,0} terminating
>>> ** Last event = {info,{timeout,#Ref<0.2399112782.889192450.118024>,fetch}}
>>> 
>>> We crash because we are not expecting info messages - interestingly the following messages are in the queue for the crashed process:
>>> 
>>> 
>>>    message_queue_len: 279
>>>    messages: [{timeout,#Ref<0.2399112782.903610369.70160>,fetch},
>>>                  {cancel_timer,#Ref<0.2399112782.903610369.70160>,false},
>>>                  {cancel_timer,#Ref<0.2399112782.903610369.70168>,4},
>>>                  {cancel_timer,#Ref<0.2399112782.889192450.118038>,5},
>>>                  {cancel_timer,#Ref<0.2399112782.889192450.118039>,5},
>>>                  {cancel_timer,#Ref<0.2399112782.889192450.118043>,5},
>>>                  {cancel_timer,#Ref<0.2399112782.889192450.118044>,4},
>>>                  {cancel_timer,#Ref<0.2399112782.889192450.118045>,5},
>>>                  {cancel_timer,#Ref<0.2399112782.889192450.118049>,5},
>>>                  {cancel_timer,#Ref<0.2399112782.889192450.118050>,5},
>>>                  {cancel_timer,#Ref<0.2399112782.889192450.118051>,4},
>>>                  {cancel_timer,#Ref<0.2399112782.889192450.118052>,5},
>>>                  {cancel_timer,#Ref<0.2399112782.889192450.118053>,5},
>>>                  {cancel_timer,#Ref<0.2399112782.889192450.118054>,5},
>>>                  {cancel_timer,#Ref<0.2399112782.889192450.118055>,4},
>>>                  {cancel_timer,#Ref<0.2399112782.889192450.118056>,5},
>>>                  {cancel_timer,#Ref<0.2399112782.889192450.118060>,5},
>> 
>> Is that from a process_info() call?
>> 
>>> 
>>> Followed by lots more cancel_timer messages (200ish!). Our gen_statem does use {state_timeout, 500, fetch} so we are expecting a "fetch" timeout to happen, and we use repeat_state_and_data to restart the state timeout (and also to transition to other states). Is it possible that a timeout in gen_statem can be delivered as an info message?
>> 
>> It is not supposed to happen.  This must be a bug.
>> 
>> The gen_statem engine keeps track of the running state_timeout timer
>> and should never present it as an info event.
>> 
>> When a timer is stopped (or restarted), this is done with an asynchronous
>> cancel (in OTP 21), so the cancel_timer messages come from that.
>> They should be matched against map entries in the internal gen_statem
>> engine state when they arrive, so their presence in the inbox may be OK.
>> 
>> However, how come you have so many?  Are you restarting the state_timeout
>> in a very tight loop of repeat_state:s?  Can you show the essential parts
>> of the code that cause the state_timeout:s?
>> 
>> Also - OTP-21.2; is that your exact OTP version?  What is your stdlib version?
>> 
>> I will look at the relevant parts of the code with the new knowledge that
>> a state_timeout timer can be lost, probably in combination with
>> repeat_state and state_timeout restart!
> 
> 
> Sorry, the knowledge did not help; I did not find anything suspicious.
> 
> More information will be needed...
> 
>> 
>> On master since Jan 21, and therefore in OTP-22.0-rc1, the timer handling
>> has (among other changes) been rewritten to use a synchronous cancel, which
>> simplified the code significantly.  I do not know if that would be worth trying?
>> 
>> / Raimo Niskanen
>> 
>> 
>>> 
>>> Thanks
>>> Peter.
>>> 
> 
> -- 
> 
> / Raimo Niskanen, Erlang/OTP, Ericsson AB
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
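
Incidentally, the shapes of the messages in the queue match what erlang:start_timer/3 expiry and the asynchronous form of erlang:cancel_timer/2 produce - roughly (shell session for illustration only, reference values made up):

1> Ref = erlang:start_timer(500, self(), fetch).
#Ref<0.1234.5678.90>
2> erlang:cancel_timer(Ref, [{async, true}, {info, true}]).
ok
3> flush().
Shell got {cancel_timer,#Ref<0.1234.5678.90>,493}
ok
4> erlang:start_timer(1, self(), fetch), timer:sleep(10), flush().
Shell got {timeout,#Ref<0.1234.5678.91>,fetch}
ok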


