[erlang-questions] gen_statem - state_timeout sometimes reaching callback module as [info, {timeout, Ref, Name}

Raimo Niskanen raimo+erlang-questions@REDACTED
Fri Apr 12 10:38:42 CEST 2019


On Wed, Apr 10, 2019 at 01:41:06PM +0100, Peter Morgan wrote:
> Hello -
> 
> We are _sometimes_ seeing cases where a state_timeout in a gen_statem results in the timeout reaching the callback module as info {timeout, Ref, Name} with OTP 21.2.
> 
> The crash looks like:
> 
> =ERROR REPORT==== 8-Apr-2019::23:37:19.035346 === <0.969.0> gen_statem:error_info/5:1895
> ** State machine {kafire_fetcher,<<“abc">>,<<“def">>,0} terminating
> ** Last event = {info,{timeout,#Ref<0.2399112782.889192450.118024>,fetch}}
> 
> We crash because we are not expecting info messages - interestingly the following messages are in the queue for the crashed process:
> 
> 
>     message_queue_len: 279
>     messages: [{timeout,#Ref<0.2399112782.903610369.70160>,fetch},
>                   {cancel_timer,#Ref<0.2399112782.903610369.70160>,false},
>                   {cancel_timer,#Ref<0.2399112782.903610369.70168>,4},
>                   {cancel_timer,#Ref<0.2399112782.889192450.118038>,5},
>                   {cancel_timer,#Ref<0.2399112782.889192450.118039>,5},
>                   {cancel_timer,#Ref<0.2399112782.889192450.118043>,5},
>                   {cancel_timer,#Ref<0.2399112782.889192450.118044>,4},
>                   {cancel_timer,#Ref<0.2399112782.889192450.118045>,5},
>                   {cancel_timer,#Ref<0.2399112782.889192450.118049>,5},
>                   {cancel_timer,#Ref<0.2399112782.889192450.118050>,5},
>                   {cancel_timer,#Ref<0.2399112782.889192450.118051>,4},
>                   {cancel_timer,#Ref<0.2399112782.889192450.118052>,5},
>                   {cancel_timer,#Ref<0.2399112782.889192450.118053>,5},
>                   {cancel_timer,#Ref<0.2399112782.889192450.118054>,5},
>                   {cancel_timer,#Ref<0.2399112782.889192450.118055>,4},
>                   {cancel_timer,#Ref<0.2399112782.889192450.118056>,5},
>                   {cancel_timer,#Ref<0.2399112782.889192450.118060>,5},

Is that from a process_info() call?

> 
> Followed by lots more cancel_timer messages (200ish!). Our gen_statem does use {state_timeout, 500, fetch} so we are expecting a “fetch” timeout to happen, and we use repeat_state_and_data to requeue the state timeout (and also transition to other states). Is it possible that timeouts in gen_statem can be delivered as an info message?

It is not supposed to happen.  This must be a bug.

The gen_statem engine keeps track of the running state_timeout timer
and should never present it as an info event.

When a timer is stopped (or restarted) it is done with an asynchronous
cancel (in OTP 21) so the cancel_timer messages comes from that.
They should be matched against map entries in the internal gen_statem
engine state when they arrive so them being in the inbox may be ok.

However, how come you have so many?  Are you restarting the state_timeout
in a very tight loop of repeat_state:s?  Can you show the essential parts
of the code that causes state_timeout:s?

Also - OTP-21.2; is that your exact OTP version?  What is your stdlib version?

I will look at the relevant parts of the code with the new knowledge that
a state_timeout timer can be lost, probably in combination with
repeat_state and state_timeout restart!

On master since Jan 21 so in OTP-22.0-rc1 among other changes the timer handling
has been rewritten to use a synchronous cancel, which simplified the code
significantly.  I do not know if that would be worth trying?

/ Raimo Niskanen


> 
> Thanks
> Peter.
> 

-- 

/ Raimo Niskanen, Erlang/OTP, Ericsson AB



More information about the erlang-questions mailing list