[erlang-questions] Can OTP worker resume from same state after being restarted by supervisor?

Sun Sep 19 02:21:46 CEST 2010

# Tushar Deshpande 2010-09-19:
> I've a question about OTP's restart strategy.
> 
> I wrote a simple OTP application with just a root supervisor
> and a single worker.  The worker has an internal state defined
> by a single integer (count).  The 'count' is initialized to zero.
> Each call to the worker process increments this count by one.
> I called the worker process 5 times using gen_server:call.
> This updated the current value of the counter to 5.
> 
> Then, I opened the appmon and killed the worker process.
> OTP supervisor promptly restarted it.  I thought that the
> process would resume with the same state that it had at
> the time of crash.  But, I observed that for the restarted process,
> the 'counter' was reset to zero.

Right, this is the expected behaviour -- a supervisor's children are
restarted with the arguments given in that supervisor's child specs,
and the child's init/1 function uses those arguments to compute its
initial state.
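
To make that concrete, here is a minimal sketch of such a worker. The
module name counter_worker, the helper child_spec/0 and the initial
value 0 are my assumptions, not taken from your application. The point
is that the supervisor only remembers the {M, F, A} tuple in the child
spec, so every restart calls start_link(0) and init/1 sees 0 again.

```erlang
-module(counter_worker).
-behaviour(gen_server).

-export([start_link/1, increment/0, child_spec/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

%% Child spec as the supervisor stores it (hypothetical helper).
%% The [0] is the argument list passed on every start AND restart.
child_spec() ->
    {?MODULE, {?MODULE, start_link, [0]},
     permanent, 5000, worker, [?MODULE]}.

start_link(Initial) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, Initial, []).

increment() ->
    gen_server:call(?MODULE, increment).

%% The initial state is computed only from the start argument.
init(Initial) ->
    {ok, Initial}.

handle_call(increment, _From, Count) ->
    New = Count + 1,
    {reply, New, New}.

handle_cast(_Msg, Count) -> {noreply, Count}.
handle_info(_Info, Count) -> {noreply, Count}.
terminate(_Reason, _Count) -> ok.
code_change(_OldVsn, Count, _Extra) -> {ok, Count}.
```

A supervisor would return this spec from its own init/1, e.g.
{ok, {{one_for_one, 5, 10}, [counter_worker:child_spec()]}}.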

Restarting with exactly the same internal state by default would
be a bad idea -- if that state led to a crash, then there's really
no reason at all to believe in its integrity; better to start all
over again.

> I would like the process to resume from the same state after
> it's restarted by supervisor.
> 
> Is it possible to do this in OTP?

Yes. If you want to store state persistently, you'd probably go with
a Mnesia table; for transient storage (not preserved across application
restarts) you could use a public ETS table owned by a do-nothing server
process living alongside your child supervisor.
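
A rough sketch of the transient variant, assuming a table-owner module
I'm calling state_keeper and a table named worker_state (both names are
made up for illustration). The keeper does nothing except own the
table; because the table is public, workers read and write it directly
and never message the keeper.

```erlang
-module(state_keeper).
-behaviour(gen_server).

-export([start_link/0, save/2, load/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

-define(TABLE, worker_state).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Runs in the caller's (worker's) process, not in the keeper.
save(Key, Value) ->
    true = ets:insert(?TABLE, {Key, Value}),
    ok.

%% Returns the stored value, or Default if nothing was saved yet.
load(Key, Default) ->
    case ets:lookup(?TABLE, Key) of
        [{Key, Value}] -> Value;
        []             -> Default
    end.

%% The table lives exactly as long as this process does.
init([]) ->
    ?TABLE = ets:new(?TABLE, [named_table, public, set]),
    {ok, no_state}.

handle_call(_Request, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.
handle_info(_Info, State) -> {noreply, State}.
terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.
```

A worker's init/1 could then call state_keeper:load(count, 0) instead
of hard-coding 0. The keeper has to be started before the workers and
must not be restarted together with them, or the table goes with it.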

As to when to store worker process state snapshots -- you can either
isolate known-safe spots in the worker's lifetime or trap exits and
store state from the terminate/2 callback. In more realistic scenarios,
you'd probably offload this responsibility from workers altogether and
come up with some kind of job manager server to deal with it instead.
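
The trap-exits variant might look roughly like this. It assumes a
public named ETS table called worker_state already exists and is owned
by some longer-lived process; the module name snapshot_worker and the
key count are also my inventions. Note that terminate/2 is not called
when the process is killed with the untrappable exit(Pid, kill), so
this only covers shutdowns and crashes inside the callbacks.

```erlang
-module(snapshot_worker).
-behaviour(gen_server).

-export([start_link/0, increment/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

%% Assumed to exist already: a public, named ETS table.
-define(TABLE, worker_state).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

increment(Pid) ->
    gen_server:call(Pid, increment).

init([]) ->
    %% Trap exits so a supervisor shutdown reaches terminate/2.
    process_flag(trap_exit, true),
    Count = case ets:lookup(?TABLE, count) of
                [{count, Saved}] -> Saved;   % restore last snapshot
                []               -> 0        % first start
            end,
    {ok, Count}.

handle_call(increment, _From, Count) ->
    New = Count + 1,
    {reply, New, New}.

handle_cast(_Msg, Count) -> {noreply, Count}.
handle_info(_Info, Count) -> {noreply, Count}.

%% Store the snapshot on the way out.
terminate(_Reason, Count) ->
    ets:insert(?TABLE, {count, Count}),
    ok.

code_change(_OldVsn, Count, _Extra) -> {ok, Count}.
```

Saving at a known-safe spot instead would simply mean moving the
ets:insert/2 call into handle_call/3 right after the new count is
computed.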

Another point is that typically you'd want to store only carefully
chosen bits of worker state instead of the whole context structure --
that could get messy if you're keeping track of more dynamic things
than an integer or two (think pids or ets ids or some transaction ids
that could all be long gone by the time you restart & restore worker
context).
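
As an illustration only, suppose the worker state were a record with a
counter, a client pid and a private cache table (all field and function
names here are hypothetical). The snapshot keeps the counter and drops
the rest; restoring rebuilds the volatile parts from scratch.

```erlang
-module(worker_snapshot).
-export([new/0, bump/1, count/1, to_snapshot/1, from_snapshot/1]).

%% count is worth keeping; client_pid and cache_tab are only
%% meaningful while the current process is alive.
-record(state, {count = 0, client_pid, cache_tab}).

new() ->
    #state{client_pid = self(), cache_tab = ets:new(cache, [set])}.

bump(#state{count = Count} = State) ->
    State#state{count = Count + 1}.

count(#state{count = Count}) ->
    Count.

%% Keep only the durable bits, as a plain property list.
to_snapshot(#state{count = Count}) ->
    [{count, Count}].

%% Rebuild a fresh state: restored counter, new cache table,
%% and no client until one registers again.
from_snapshot(Snapshot) ->
    Count = proplists:get_value(count, Snapshot, 0),
    #state{count = Count,
           client_pid = undefined,
           cache_tab = ets:new(cache, [set])}.
```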

Lastly, for the particular case you're describing, you can remove the
counter from server state and instead rely on ets:update_counter/N
or mnesia:dirty_update_counter/N to manage it.
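
For the ETS flavour, a small sketch, assuming a public named table
called counters that is owned by a process which outlives the worker
(the module name ets_counter and the key count are made up). The
increment is atomic and the value never lives in the worker at all, so
a worker restart cannot reset it.

```erlang
-module(ets_counter).
-export([init/0, increment/0, value/0]).

%% Assumed to exist already: a public, named ETS table.
-define(TABLE, counters).

%% Create the counter only if it is not there yet; returns false
%% (and leaves the value alone) when called again after a restart.
init() ->
    ets:insert_new(?TABLE, {count, 0}).

%% Atomically add 1 and return the new value.
increment() ->
    ets:update_counter(?TABLE, count, 1).

value() ->
    [{count, N}] = ets:lookup(?TABLE, count),
    N.
```

The worker would call ets_counter:init() from its init/1 and
ets_counter:increment() from handle_call/3.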

HTH,
	-- Jachym