[erlang-questions] Erlang hangs in supervisor:do_terminate/2

Nico Kruber nico.kruber@REDACTED
Fri Jul 31 18:15:52 CEST 2015


On Monday 13 Jul 2015 10:11:38 Lukas Larsson wrote:
> Hello Nico,
> 
> On Sat, Jul 11, 2015 at 2:33 PM, Nico Kruber <nico.kruber@REDACTED> wrote:
> > Hi,
> > I'm having trouble with supervisor:do_terminate/2 in both Erlang 18.0.1
> > and 18.0.2, which I haven't seen with earlier versions so far. I
> > currently do not have a minimal example but will try to come up with
> > one soon.
> 
> Would be great if you could manage that. Minimal examples are always much
> easier to work with.
> 
> > I'm using Scalaris, and inside a single process I'm starting its services
> > (multiple processes in a supervisor tree) and stopping them again in a
> > loop. Sometimes, although very rarely, stopping the services seems to
> > hang. When I send the beam process a SIGUSR1, I can always see two
> > processes in the "Running" state:
> > 1) a supervisor in supervisor:do_terminate/2 (any of the present
> >    supervisors - not always the same!)
> > 2) a child/worker of this supervisor handling a message (or at least, so
> >    it seems)
> > 
> > Their stack traces seem inconclusive; please find an example of the two
> > processes from the crashdump_viewer below.
> 
> Are the stack traces always the same for the supervisor? I.e., is it
> Running in do_terminate? Since there is a receive in do_terminate, if it
> were stuck there I would have expected it to be in the Waiting state. But
> if it is Running and hanging, that could point to some kind of livelock.
> 
> > Is there any known problem/change in Erlang 18 that could have caused
> > this?
> 
> There are changes in supervisor for 18 (the introduction of maps as child
> specs), but they *should* not cause any problems like these.
> 
> A smaller example demonstrating what is going wrong for you would help a
> lot in trying to understand what is going on.
> 
> Lukas

Hi Lukas,
unfortunately, reproducing this error requires some effort. For now, I can
reliably reproduce it with Scalaris (see below), but the system needs to be
overloaded for the bug to show up early.

The supervisor stack traces are always the same (with do_terminate on top).
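
For context: as you say, do_terminate boils down to a terminate-and-wait
pattern, so a stuck supervisor should normally show up as Waiting in one of
its receives. Roughly like this simplified sketch (not the actual OTP
source; the module name and details are mine):

-module(shutdown_sketch).
-export([shutdown/2]).

%% Monitor the child, ask it to stop, then block in a receive until its
%% 'DOWN' message arrives or the shutdown timeout expires.
shutdown(Pid, Timeout) when is_pid(Pid) ->
    MRef = erlang:monitor(process, Pid),
    exit(Pid, shutdown),
    receive
        {'DOWN', MRef, process, Pid, shutdown} ->
            ok;
        {'DOWN', MRef, process, Pid, Reason} ->
            {error, Reason}
    after Timeout ->
        %% the child did not stop in time: kill it unconditionally
        exit(Pid, kill),
        receive
            {'DOWN', MRef, process, Pid, Reason2} ->
                {error, Reason2}
        end
    end.

Seeing the supervisor Running instead suggests it never parks in either
receive, i.e. some kind of livelock, as you suspected.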

Whenever the bug appears, everything in that Erlang VM is stuck: even remote
connections, e.g. for debugging, no longer work, although the node is still
listed in epmd.
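
A quick way to double-check this from a second, healthy node (the node
names here are placeholders; the hanging VM's name is whatever epmd -names
lists):

%% run from a second, healthy node (e.g. started with: erl -sname probe);
%% 'node4711@localhost' stands for the hanging VM's node name
case net_adm:ping('node4711@localhost') of
    pong -> io:format("node answers~n");
    pang -> io:format("node is registered in epmd but does not respond~n")
end.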


Please find the gdb backtraces attached. I created two snapshots each of two
hanging Erlang VMs - it does indeed look like a livelock, since the traces
only differ slightly.
I also have the Erlang crashdump files, but they are too large to post here
(3-4 MiB) - I can send them to you personally if you'd like to have a look.
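
In case you want to inspect the dumps, OTP's built-in crashdump viewer can
be started from any Erlang shell:

crashdump_viewer:start().  % then load the erl_crash.dump file via the GUI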


Regards
Nico

## How to reproduce using Scalaris:

wget https://github.com/scalaris-team/scalaris/archive/master.tar.gz -O - | tar -xz
cd scalaris-master
./configure && make

Then run the following in (#CPU-cores * 2) shells (e.g. 8 shells for the
4 CPU cores I have):

read -r -d '' EVALCMD <<'EOF'
log:set_log_level(none),
[begin
  io:format(" ~B", [I]),
  admin:add_node([{first}]),admin:add_nodes(3),
  io:format("K "),
  Killed = api_vm:kill_nodes(4),
  4 = length(Killed)
 end || I <- lists:seq(1, 10000)].
EOF
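(Each iteration starts four Scalaris nodes inside the VM -
admin:add_node([{first}]) plus admin:add_nodes(3) - and then kills all four
again with api_vm:kill_nodes(4); the hang presumably strikes during one of
these shutdown rounds, at which point the shell stops printing iteration
numbers.)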
SCALARIS_PORT=$RANDOM
ERL_SCHED_FLAGS="" ./bin/scalarisctl -n node$RANDOM -m start -y $RANDOM \
  -p $SCALARIS_PORT -t first_nostart \
  -e "-noinput \
      -scalaris mgmt_server \"{{127,0,0,1},${SCALARIS_PORT},mgmt_server}\" \
      -scalaris known_hosts \"[{{127,0,0,1},${SCALARIS_PORT},service_per_vm}]\" \
      -scalaris monitor_perf_interval \"0\" \
      -scalaris lb_active_use_gossip \"false\" \
      -scalaris gossip_load_number_of_buckets \"1\" \
      -scalaris gossip_load_additional_modules \"[]\" \
      -pa '$PWD/test' -eval '${EVALCMD}'"

(If one start fails immediately, just try again - this hack uses $RANDOM for
the port and node name but does not restrict its values accordingly...)
Anyway, after some time at least one of these shells will stop producing any
output -> that Erlang VM hangs!


FYI: the gdb traces were created with
gdb -ex "set pagination 0" -ex "thread apply all bt full" --batch -p <PID>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hang-1.tar.gz
Type: application/x-compressed-tar
Size: 9478 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20150731/bc5f1441/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hang-2.tar.gz
Type: application/x-compressed-tar
Size: 9684 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20150731/bc5f1441/attachment-0001.bin>

