[erlang-questions] Erlang hangs in supervisor:do_terminate/2
Nico Kruber
nico.kruber@REDACTED
Fri Jul 31 18:15:52 CEST 2015
On Monday 13 Jul 2015 10:11:38 Lukas Larsson wrote:
> Hello Nico,
>
> On Sat, Jul 11, 2015 at 2:33 PM, Nico Kruber <nico.kruber@REDACTED> wrote:
> > Hi,
> > I'm having trouble with supervisor:do_terminate/2 in both Erlang 18.0.1
> > and 18.0.2, which I haven't seen with earlier versions so far. I currently
> > do not have a minimal use case but will try to come up with something
> > soon.
>
> Would be great if you could manage that. Minimal examples are always much
> easier to work with.
>
> > I'm using Scalaris and, inside a single process, I'm starting its
> > services (multiple processes in a supervisor tree) and stopping them
> > again in a loop. Sometimes, although very rarely, stopping the services
> > seems to hang. When I send the beam process a SIGUSR1, I can always see
> > two processes in the "Running" state:
> > 1) a supervisor in supervisor:do_terminate/2 (any of the present
> > supervisors - not always the same!)
> > 2) a child/worker of this supervisor handling a message (or at least, so
> > it seems)
> >
> > Their stack traces seem inconclusive; please find an example of the two
> > processes from the crashdump_viewer below.
>
> Are the stack traces always the same for the supervisor, i.e. is it always
> Running in do_terminate? Since there is a receive in do_terminate, if it
> were simply stuck there I would have expected it to be in the Waiting
> state. But if it is Running and hanging, it could point to some kind of
> livelock.
>
> > Is there any known problem/change in Erlang 18 that could have caused
> > this?
>
> There are changes in supervisor for 18 (the introduction of maps as child
> specs), but it *should* not cause any problems like these.
>
> A smaller example demonstrating what is going wrong for you would help a
> lot in trying to understand what is going on.
>
> Lukas
Hi Lukas,
unfortunately, reproducing this error requires some effort. For now, I can
reliably reproduce it with Scalaris (see below), but the system needs to be
overloaded for the bug to appear early.
The supervisor stack traces are always the same (with do_terminate on top).
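Regarding the receive you mentioned: as far as I understand it, the child
shutdown path looks roughly like the following (a simplified sketch from
memory, not the actual OTP source):

  shutdown(Pid, Timeout) ->
      Mon = erlang:monitor(process, Pid),
      exit(Pid, shutdown),
      receive
          {'DOWN', Mon, process, Pid, _Reason} ->
              ok
      after Timeout ->
          %% brutal kill once the shutdown timeout expires
          exit(Pid, kill),
          receive
              {'DOWN', Mon, process, Pid, _Reason} -> ok
          end
      end.

If the supervisor were merely blocked in one of these receives, it should
indeed show up as Waiting, so seeing it Running fits your livelock theory.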
Whenever the bug appears, everything in that Erlang VM is stuck; even remote
connections (e.g. for debugging) no longer work, although the node is still
listed in epmd.
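For example, from a second, healthy VM, even a simple ping to the hanging
node eventually fails (the node name below is made up):

  %% The hanging node is still registered in epmd, but the
  %% distribution connection is never fully established:
  net_adm:ping('node12345@127.0.0.1').  %% eventually returns pang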
Please find the gdb backtraces attached. I created two snapshots each of two
hanging Erlang VMs; it does indeed look like a livelock, since the traces
differ only slightly.
I also have the Erlang crashdump files, but they are too large to post here
(3-4 MiB); I can send them to you personally if you would like to have a look.
Regards
Nico
## Way to reproduce using Scalaris:
wget https://github.com/scalaris-team/scalaris/archive/master.tar.gz -O - |
tar -xz
cd scalaris-master
./configure && make
Then run the following in (#CPU-cores * 2) shells (e.g. 8 shells for the 4
CPU cores I have):
read -r -d '' EVALCMD <<'EOF'
log:set_log_level(none),
[begin
   io:format(" ~B", [I]),
   %% start 4 Scalaris nodes (i.e. supervision trees) in this VM ...
   admin:add_node([{first}]), admin:add_nodes(3),
   io:format("K "),
   %% ... and kill them again
   Killed = api_vm:kill_nodes(4),
   4 = length(Killed)
 end || I <- lists:seq(1, 10000)].
EOF
SCALARIS_PORT=$RANDOM
ERL_SCHED_FLAGS="" ./bin/scalarisctl -n node$RANDOM -m start -y $RANDOM \
  -p $SCALARIS_PORT -t first_nostart -e "-noinput \
  -scalaris mgmt_server \"{{127,0,0,1},${SCALARIS_PORT},mgmt_server}\" \
  -scalaris known_hosts \"[{{127,0,0,1},${SCALARIS_PORT},service_per_vm}]\" \
  -scalaris monitor_perf_interval \"0\" \
  -scalaris lb_active_use_gossip \"false\" \
  -scalaris gossip_load_number_of_buckets \"1\" \
  -scalaris gossip_load_additional_modules \"[]\" \
  -pa '$PWD/test' -eval '${EVALCMD}'"
(If one start fails immediately, just try again; this hack uses $RANDOM for
the node names and ports but does not restrict its values, e.g. to usable
port numbers.)
Anyway, after some time, at least one of these processes will stop producing
any output - that Erlang VM hangs!
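For what it's worth, the Erlang loop above boils down to repeatedly starting
and stopping a small supervision tree. As plain OTP, the pattern would look
roughly like the following (a hypothetical minimal module - this alone has
NOT been confirmed to trigger the hang):

  %% Hypothetical, minimal version of the start/stop churn above.
  -module(churn).
  -behaviour(supervisor).
  -export([run/1, start_link/0, init/1, start_worker/0]).

  %% Start a small supervision tree and shut it down again, N times.
  run(0) -> ok;
  run(N) ->
      {ok, Sup} = start_link(),
      Mon = erlang:monitor(process, Sup),
      unlink(Sup),         %% don't take the caller down with the tree
      exit(Sup, shutdown), %% makes the supervisor terminate its children
      receive {'DOWN', Mon, process, Sup, _} -> ok end,
      run(N - 1).

  start_link() ->
      supervisor:start_link(?MODULE, []).

  init([]) ->
      Child = {w1, {?MODULE, start_worker, []},
               temporary, 1000, worker, [?MODULE]},
      {ok, {{one_for_one, 10, 1}, [Child]}}.

  %% A trivial worker that loops until it is told (or forced) to stop.
  start_worker() ->
      {ok, spawn_link(fun Loop() ->
                              receive stop -> ok
                              after 100 -> Loop()
                              end
                      end)}.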
FYI: the gdb traces were created with:
gdb -ex "set pagination 0" -ex "thread apply all bt full" --batch -p <PID>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hang-1.tar.gz
Type: application/x-compressed-tar
Size: 9478 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20150731/bc5f1441/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hang-2.tar.gz
Type: application/x-compressed-tar
Size: 9684 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20150731/bc5f1441/attachment-0001.bin>