[erlang-questions] os:cmd hang on OTP 18
Mikael Pettersson
mikpelinux@REDACTED
Fri Jul 28 11:48:57 CEST 2017
Dániel Szoboszlay writes:
> I think I found the bug. After a fork/vfork the child process needs some
> initialisation. In case of vfork, this is done by executing the
> erl_child_setup program. In case of a fork however the setup code is in the
> same module, sys.c, where the fork happens. There are comments about how
> important it is to keep erl_child_setup.c and the relevant parts of sys.c in
> sync.
>
> In OTP 18 erts began to use a new signal, ERTS_SYS_SUSPEND_SIGNAL
> internally. This signal is in fact SIGUSR2. The child process has to
> unblock all signals used by erts as part of its initialisation. And there's
> an inconsistency here between the vfork
> <https://github.com/erlang/otp/blob/OTP-18.3.4.5/erts/emulator/sys/unix/erl_child_setup.c#L134-L136>
> and fork
> <https://github.com/erlang/otp/blob/OTP-18.3.4.5/erts/emulator/sys/unix/sys.c#L1005-L1022>
> cases: erl_child_setup.c does not unblock SIGUSR2.
>
> And it turns out that lbzip2 wants to use SIGUSR2 for communication between
> its worker processes, but this signal is blocked when we call it from
> Erlang's os:cmd (with vfork), so the program hangs.
>
> This patch to erts/emulator/sys/unix/erl_child_setup.c solved the problem:
>
> @@ -134,6 +134,7 @@ main(int argc, char *argv[])
>      sys_sigrelease(SIGCHLD);
>      sys_sigrelease(SIGINT);
>      sys_sigrelease(SIGUSR1);
> +    sys_sigrelease(SIGUSR2);
>
>      if (erts_spawn_executable) {
>          if (argv[CS_ARGV_NO_OF_ARGS + 1] == NULL) {
>
>
> Daniel
Nice find.
This matters because the subsequent exec only resets caught signals to
their default disposition (since their handlers would disappear), while
the signal mask is inherited unchanged: blocked signals remain blocked,
breaking the child's expectations.
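
For anyone who wants to see the effect in isolation, here is a minimal
sketch. It is not code from erts: the function name is hypothetical, and
"sleep 1" merely stands in for lbzip2. The parent blocks SIGUSR2, forks,
and execs; a close-on-exec pipe tells the parent when the exec has
happened, after which it sends SIGUSR2 and looks at how the child ended.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns 1 if the exec'd child still had SIGUSR2 blocked,
 *         0 if SIGUSR2 was delivered and killed it,
 *        -1 on any error.
 * release_before_exec != 0 mimics the extra sys_sigrelease(SIGUSR2).
 */
int usr2_blocked_after_exec(int release_before_exec)
{
    sigset_t usr2, saved;
    int fds[2], status, result = -1;
    char byte;
    pid_t pid;

    sigemptyset(&usr2);
    sigaddset(&usr2, SIGUSR2);

    if (pipe(fds) != 0)
        return -1;
    /* The write end vanishes when exec succeeds, giving the parent EOF. */
    if (fcntl(fds[1], F_SETFD, FD_CLOEXEC) != 0 ||
        sigprocmask(SIG_BLOCK, &usr2, &saved) != 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    pid = fork();
    if (pid == 0) {
        /* Child: inherits the blocked SIGUSR2 from the parent. */
        close(fds[0]);
        signal(SIGUSR2, SIG_DFL);
        if (release_before_exec)
            sigprocmask(SIG_UNBLOCK, &usr2, NULL);
        execlp("sleep", "sleep", "1", (char *)NULL);
        _exit(127); /* exec failed */
    }

    close(fds[1]);
    if (pid > 0) {
        /* Wait for EOF, i.e. until the child has exec'd (or exited). */
        while (read(fds[0], &byte, 1) > 0)
            ;
        kill(pid, SIGUSR2);
        if (waitpid(pid, &status, 0) == pid) {
            if (WIFSIGNALED(status) && WTERMSIG(status) == SIGUSR2)
                result = 0; /* signal got through */
            else if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
                result = 1; /* signal stayed pending, sleep finished */
        }
    }
    close(fds[0]);
    sigprocmask(SIG_SETMASK, &saved, NULL); /* restore the caller's mask */
    return result;
}
```

Called with 0, this should report that the signal is still blocked in
the exec'd program, which is the situation lbzip2 found itself in.
Called with 1, the signal should get through, which is what the patch
above achieves. This assumes a "sleep" binary on PATH that does not
touch its own signal mask.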
>
> On Thu, 27 Jul 2017 at 09:21 Dániel Szoboszlay <dszoboszlay@REDACTED>
> wrote:
>
> > Thanks Dmytro, this really helped a lot!
> >
> > I think the commit you pointed to is not directly related: it only changes
> > Erlang code, and if the behaviour depends on whether you are using a
> > release/debug build, the root cause is most probably somewhere in the C
> > code of erts.
> >
> > But the commit message talks about the emulator no longer using vfork, and
> > it was a good clue: disabling vfork on 18 prevents the problem. So this one
> > will finish:
> >
> > ERL_NO_VFORK=true erl +A0 +S 1:1 -noinput -noshell -eval 'os:cmd("tar -C /tmp/ -xf /tmp/tartest --use-compress-program=lbzip2"), init:stop().'
> >
> > Thanks again for the clue, I will look into the difference between using
> > fork/vfork in OTP 18!
> >
> > Daniel
> >
> > On Thu, 27 Jul 2017 at 02:09 Dmytro Lytovchenko <
> > dmytro.lytovchenko@REDACTED> wrote:
> >
> >> I could observe the behaviour only in R18, but not in R19 and not in R20.
> >> I also could not reproduce it in debug flavour of R18 emulator, but it
> >> reproduces reliably in release SMP variant.
> >>
> >> The changes to os.erl between 18.3.4.5 and 19.0 include removal of os:cmd
> >> server which might somehow be related (commit
> >> *200247f972b012ced0c4b2c6611f091af66ebedd*). This commit *possibly*
> >> fixes the behavior — in R19 (build 19.0 by Kerl) the behaviour does not
> >> happen.
> >>
> >> 2017-07-26 21:47 GMT+02:00 Dániel Szoboszlay <dszoboszlay@REDACTED>:
> >>
> >>> Honestly, I didn't try with other command variations. There are many
> >>> commands that do not hang when run from os:cmd, regardless of the OTP
> >>> version. But this particular command does hang with one OTP version, and
> >>> not with the other OTP version. So the difference is in OTP, and I want to
> >>> find out what has changed.
> >>>
> >>> Daniel
> >>>
> >>> On Wed, 26 Jul 2017 at 21:34 Dmytro Lytovchenko <
> >>> dmytro.lytovchenko@REDACTED> wrote:
> >>>
> >>>> Is it something lbzip2 related?
> >>>> Did you try normal single-thread bzip2? (-j flag or --bzip2)
> >>>> What if you use gzip? (-z or --gzip)
> >>>>
> >>>> 2017-07-26 21:27 GMT+02:00 Dániel Szoboszlay <dszoboszlay@REDACTED>:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I've encountered a strange problem with os:cmd when running tar and
> >>>>> lbzip2. Steps to reproduce:
> >>>>>
> >>>>> # create some lbzip2 compressed data
> >>>>>
> >>>>> dd if=/dev/urandom of=/tmp/testfile count=10
> >>>>> tar -cf - -C /tmp testfile | lbzip2 -6 -n 4 | dd of=/tmp/tartest status=none
> >>>>>
> >>>>>
> >>>>> # try to extract the archive from Erlang with os:cmd
> >>>>>
> >>>>> erl -noinput -eval 'os:cmd("tar -C /tmp/ -xf /tmp/tartest --use-compress-program=lbzip2"), init:stop().'
> >>>>>
> >>>>>
> >>>>> This worked fine with OTP 17.5.6.7, but with OTP 18.3.4.5 the command
> >>>>> hangs: lbzip2 just sits in a rt_sigsuspend syscall waiting for a USR2, PIPE
> >>>>> or XFSZ signal. And its parent, the tar process waits in a wait4 syscall
> >>>>> for lbzip2 to terminate.
> >>>>>
> >>>>> I don't have any newer OTP version installed at the moment, so I'm not
> >>>>> sure how OTP 19 or 20 would behave.
> >>>>>
> >>>>> I tried to strace the processes, but there's too much noise; I
> >>>>> couldn't figure out anything interesting there yet.
> >>>>>
> >>>>> I also tried to diff OTP 17 & 18, but os:cmd/1 and friends didn't
> >>>>> change. I'm not sure about the port code, but at least the release notes
> >>>>> didn't mention anything major. Or did I miss something? Does anyone have an
> >>>>> idea what may have changed between these OTP versions?
> >>>>>
> >>>>> Thanks,
> >>>>> Daniel
> >>>>>
> >>>>> _______________________________________________
> >>>>> erlang-questions mailing list
> >>>>> erlang-questions@REDACTED
> >>>>> http://erlang.org/mailman/listinfo/erlang-questions
> >>>>>
> >>>>>
> >>>>
> >>
>