[erlang-questions] os:cmd hang on OTP 18

Mikael Pettersson
Fri Jul 28 11:48:57 CEST 2017


Dániel Szoboszlay writes:
 > I think I found the bug. After a fork/vfork the child process needs some
 > initialisation. In case of vfork, this is done by executing the
 > erl_child_setup program. In case of a fork however the setup code is in the
 > same module, sys.c, where the fork happens. There are comments about how
 > important it is to keep erl_child_setup.c and the relevant parts of sys.c in
 > sync.
 > 
 > In OTP 18 erts began to use a new signal, ERTS_SYS_SUSPEND_SIGNAL
 > internally. This signal is in fact SIGUSR2. The child process has to
 > unblock all signals used by erts as part of its initialisation. And there's
 > an inconsistency here between the vfork
 > <https://github.com/erlang/otp/blob/OTP-18.3.4.5/erts/emulator/sys/unix/erl_child_setup.c#L134-L136>
 > and fork
 > <https://github.com/erlang/otp/blob/OTP-18.3.4.5/erts/emulator/sys/unix/sys.c#L1005-L1022>
 > cases: erl_child_setup.c does not unblock SIGUSR2.
 > 
 > And it turns out that lbzip2 wants to use SIGUSR2 for communication between
 > its worker processes, but this signal is blocked when we call it from
 > Erlang's os:cmd (with vfork), so the program hangs.
 > 
 > This patch to erts/emulator/sys/unix/erl_child_setup.c solved the problem:
 > 
 > @@ -134,6 +134,7 @@ main(int argc, char *argv[])
 >      sys_sigrelease(SIGCHLD);
 >      sys_sigrelease(SIGINT);
 >      sys_sigrelease(SIGUSR1);
 > +    sys_sigrelease(SIGUSR2);
 > 
 >      if (erts_spawn_executable) {
 >          if (argv[CS_ARGV_NO_OF_ARGS + 1] == NULL) {
 > 
 > 
 > Daniel

Nice find.

This matters because the subsequent exec only resets caught signals to
their default disposition (since their handlers would disappear). The
signal mask is inherited unchanged, so blocked signals remain blocked in
the new program, breaking the child's expectations.


 > 
 > On Thu, 27 Jul 2017 at 09:21 Dániel Szoboszlay wrote:
 > 
 > > Thanks Dmytro, this really helped a lot!
 > >
 > > I think the commit you pointed to is not directly related: it only changes
 > > Erlang code, and if the behaviour depends on whether you are using a
 > > release/debug build, the root cause is most probably somewhere in the C
 > > code of erts.
 > >
 > > But the commit message talks about the emulator no longer using vfork, and
 > > it was a good clue: disabling vfork on 18 prevents the problem. So this one
 > > will finish:
 > >
 > > ERL_NO_VFORK=true erl +A0 +S 1:1 -noinput -noshell -eval 'os:cmd("tar -C /tmp/ -xf /tmp/tartest --use-compress-program=lbzip2"), init:stop().'
 > >
 > > Thanks again for the clue, I will look into the difference between using
 > > fork/vfork in OTP 18!
 > >
 > > Daniel
 > >
 > > On Thu, 27 Jul 2017 at 02:09 Dmytro Lytovchenko wrote:
 > >
 > >> I could observe the behaviour only in R18, not in R19 or R20.
 > >> I also could not reproduce it in the debug flavour of the R18 emulator, but it
 > >> reproduces reliably in the release SMP variant.
 > >>
 > >> The changes to os.erl between 18.3.4.5 and 19.0 include removal of the os:cmd
 > >> server, which might somehow be related (commit
 > >> *200247f972b012ced0c4b2c6611f091af66ebedd*). This commit *possibly*
 > >> fixes the behaviour: in R19 (build 19.0 by Kerl) the hang does not
 > >> happen.
 > >>
 > >> 2017-07-26 21:47 GMT+02:00 Dániel Szoboszlay:
 > >>
 > >>> Honestly, I didn't try with other command variations. There are many
 > >>> commands that do not hang when run from os:cmd, regardless of the OTP
 > >>> version. But this particular command does hang with one OTP version, and
 > >>> not with the other OTP version. So the difference is in OTP, and I want to
 > >>> find out what has changed.
 > >>>
 > >>> Daniel
 > >>>
 > >>> On Wed, 26 Jul 2017 at 21:34 Dmytro Lytovchenko wrote:
 > >>>
 > >>>> Is it something lbzip2 related?
 > >>>> Did you try normal single-thread bzip2? (-j flag or --bzip2)
 > >>>> What if you use gzip? (-z or --gzip)
 > >>>>
 > >>>> 2017-07-26 21:27 GMT+02:00 Dániel Szoboszlay:
 > >>>>
 > >>>>> Hi,
 > >>>>>
 > >>>>> I've encountered a strange problem with os:cmd when running tar and
 > >>>>> lbzip2. Steps to reproduce:
 > >>>>>
 > >>>>> # create some lbzip2 compressed data
 > >>>>>
 > >>>>> dd if=/dev/urandom of=/tmp/testfile count=10
 > >>>>> tar -cf - -C /tmp testfile | lbzip2 -6 -n 4 | dd of=/tmp/tartest status=none
 > >>>>>
 > >>>>>
 > >>>>> # try to extract the archive from Erlang with os:cmd
 > >>>>>
 > >>>>> erl -noinput -eval 'os:cmd("tar -C /tmp/ -xf /tmp/tartest --use-compress-program=lbzip2"), init:stop().'
 > >>>>>
 > >>>>>
 > >>>>> This worked fine with OTP 17.5.6.7, but with OTP 18.3.4.5 the command
 > >>>>> hangs: lbzip2 just sits in a rt_sigsuspend syscall waiting for a USR2, PIPE
 > >>>>> or XFSZ signal, and its parent, the tar process, waits in a wait4 syscall
 > >>>>> for lbzip2 to terminate.
 > >>>>>
 > >>>>> I don't have any newer OTP version installed at the moment, so I'm not
 > >>>>> sure how OTP 19 or 20 would behave.
 > >>>>>
 > >>>>> I tried to strace the processes, but there's too much noise and I
 > >>>>> couldn't yet figure out anything interesting there.
 > >>>>>
 > >>>>> I also tried to diff OTP 17 & 18, but os:cmd/1 and friends didn't
 > >>>>> change. I'm not sure about the port code, but at least the release notes
 > >>>>> didn't mention anything major. Or did I miss something? Does anyone have an
 > >>>>> idea what may have changed between these OTP versions?
 > >>>>>
 > >>>>> Thanks,
 > >>>>> Daniel
 > >>>>>
 > >>>>> _______________________________________________
 > >>>>> erlang-questions mailing list
 > >>>>> 
 > >>>>> http://erlang.org/mailman/listinfo/erlang-questions
 > >>>>>
 > >>>>>
 > >>>>
 > >>
 > 
