[erlang-questions] os:cmd hang on OTP 18

Dániel Szoboszlay dszoboszlay@REDACTED
Fri Jul 28 11:01:18 CEST 2017


I think I found the bug. After a fork/vfork the child process needs some
initialisation. In case of vfork, this is done by executing the
erl_child_setup program. In case of a fork however the setup code is in the
same module, sys.c, where the fork happens. There are comments about how
important it is to keep erl_child_setup.c and the relevant parts of sys.c in
sync.

In OTP 18 erts began to use a new signal, ERTS_SYS_SUSPEND_SIGNAL
internally. This signal is in fact SIGUSR2. The child process has to
unblock all signals used by erts as part of its initialisation. And there's
an inconsistency here between the vfork
<https://github.com/erlang/otp/blob/OTP-18.3.4.5/erts/emulator/sys/unix/erl_child_setup.c#L134-L136>
and fork
<https://github.com/erlang/otp/blob/OTP-18.3.4.5/erts/emulator/sys/unix/sys.c#L1005-L1022>
cases: erl_child_setup.c does not unblock SIGUSR2.

And it turns out that lbzip2 wants to use SIGUSR2 for communication between
its worker processes, but this signal is blocked when we call it from
Erlang's os:cmd (with vfork), so the program hangs.

This patch to erts/emulator/sys/unix/erl_child_setup.c solved the problem:

@@ -134,6 +134,7 @@ main(int argc, char *argv[])
     sys_sigrelease(SIGCHLD);
     sys_sigrelease(SIGINT);
     sys_sigrelease(SIGUSR1);
+    sys_sigrelease(SIGUSR2);

     if (erts_spawn_executable) {
  if (argv[CS_ARGV_NO_OF_ARGS + 1] == NULL) {


Daniel

On Thu, 27 Jul 2017 at 09:21 Dániel Szoboszlay <dszoboszlay@REDACTED>
wrote:

> Thanks Dmytro, this really helped a lot!
>
> I think the commit you pointed to is not directly related: it only changes
> Erlang code, and if the behaviour depends on whether you are using a
> release/debug build, the root cause is most probably somewhere in the C
> code of erts.
>
> But the commit message talks about the emulator no longer using vfork, and
> it was a good clue: disabling vfork on 18 prevents the problem. So this one
> will finish:
>
> ERL_NO_VFORK=true erl +A0 +S 1:1 -noinput -noshell -eval 'os:cmd("tar -C /tmp/ -xf /tmp/tartest --use-compress-program=lbzip2"), init:stop().'
>
> Thanks again for the clue, I will look into the difference between using
> fork/vfork in OTP 18!
>
> Daniel
>
> On Thu, 27 Jul 2017 at 02:09 Dmytro Lytovchenko <
> dmytro.lytovchenko@REDACTED> wrote:
>
>> I could observe the behaviour only in R18, not in R19 or R20.
>> I also could not reproduce it in the debug flavour of the R18 emulator,
>> but it reproduces reliably in the release SMP variant.
>>
>> The changes to os.erl between 18.3.4.5 and 19.0 include the removal of the
>> os:cmd server, which might somehow be related (commit
>> *200247f972b012ced0c4b2c6611f091af66ebedd*). This commit *possibly*
>> fixes the behaviour: in R19 (19.0 built by Kerl) the hang does not
>> happen.
>>
>> 2017-07-26 21:47 GMT+02:00 Dániel Szoboszlay <dszoboszlay@REDACTED>:
>>
>>> Honestly, I didn't try with other command variations. There are many
>>> commands that do not hang when run from os:cmd, regardless of the OTP
>>> version. But this particular command does hang with one OTP version, and
>>> not with the other OTP version. So the difference is in OTP, and I want to
>>> find out what has changed.
>>>
>>> Daniel
>>>
>>> On Wed, 26 Jul 2017 at 21:34 Dmytro Lytovchenko <
>>> dmytro.lytovchenko@REDACTED> wrote:
>>>
>>>> Is it something lbzip2 related?
>>>> Did you try normal single-threaded bzip2? (-j flag or --bzip2)
>>>> What if you use gzip? (-z or --gzip)
>>>>
>>>> 2017-07-26 21:27 GMT+02:00 Dániel Szoboszlay <dszoboszlay@REDACTED>:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've encountered a strange problem with os:cmd when running tar and
>>>>> lbzip2. Steps to reproduce:
>>>>>
>>>>> # create some lbzip2 compressed data
>>>>>
>>>>> dd if=/dev/urandom of=/tmp/testfile count=10
>>>>> tar -cf - -C /tmp testfile | lbzip2 -6 -n 4 | dd of=/tmp/tartest status=none
>>>>>
>>>>>
>>>>> # try to extract the archive from Erlang with os:cmd
>>>>>
>>>>> erl -noinput -eval 'os:cmd("tar -C /tmp/ -xf /tmp/tartest --use-compress-program=lbzip2"), init:stop().'
>>>>>
>>>>>
>>>>> This worked fine with OTP 17.5.6.7, but with OTP 18.3.4.5 the command
>>>>> hangs: lbzip2 just sits in a rt_sigsuspend syscall waiting for a USR2, PIPE
>>>>> or XFSZ signal. Its parent, the tar process, waits in a wait4 syscall
>>>>> for lbzip2 to terminate.
>>>>>
>>>>> I don't have any newer OTP version installed at the moment, so I'm
>>>>> not sure how OTP 19 or 20 would behave.
>>>>>
>>>>> I tried to strace the processes, but there's too much noise, and I
>>>>> haven't yet been able to find anything interesting there.
>>>>>
>>>>> I also tried to diff OTP 17 & 18, but os:cmd/1 and friends didn't
>>>>> change. I'm not sure about the port code, but at least the release notes
>>>>> didn't mention anything major. Or did I miss something? Does anyone have an
>>>>> idea what may have changed between these OTP versions?
>>>>>
>>>>> Thanks,
>>>>> Daniel