<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Nov 16, 2020 at 7:12 PM José Valim &lt;<a href="mailto:jose.valim@gmail.com">jose.valim@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Hi everyone,</div><div><br></div><div>I am working on TensorFlow bindings and, at some point, TensorFlow forks a child process to invoke a separate program. Unfortunately, when running inside the Erlang VM, TensorFlow fails when calling waitpid, <a href="https://github.com/tensorflow/tensorflow/blob/7b637feb1d145d606a7b69481fd4943f3086d5a2/tensorflow/core/platform/default/subprocess.cc#L314-L323" target="_blank">in exactly this line</a>.<br></div><div><br></div><div>After some debugging, we found out that the root cause is that the Erlang VM sets SIGCHLD to SIG_IGN. <a href="https://www.mkssoftware.com/docs/man3/waitpid.3.asp" target="_blank">According to the waitpid docs</a>:</div><div><br></div><div>> If the calling process sets SIGCHLD to SIG_IGN, and the process has no unwaited for children that were transformed into zombie processes, the calling thread blocks until all of the children of the process terminate, at which time waitpid() returns -1 with errno set to ECHILD.</div><div><div><br></div><div>Setting os:set_signal(sigchld, default) fixes the issue. However, it leaves me wondering:</div><div><br></div><div>1. Is it safe to set SIGCHLD back to the default? Or is the VM expecting it to be ignored? Are there any implications we should be aware of?<br></div><div><br></div><div>2. 
If it is safe to leave it at the default, why is it being ignored in the first place?<br></div></div></div></blockquote><div><br></div><div><a href="https://github.com/erlang/otp/blob/master/erts/emulator/sys/unix/sys.c#L686-L694">https://github.com/erlang/otp/blob/master/erts/emulator/sys/unix/sys.c#L686-L694</a><br></div><div><br></div><div>The VM itself does not care, but some other systems do, e.g. Docker.</div><div><br></div><div>It should be fine to change it, as long as you are aware that you will leak zombie processes if Erlang runs as PID 1.</div><div><br></div><div>Calling waitpid in a NIF may work now, but we give no guarantee that it will keep working in the future. In fact, before OTP 19, doing that would have broken a lot of code.</div><div><br></div><div>Lukas</div></div></div>
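<div>For illustration, the waitpid behaviour discussed in this thread can be reproduced outside the VM. The following is only a minimal C sketch, not code from Erlang/OTP or TensorFlow: the helper name wait_for_child and the exit status 7 are made up for the example, and it assumes a POSIX system that follows the documented SIG_IGN semantics (for example Linux 2.6 or later).</div>

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper for this example only.
 * Sets the SIGCHLD disposition, forks a child that exits immediately
 * with status 7, and then waits for it.
 * Returns the child's exit status if waitpid() succeeded and the child
 * exited normally, 255 if it ended any other way, or -errno if
 * signal(), fork() or waitpid() failed. */
int wait_for_child(void (*disposition)(int))
{
    if (signal(SIGCHLD, disposition) == SIG_ERR)
        return -errno;

    pid_t pid = fork();
    if (pid < 0)
        return -errno;
    if (pid == 0)
        _exit(7);               /* child: exit right away */

    int status = 0;
    pid_t r;
    do {
        r = waitpid(pid, &status, 0);
    } while (r < 0 && errno == EINTR);   /* retry if interrupted */

    if (r < 0)
        return -errno;          /* ECHILD when the child was auto-reaped */
    return WIFEXITED(status) ? WEXITSTATUS(status) : 255;
}
```

<div>Under those assumptions, wait_for_child(SIG_IGN) is expected to fail with -ECHILD, because the kernel reaps the child itself and waitpid has nothing left to collect, while wait_for_child(SIG_DFL) should return the child's exit status. This mirrors what a library sees when it calls waitpid inside a process that ignores SIGCHLD.</div>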