[erlang-questions] "New" vs. "old" console behavior: bug or feature?

Wed Apr 24 15:46:02 CEST 2013

Hi Scott,

The IO world of Erlang is a fun crazy thing :)

I've spent time trying to document how the shell works back at
http://ferd.ca/repl-a-bit-more-and-less-than-that.html. I'll do a quick
roundup of things just to be clear on everything.

Before going into the difference between 'new' and 'old' shells, there
is a 'user' process, which you mentioned, part of the IO system. The
'user' process acts as a default top-level group leader for all the
output coming from a process. All group leaders are inherited from the
process' parent. They can also be modified, so that you may have
different group leaders across a VM: they are local processes,
middle-men (like application_controller), or remote processes (this is
how output from RPC calls ends up printed on the calling node).
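
As a rough illustration of the inheritance and of swapping a group
leader, here is a minimal sketch. The module name gl_demo and its
functions are made up for this example; only erlang:group_leader/0,
erlang:group_leader/2 and whereis/1 are real calls.

```erlang
-module(gl_demo).
-export([inherited/0, redirected/0]).

%% A spawned process inherits the group leader of its parent.
inherited() ->
    Parent = self(),
    spawn(fun() -> Parent ! {child_gl, erlang:group_leader()} end),
    receive
        {child_gl, ChildGL} -> ChildGL =:= erlang:group_leader()
    after 1000 -> timeout
    end.

%% A group leader can be replaced: here the child's output is pointed
%% straight at 'user' before the child prints anything.
redirected() ->
    Parent = self(),
    Child = spawn(fun() ->
        receive go -> ok end,
        io:format("hello from ~p~n", [self()]),
        Parent ! {child_gl, erlang:group_leader()}
    end),
    true = erlang:group_leader(whereis(user), Child),
    Child ! go,
    receive
        {child_gl, GL} -> GL =:= whereis(user)
    after 1000 -> timeout
    end.
```

Both functions should return true on a node that has a 'user' process
running.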

By default, every OTP app will put its controller as a group leader for
all sub-processes. This group leader will redirect output, but it also
overloads the feature to kill rogue processes on shutdown (it makes a
list of all processes, inspects their group leader, and if it's the
current app's pid, kills said process). Other tools like eunit and
Common Test can inject themselves above test cases and pick what to
print or not. By sending IO directly to 'user', we bypass that
hierarchy and go straight to the node's main IO process. Other special
cases can be used, such as 'standard_error', which will redirect output
to the error channel.
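
To make the three destinations concrete, a small sketch (the module
name io_targets is just for illustration; the io:format calls are the
standard ones):

```erlang
-module(io_targets).
-export([demo/0]).

demo() ->
    %% Goes to the caller's group leader (a shell group, an
    %% application's group leader, a test tool's capture process, ...).
    ok = io:format("via group leader~n"),
    %% Bypasses that hierarchy: straight to the node's 'user' process.
    ok = io:format(user, "via user~n", []),
    %% Written to the error channel instead of standard output.
    ok = io:format(standard_error, "via standard_error~n", []),
    ok.
```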

That being said, there are two default implementations of a process that
registers itself as 'user' on a node: the new (current) shell, and the
'old' shell. The choice of which one to pick is determined at boot time
by the user_sup.erl module (part of kernel) through system flags:

- If the node is a slave node, the 'user' module will point to a remote
  process.
- If the node is started with no special flag, the new shell is started
  through 'user_drv'. This 'user' proc will act as a middle-man between
  input and output with a tty program and the different Erlang groups
  (see group.erl in kernel) to allow multiple jobs and concurrent shells
  without messed up output. Evaluation is handled by shell.erl (stdlib).
- If the node is started with the -oldshell flag, the process in charge
  is 'user.erl', which uses special IO devices ({fd,0,1} for IO) to deal
  with the input and output channels for the node directly. It will send
  the evaluation to shell.erl also.
- If the node is started with -noshell, the 'user.erl' module is still
  booted, but will not evaluate any input nor forward it.
- If the node is started in -noinput mode, the 'user.erl' module is
  still booted, but it will not forward any input, only output from the
  node. It's a superset of -noshell and a bit safer because it opens the
  IO port in a way that only has the 'out' channel open.
- There is an undocumented -nouser flag. Such a flag makes sure that
  neither user.erl nor user_drv.erl are started. The node will crash
  unless you specifically decide to start a process that registers
  itself as 'user' and decides to handle IO for your node. This is what
  you should use were you planning to provide your own Erlang shell and
  boot it as 'erl -nouser -s custom_shell'.
- If it's not possible to boot the tty used by 'user_drv', it should
  fall back to 'user.erl' as an IO leader.
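
If you want to check from inside a running node which of these cases
you ended up in, something like the following heuristic may help. It is
only a sketch: the module name which_user and the returned atoms are
made up, and it assumes the user_drv process registers itself under the
name 'user_drv', which may not hold on every OTP release.

```erlang
-module(which_user).
-export([guess/0]).

%% Heuristic only: guess which implementation sits behind 'user'.
guess() ->
    case {whereis(user), whereis(user_drv)} of
        %% No 'user' at all, e.g. the node was started with -nouser.
        {undefined, _} -> no_user;
        %% 'user' exists but no user_drv: -oldshell, -noshell, -noinput,
        %% the fallback when no tty is available, or a slave node.
        {_, undefined} -> old_or_other;
        %% Both exist: the new shell.
        {_, _} -> new_shell
    end.
```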

Alright. That covers most of it for the basics.

To figure out why it blocks, we need to figure out the evaluation. The
evaluation itself happens in a shell.erl process, which does an io
request to the 'user' process (technically, its own group_leader, so
that anyone may use the evaluator where they want. It just happens to be
the 'user' process in this case).

 Input --> user.erl <---> shell.erl

The shell does an io-request to user, which asks to read characters.
The user.erl process forwards that data to the shell. The shell
attempts to evaluate it, and if there's not enough data, it asks for
more. user.erl then blocks until it can get more data to respond to the
io request.

When output is sent to 'user' it's sent as an additional io request, as
a message. This message will not be read until the shell can answer the
previous request. This is where you block.

 Input --> user.erl <---> shell.erl
            ^----> other proc
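
Spelled out by hand, the output request from the other process looks
roughly like the sketch below. The message shapes follow the documented
Erlang I/O protocol; the module name io_req_demo and the timeout are
additions for this example and not something io:format/3 has.

```erlang
-module(io_req_demo).
-export([put_chars/3]).

%% Send one output request to an I/O server and wait for its reply.
%% Server is a pid or a registered name such as 'user'.
put_chars(Server, Chars, Timeout) ->
    ReplyAs = make_ref(),
    Server ! {io_request, self(), ReplyAs, {put_chars, unicode, Chars}},
    receive
        {io_reply, ReplyAs, Reply} -> Reply
    after Timeout ->
        %% io:format/3 has no such clause: as long as the server is
        %% alive and busy with an earlier request, it keeps waiting.
        %% A late reply may still land in the caller's mailbox.
        {error, timeout}
    end.
```

Calling io_req_demo:put_chars(user, <<"Hey!\n">>, 1000) against an old
shell that is waiting for the rest of a term should give
{error, timeout} rather than hang.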

The new shell does things differently by using a 'group.erl' process for
each IO group. Now each group.erl process has the same potential to
block, with the exception that user_drv.erl will start one very specific
'group.erl' process to be 'user', and will not return it as a potential
shell.erl input source (it would be 0 in '^G -> j', and it is not
possible to select it). user_drv will also consider it to be a special
group that can *always* output to tty, whereas other groups will have
their output dropped by default when they're not the currently active
one (hence you do not get other shells' output by default when you
switch tasks). This means that while you could block things by finding
the specific 'group.erl' you're currently sending IO requests to by
default, it's unlikely to happen by accident, and 'user' is now a safe
process to send IO requests to.
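
For code that cannot know which shell it will run under, one possible
defensive pattern is to do the write from a throwaway process and give
up after a timeout. This is only a sketch of a workaround, not
something OTP provides; the module name safe_user_io and the default
timeout are made up.

```erlang
-module(safe_user_io).
-export([format/2, format/3]).

%% Hypothetical default of one second.
format(Fmt, Args) ->
    format(Fmt, Args, 1000).

%% Write to 'user' from a helper process so the caller never blocks
%% for longer than Timeout milliseconds.
format(Fmt, Args, Timeout) ->
    {Pid, Ref} = spawn_monitor(fun() -> io:format(user, Fmt, Args) end),
    receive
        {'DOWN', Ref, process, Pid, normal} -> ok;
        {'DOWN', Ref, process, Pid, Reason} -> {error, Reason}
    after Timeout ->
        erlang:demonitor(Ref, [flush]),
        exit(Pid, kill),
        %% The request is already in the mailbox of 'user', so the
        %% text may still be printed once 'user' gets around to it.
        {error, timeout}
    end.
```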

I hope this explains things. I would find it difficult to call it a bug
given a solution exists to the problem already, but I do see why the
fallback to the old shell when no tty is available could be problematic.
I'm guessing it would be possible to make a 'raw shell', which does
tasks similar to user_drv, but using a user.erl-like adapter instead of
a tty program to communicate with and starting it with 'erl -nouser -s
rawshell' or something, or eventually making it the default that
user_drv falls back to instead of 'user:start()'. I'm guessing this
would be a very low priority for the OTP team, though.

I hope this lengthy response answers your questions!

Regards,
Fred.

On 04/23, Scott Lystig Fritchie wrote:
> Hi, all.  I can't figure out if this message should be sent to the
> erlang-bugs list or the erlang-questions list ... so I'll go for the
> more general audience.
> 
> Summary: Starting Erlang with or without a tty/pseudo-tty can get you
> a different console shell ("new" and "old", respectively) without you
> realizing it.(*) If you don't know that you're using the old shell,
> and if a
> process tries to send output to the 'user' registered process(**),
> e.g. io:format(user, "Some message with ~p extra\n", [Extra]), then it
> is possible that the io:format() call will not return for
> seconds/minutes/hours/ever.
> 
> My question: Is the kind of indefinite blocking on I/O described below a
>              bug or a feature?
> 
> I have a test case that can reproduce this behavior.  An automated
> version (using Expect) can be found at:
> 
>     https://gist.github.com/slfritchie/ad8e5cf1603cbe326be7
> 
> The basics of reproducing the hang are:
> 
>     SSH session #1                      SSH session #2
>     --------------                      --------------
>     Start an Erlang daemon
>     using "run_erl".
> 
>     Attach to the daemon's console
>     using "to_erl".
> 
>                                         Start another Erlang VM
>                                         and connect to the first
>                                         VM via "-remsh".
> 
>     At the console, type the
>     following and press ENTER:
>         {term1, 
> 
>                                         Run this command:
>                                             io:format(user, "Hey!\n", []).
> 
> The io:format/3 call in session #2 will behave differently if session
> #1's "run_erl" command runs with a tty/pseudo-tty or without.
> 
>     A. With a tty/pty: The io:format() call returns immediately.
>     B. Without a tty/pty: The io:format() call will hang indefinitely.
>        It will remain blocked until the Erlang term parser in session #1
>        has returned.  For example, finishing the term with "term2}." and
>        then pressing ENTER.
> 
> The same effect can be seen by forcing the use of the old shell, without
> using SSH, by simply running "erl -oldshell" for session #1 (in an Xterm
> or other terminal window, or at the machine's hardware console) instead
> of using SSH + "run_erl" + "to_erl".
> 
> Riak was the application that triggered this bug hunt (in conjunction
> with the Lager app)(***).  Finding it has taken much longer than anyone
> guessed.  The reason is that the necessary precondition, starting Erlang
> via 'run_erl' via SSH without an associated tty/pseudo-tty, is not
> common.  (Riak's packaging uses "sudo", which refuses to run if there
> isn't a tty/pty available.)
> 
> All attempts to duplicate the behavior failed because we didn't
> understand that the root cause of the bad behavior was the old console
> being silently chosen at VM startup when no tty/pty is available.
> 
> -Scott
> 
> (*) See
> https://github.com/erlang/otp/blob/maint/lib/kernel/src/user_drv.erl#L103
> for how the choice is made.
> 
> (**) From the 'io' man page:
> 
>        There is always a process registered under the name of user. This
>        can be used for sending output to the user.
> 
> ... where "output to the user" really means "output to the Erlang
> virtual machine console."
> 
> (***) For source code of Riak and Lager, respectively, see:
>     https://github.com/basho/riak
>     https://github.com/basho/lager