[erlang-questions] user process suspended

Scott Lystig Fritchie fritchie@REDACTED
Thu Sep 24 07:37:22 CEST 2015


Samuel <samuelrivas@REDACTED> wrote:

sr> We are starting to suspect that it is the lager console_backed that
sr> hangs in a send to the io port. We have just disabled it and we'll
sr> see if it helps.

Hi, Samuel.  Basho has fought a similar problem with console I/O
hanging.  IIRC we had the most problems with Solaris/Illumos/SmartOS.
My poor memory is forgetting which BEAM runtimes were succeptible.  I
also don't know if or how often the problem was reported to us on other
OS platforms.

There were a couple of diagnostics that we used:

1. Getting process_info(Pid, backtrace) for all procs.  Then we'd see
that the ... hrm, which process ... hrrrrm, Riak version 1.2 used R15B01,
I think ... the 'user' process?  This may or may not resemble the hang,
see the end of this message.

2. Connecting to the Riak console via the "to_erl" scheme to talk to The
VM's Actual Console.  Then the admin would press Return ... and the Riak
node would spring to life, problem solved.  Until next time.

IIRC disabling console output was the workaround.

I've lost the ops history of when that problem was fixed VM/runtime-wise
and/or Riak's workarounds.  But if this sounds like your problem, then
it's worth digging for that history.

Oh, here's a piece of luck.  I've rediscovered that I'd written an
Expect script to automate the process for triggering the problem.  I'll
attach that, also, in case it's helpful.

-Scott

--- snip --- snip --- snip --- snip ---

=proc:<0.30.0>
State: Waiting
Name: user
Spawned as: erlang:apply/2
Current call: user:get_chars_more/7
Spawned by: <0.29.0>
Started: Tue Apr  9 16:16:46 2013
Message queue length: 1
Message queue: [{io_request,<0.186.0>,<0.179.0>,{put_chars,unicode,<<821 bytes>>}}]
Number of heap fragments: 0
Heap fragment data: 0
Link list: []
Dictionary: [{shell,<0.31.0>},{read_mode,list},{unicode,false}]
Reductions: 357542
Stack+heap: 610
OldHeap: 0
Heap unused: 152
OldHeap unused: 0
Stack dump:
Program counter: 0x00007f0c51d92930 (user:get_chars_more/7 + 232)
CP: 0x0000000000000000 (invalid)
arity = 0

0x00007f0c51931710 Return addr 0x00007f0c51d8c498 (user:do_io_request/5 + 88)
y(0)     unicode
y(1)     {[],[]}
y(2)     #Port<0.630>
y(3)     {erl_scan,tokens,[1]}
y(4)     get_until
y(5)     io_lib
y(6)     {erl_scan_continuation,[],no_col,[],2,{erl_scan,#Fun<erl_scan.3.84904554>,false,false,false},0,#Fun<erl_scan.25.84904554>}

0x00007f0c51931750 Return addr 0x00007f0c51d8c308 (user:server_loop/2 + 1408)
y(0)     #Port<0.630>
y(1)     <0.30.0>
y(2)     <0.25333.1728>

0x00007f0c51931770 Return addr 0x00007f0c51d8b968 (user:catch_loop/3 + 112)
y(0)     #Port<0.630>

0x00007f0c51931780 Return addr 0x0000000000836c78 (<terminate process normally>)
y(0)     <0.31.0>
y(1)     #Port<0.630>
y(2)     Catch 0x00007f0c51d8b968 (user:catch_loop/3 + 112)

--- snip --- snip --- snip --- snip ---

#!/usr/bin/expect --

## Assumptions:
##
## 1. SSH has been used to log into "hostname" once before: we aren't
##    smart enough to deal with SSH host key verification stuff.
##
## 2. This script can successfully use SSH to log into hostname using
##    the "riak" user.
##
## 3. If SSH does not require a password for the "riak" user on
##    hostname, then please break that feature so that the SSH
##    password authentication method is required.
##
## 4. We assume that the "riak" account's shell prompt will include
##    a "$" in it.  If that is not a correct assumption, please fix
##    the login_prompt procedure below.
##
## 5. This script will stop Riak and kill any other beam.smp processes
##    that happen to be running on hostname.
##
## 6. We assume riak is SMP-capable and look at beam.smp processes only.


## Usage: this-script hostname password riak_path erl_path cookie riak_node_name
##
## For example, if the remote machine is using a Linux CentOS Riak package:
##
## expect this-script 172.16.72.130 foofoo /usr/sbin/riak /usr/lib64/riak/erts-5.9.1/bin/erl riak riak@REDACTED

## Reminder reminder reminder: never use single quotes
## Reminder reminder reminder: never use single quotes
## Reminder reminder reminder: never use single quotes

set timeout 30
set erl_prompt {@[^)]+\)\d+> }

set hostname [lindex $argv 0]
set password [lindex $argv 1]
set riak_path [lindex $argv 2]
set erl_path [lindex $argv 3]
set cookie [lindex $argv 4]
set riak_node_name [lindex $argv 5]

proc login_prompt {} {
    expect -re {\$}
    # If a dollar sign does not appear in the prompt, comment out
    # the line above, uncomment the line below, then edit the line
    # below to include the correct prompt string.
    # expect -re {some prompt regexp here}
}

proc erlang_prompt {} {
    global erl_prompt

    expect -re $erl_prompt
}

proc cmd {cmd} {
    send "$cmd\r"
    login_prompt
}

proc async_cmd {cmd} {
    send "$cmd\r"
}

proc erlang_cmd {cmd} {
    send "$cmd\r"
    erlang_prompt
}

proc announce {msg} {
    puts "\n\n*** [exec date]\n*** $msg"
}

proc switch_to {id} {
    global spawn_id

    announce "Switching to session $id"
    set spawn_id $id
}

# send_user "Password for host $hostname: "
# stty -echo
# expect_user -re "(.*)\n"
# stty echo
# set password $expect_out(1,string)
# puts ""

    spawn ssh -l riak $hostname
    set session1 $spawn_id
    expect {password}
    send "$password\r"
    login_prompt

announce "Killing Riak and all other beam.smp processes"
switch_to $session1
cmd {ps axww | grep beam.smp | grep -v grep | awk '{print $1}' | xargs kill -9 ; sleep 1}
cmd {ps -ef | grep beam.smp | grep -v grep | awk '{print $2}' | xargs kill -9 ; sleep 1}

announce "Starting Riak"
    spawn ssh -l riak $hostname "$riak_path start"
    set session2 $spawn_id
    expect {password}
    send "$password\r"
    expect "ok"
set after_start 10
announce "Got ok from '$riak_path start', sleeping $after_start seconds..."
    sleep $after_start

announce "Attach to riak, do sanity test"
switch_to $session1
async_cmd "$riak_path attach"
send "hello_world.\r"
erlang_prompt

announce "2nd login session to $hostname, attach to riak via remsh"

    spawn ssh -l riak $hostname
    set session2 $spawn_id
    expect {password}
    send "$password\r"
    login_prompt

switch_to $session2
async_cmd "$erl_path -setcookie $cookie -name tmp@REDACTED -remsh $riak_node_name"
erlang_prompt
erlang_cmd "hello."

announce "Log a message via lager, watch for it over on Riak console"
switch_to $session2
erlang_cmd "lager:log(warning, self(), \"WARNING 1\", \[\])."

switch_to $session1
expect {
    "WARNING 1" { announce "Found our first warning, excellent!" }
    timeout     { puts "BUMMER 1!" ; exit 4 }
}

announce "Now, to hang the logger to demonstrate the bug.\n    Any incomplete term is fine, we'll use an unfinished tuple...."
switch_to $session1
async_cmd "{tuple_item_number_1,"

set bug_timeout 12
announce "Now that console is hung, call lager:log().\n    Set timeout time to $bug_timeout seconds.\n    This function should execute in only a few milliseconds."

set timeout $bug_timeout

switch_to $session2
async_cmd "lager:log(warning, self(), \"WARNING 2\", \[\])."

switch_to $session1
expect {
    "WARNING 2" { announce "BUMMER, Found 2nd warning at wrong time"; exit 4}
    timeout     { announce "TIMEOUT! Excellent, the bug has been demonstrated (1)" }
}

announce "When we finish the erlang term over on $session1 ('riak attach' session),\n    then the logger will become unblocked."

switch_to $session1
async_cmd "element_number_2}."
expect {
    -re $erl_prompt { announce "Excellent, we have 'riak attach' prompt" }
    timeout     { announce "BUMMER, timeout when not expected (2)"; exit 4}
}

announce "Now go to session $session2 and expect the 2nd warning\n    _and_ the -remsh's prompt."
switch_to $session2
expect {
    "WARNING 2" { announce "Excellent, I see the 'WARNING 2' log message" }
    timeout     { announce "BUMMER, timeout when not expected @ 2nd w"; exit 4}
}
expect {
    -re $erl_prompt { announce "Excellent, we have the 'riak -remsh' prompt" }
    timeout     { announce "BUMMER, timeout when not expected (3)"; exit 4}
}

announce "Time to clean up both sessions"

switch_to $session1
async_cmd "\004"
login_prompt
async_cmd "exit\r"
expect eof

switch_to $session2
async_cmd "\003" ; sleep 1
async_cmd "\003" ; sleep 1
login_prompt
async_cmd "exit\r"
expect eof

announce "Excellent, this successful test is finished."
exit 0



More information about the erlang-questions mailing list