[erlang-questions] is anyone else experiencing reliability issues with R15?

Thu Sep 20 12:34:06 CEST 2012

Hello,

If you ever suspect that the Erlang VM is blocking for some reason,
first make sure that the problem is not in Erlang space, e.g. a
process waiting for a message which never arrives or something like
that.

When you are sure it is the VM that is the problem, the most
informative thing to do (IMO) is to either attach gdb to the
emulator's OS process or dump a core using kill -ABRT.

Once you have gdb attached to the process, or have loaded a core, do:

(gdb) info threads

and then for each thread do:

(gdb) thread ${ThreadId}
(gdb) bt

This will give you a bunch of information about what the emulator is doing.
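
The per-thread steps above can also be collected in one go with gdb's
"thread apply all" command. A minimal sketch, assuming gdb is installed
and you know the OS pid of the emulator; the function name
beam_thread_dump and the GDB override variable are made up for
illustration and are not part of any Erlang tooling:

```shell
# Hypothetical helper: print a backtrace of every thread in a running emulator.
# $1 is the OS pid of the emulator process.
# GDB is an assumed override variable (defaults to gdb) so that a different
# gdb binary can be substituted.
beam_thread_dump() {
    pid=$1
    # -batch makes gdb run the -ex commands and then exit, which also
    # detaches from the emulator again.
    "${GDB:-gdb}" -p "$pid" -batch -ex "thread apply all bt"
}
```

For example, beam_thread_dump 12345 > threads.txt saves the output for
later reading. Keep in mind that the emulator is stopped for as long as
gdb is attached.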

There are also a couple of tools which can help you debug specific
things within the emulator. For instance if you do

(gdb) source $ERL_TOP/erts/etc/unix/etp-commands
(gdb) etp-help

you get a list of helpful commands which can print all sorts of
interesting data. One example is etp-stacktrace, which given a Process
* will print the stacktrace of that process, e.g.:

(gdb) bt
#0  0x00007ffff6aa19a8 in __GI___poll (fds=0x7ffff67bac08, nfds=2, timeout=<optimised out>) at ../sysdeps/unix/sysv/linux/poll.c:83
#1  0x000000000062c9b0 in check_fd_events (tv=0x7fffffffbf00, ps=0x7ffff67ba100, max_res=<optimised out>) at sys/common/erl_poll.c:1974
#2  erts_poll_wait_nkp (ps=0x7ffff67ba100, pr=0x7fffffffb700, len=0x7fffffffbf10, utvp=<optimised out>) at sys/common/erl_poll.c:2087
#3  0x000000000062f528 in erts_check_io_nkp (do_wait=<optimised out>) at sys/common/erl_check_io.c:1173
#4  0x00000000006259de in erl_sys_schedule (runnable=<optimised out>) at sys/unix/sys.c:2734
#5  0x0000000000551de5 in scheduler_wait (rq=0x7ffff687b080, esdp=0x7ffff687b2c0, fcalls=<synthetic pointer>) at beam/erl_process.c:2195
#6  schedule (p=<optimised out>, calls=<optimised out>) at beam/erl_process.c:6377
#7  0x00000000005c8867 in process_main () at beam/beam_emu.c:1229
#8  0x000000000051334c in erl_start (argc=10, argv=<optimised out>) at beam/erl_init.c:1493
#9  0x00000000004f55f9 in main (argc=<optimised out>, argv=<optimised out>) at sys/unix/erl_main.c:29
(gdb) f 7
#7  0x00000000005c8867 in process_main () at beam/beam_emu.c:1229
(gdb) etp-stacktrace c_p
% Stacktrace (22): <the non-value>.
#Cp<user:do_io_request/5+0x78>.
#Cp<user:server_loop/2+0x5a0>.
#Cp<user:catch_loop/3+0x90>.
#Cp<terminate process normally>.
(gdb) p c_p
$1 = (Process *) 0x7ffff7e87348
(gdb)

If the VM seems to block for a while and then continues to run, it
could be because all schedulers are hitting the same lock at the same
time. Use an emulator built with --enable-lock-counter[1] to figure
out which lock is causing the issue. gprof and oprofile can also be
very useful when used correctly, though their output is at times
quite hard to interpret.

If you find something that seems fishy while investigating, try to
narrow the scope of the potential bug as much as you can. The more
specific you are in the description of the (mis)behaviour you are
experiencing, the more likely it is that we can help you.

Lukas

[1]: http://www.erlang.org/doc/apps/tools/lcnt_chapter.html

On Thu, Sep 20, 2012 at 8:43 AM, Ali Sabil <ali.sabil@REDACTED> wrote:
> We have experienced similar issues with R15B01 where the I/O will get
> completely blocked, but we haven't really been able to track it down,
> the suspect we had was the usage of sendfile.
>
> On Thu, Sep 20, 2012 at 3:02 AM, Fred Hebert <mononcqc@REDACTED> wrote:
>> We've had problems in R15B01 with particular statistics functions related to
>> schedulers, as described in
>> http://erlang.org/pipermail/erlang-bugs/2012-July/002964.html
>>
>> To date there is no solution and we just stopped using these functions,
>> going back to run queues.
>>
>> We also have seen a non-negligible increase in CPU usage from R14B*
>> versions, easily around 20% or so during regular workload, although it
>> didn't seem to affect heavy overload situations too negatively for us (no
>> precise measurements were made for this, just casual observations). It
>> remained high no matter what arguments we gave to the VM.
>>
>> We have noticed nodes getting locked up in R15B01 from time to time when
>> memory on the server is getting scarce, taken by other applications -- it
>> seemed we had a lot of contention on proc_tab mutexes, but nothing came out
>> of it. We eventually reduced memory usage in other applications and things
>> have been rather stable since then.
>>
>> Other than that, everything appeared normal, and none of the blocking
>> incidents could be directly attributed to issues you appear to have. We
>> haven't seen memory ballooning except in occasional error logger cases, but
>> most of our processes are extremely short-lived (well under 150ms).
>>
>> On 12-09-19 4:11 PM, Rapsey wrote:
>>
>> We run a network of custom built streaming servers doing video streaming and
>> transcoding of IPTV channels.
>> On R14 everything runs great. But switching to R15, gen_servers inexplicably
>> block and don't respond to messages, even the console blocks and does not
>> respond to input for 30s or so, and processes balloon, taking up large amounts of
>> memory for no reason. All at random times, but gets much worse once there
>> are more users connected to the server doing a lot of req/s or receiving a
>> lot of data.
>> We're running Ubuntu Server and start Erlang with these switches:
>> erl +Bd +S 4 +P 1000000 -env ERL_MAX_PORTS 100000 +K true +A 32
>>
>> Are we alone having problems with R15? We tried R15B01 and R15B02.
>>
>>
>>
>> Sergej
>>
>>
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-questions