[erlang-bugs] Race bug in net_kernel:get_status()

Tue Jan 31 23:11:02 CET 2012

Hi, all.  I believe that there's a race condition bug in
net_kernel:get_status().  The system that we're seeing this on is based
on R14B04, but since R15B uses the same version of net_kernel.erl as
R14B04, R15B may also have this problem.

A process that calls net_kernel:nodes_info/0 can hang indefinitely.  A
backtrace of such a hung process looks like this:

    Program counter: 0x0000000004281120 (net_kernel:get_status/3 + 216)
    CP: 0x0000000000000000 (invalid)
    arity = 0

    0xfffffd7fd7b3e9d0 Return addr 0x0000000004280e68 (net_kernel:get_node_info/1 + 424)
    y(0)     'lhs@REDACTED'
    y(1)     <0.241.0>

    0xfffffd7fd7b3e9e8 Return addr 0x00000000042815b0 (net_kernel:get_nodes_info/2 + 96)
    y(0)     normal
    y(1)     {net_address,{{192,168,24,199},39734},"node-x.example.com",tcp,inet}
    y(2)     <0.241.0>
    y(3)     up

    [....]
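
Until the race itself is understood, one possible stopgap is to keep
the caller from blocking forever by running the call in a throwaway
process and giving up after a timeout.  The following is only a minimal
sketch under that assumption: the module name nodes_info_guard and both
of its functions are hypothetical (they are not part of OTP), and the
sketch works around the hang rather than fixing net_kernel.

```erlang
-module(nodes_info_guard).
-export([call_with_timeout/2, nodes_info/1]).

%% Hypothetical helper, not part of OTP: run Fun in a monitored
%% throwaway process and wait at most TimeoutMs milliseconds for it.
call_with_timeout(Fun, TimeoutMs) ->
    %% The helper reports its result through its exit reason, so the
    %% caller only has to wait for a single 'DOWN' message.
    {Pid, MRef} = spawn_monitor(fun() -> exit({result, Fun()}) end),
    receive
        {'DOWN', MRef, process, Pid, {result, Res}} ->
            Res;
        {'DOWN', MRef, process, Pid, Reason} ->
            %% Fun crashed, or the helper died for another reason.
            {error, Reason}
    after TimeoutMs ->
        %% The helper is presumed stuck: kill it and drop any 'DOWN'
        %% message that may already be in our mailbox.
        exit(Pid, kill),
        erlang:demonitor(MRef, [flush]),
        {error, timeout}
    end.

%% Assumed usage: guard the call that was observed to hang.
nodes_info(TimeoutMs) ->
    call_with_timeout(fun() -> net_kernel:nodes_info() end, TimeoutMs).
```

A caller would then use nodes_info_guard:nodes_info(5000) in place of
net_kernel:nodes_info() and treat {error, timeout} as "status unknown".
The 5000 ms value is arbitrary.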

This backtrace comes from a Riak system that uses eleveldb, Basho's NIF
interface to the LevelDB embedded database.  Those NIFs can block for
long periods of time (60 seconds or more), which is certainly not a
nice thing for a NIF to do.  Until we work out some latency problems
with the LevelDB code, the blocking NIFs cause very erratic scheduling
of both Erlang processes and internal file descriptor polling.

-Scott