[erlang-questions] linked-in driver blocks net_kernel tick?

Fri Jan 21 11:16:18 CET 2011

All C calls block a scheduler unless they are pushed out to a separate thread.

In this case it's a bug in the zlib module (probably introduced by me): gzip
should chunk up the input before invoking the driver.

What happens is that all schedulers go to sleep because there is no work to do,
except the one invoking the driver. A ping is received and wakes up the
"distribution" process, which gets queued on the only scheduler that is awake,
but that scheduler is blocked in an "eternal" call. The pings never get
processed and the distribution times out.

You can wait for a patch or use the zlib API to chunk up the compression
yourself; see the implementation of gzip in the zlib module.
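
A minimal sketch of that workaround, not the eventual patch. It assumes the
input is a binary, and the module name chunked_gzip and the 64 KiB default
chunk size are made up for illustration. It uses only the documented zlib
calls (open/0, deflateInit/6, deflate/3, deflateEnd/1, close/1), with window
bits 16 + 15 to request a gzip header. Each driver call then handles one
chunk, so no single call should hold a scheduler for the whole input:

```erlang
-module(chunked_gzip).
-export([gzip/1, gzip/2]).

%% Hypothetical default; tune for your workload.
-define(DEFAULT_CHUNK, 65536).

%% Compress a binary to gzip format, feeding the driver one chunk at a time.
gzip(Data) ->
    gzip(Data, ?DEFAULT_CHUNK).

gzip(Data, ChunkSize)
  when is_binary(Data), is_integer(ChunkSize), ChunkSize > 0 ->
    Z = zlib:open(),
    try
        %% 16 + 15: maximum window size plus a gzip header and trailer.
        ok = zlib:deflateInit(Z, default, deflated, 16 + 15, 8, default),
        Compressed = deflate_chunks(Z, Data, ChunkSize, []),
        ok = zlib:deflateEnd(Z),
        iolist_to_binary(Compressed)
    after
        %% Always release the port, even if compression fails.
        zlib:close(Z)
    end.

%% More than one chunk left: compress the next chunk without flushing.
deflate_chunks(Z, Data, ChunkSize, Acc) when byte_size(Data) > ChunkSize ->
    <<Chunk:ChunkSize/binary, Rest/binary>> = Data,
    Out = zlib:deflate(Z, Chunk, none),
    deflate_chunks(Z, Rest, ChunkSize, [Out | Acc]);
%% Last (possibly empty) chunk: finish the stream.
deflate_chunks(Z, Data, _ChunkSize, Acc) ->
    Out = zlib:deflate(Z, Data, finish),
    lists:reverse([Out | Acc]).
```

With the shell session from the original report, chunked_gzip:gzip(Data) would
replace zlib:gzip(Data), and zlib:gunzip/1 should give back the original
binary. Whether this is enough to keep the tick alive depends on how long a
single chunk takes to compress.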

/Dan

On Fri, Jan 21, 2011 at 2:48 AM, Ryan Zezeski <rzezeski@REDACTED> wrote:
> So...can anyone explain to me why zlib:gzip/1 is causing the net_kernel tick
> to be blocked?  Do linked-in drivers block their scheduler like NIFs?  I'm
> really curious on this one :)
>
> -Ryan
>
> On Tue, Jan 18, 2011 at 6:53 PM, Ryan Zezeski <rzezeski@REDACTED> wrote:
>
>> Apologies, the example I copied was run on my mac.
>>
>> This is what I have on the actual production machine:
>>
>> Erlang R14A (erts-5.8) [source] [64-bit] [smp:16:16] [rq:16]
>> [async-threads:0] [hipe] [kernel-poll:false]
>>
>> To be certain, I ran the same example (except this time using two physical
>> machines) and achieved the same result.  Namely, the 'bar' node claims 'foo'
>> is not responding and thus closes the connection.  Whatever this is, I've
>> now easily reproduced it on two different OSs, with 2 different Erlang
>> versions.
>>
>> -Ryan
>>
>> On Tue, Jan 18, 2011 at 6:04 PM, Alain O'Dea <alain.odea@REDACTED> wrote:
>>
>>> On 2011-01-18, at 18:54, Ryan Zezeski <rzezeski@REDACTED> wrote:
>>>
>>> > Hi everyone,
>>> >
>>> > Some of you may remember my latest question where I was having weird
>>> > node timeout issues that I couldn't explain and I thought it might be
>>> > related to the messages I was passing between my nodes.  Well, I
>>> > pinpointed the problem to a call to zlib:gzip/1.  At first I was really
>>> > surprised by this, as such a harmless line of code surely should have
>>> > nothing to do with the ability of my nodes to communicate.  However, as
>>> > I dug further I realized gzip was implemented as a linked-in driver, and
>>> > I remember reading that one has to take care with them because they can
>>> > trash the VM.  I don't remember reading anything about them blocking
>>> > code, and even if they do I fail to see why my SMP-enabled node (16
>>> > cores) would allow this one thread to block the tick.  It occurred to me
>>> > that maybe the scheduler responsible for that process is the one blocked
>>> > by the driver.  Do processes have scheduler affinity?  That would make
>>> > sense, I guess.
>>> >
>>> > I've "fixed" this problem simply by using a plain port (i.e. run in its
>>> > own OS process).  For my purposes, this actually makes more sense in the
>>> > majority of the places I was making use of gzip.  Can someone enlighten
>>> > me as to exactly what is happening behind the scenes?
>>> >
>>> > To reproduce I create a random 1.3GB file:
>>> >
>>> > dd if=/dev/urandom of=rand bs=1048576 count=1365
>>> >
>>> > Then start two named nodes 'foo' and 'bar', connect them, read in the
>>> > file, and then compress said file.  Sometime later (I think around 60+
>>> > seconds) the node 'bar' will claim that 'foo' is not responding.
>>> >
>>> > [progski@REDACTED ~/tmp_code/node_timeout] erl -name foo
>>> > Erlang R14B (erts-5.8.1) [source] [64-bit] [smp:2:2] [rq:2]
>>>
>>> Your SMP node seems to be capped at smp:2:2 when it ought to be smp:16.
>>> Some resource limit may be holding back the system. That said, zlib should
>>> never cause this issue.
>>>
>>> > [async-threads:0] [hipe] [kernel-poll:false]
>>> >
>>> > Eshell V5.8.1  (abort with ^G)
>>> > (foo@REDACTED)1> net_adm:ping('bar@REDACTED').
>>> > pong
>>> > (foo@REDACTED)2> nodes().
>>> > ['bar@REDACTED']
>>> > (foo@REDACTED)3> {ok,Data} = file:read_file("rand").
>>> > {ok,<<103,5,115,210,177,147,53,45,250,182,51,32,250,233,
>>> >      39,253,102,61,73,242,18,159,45,185,232,80,33,...>>}
>>> > (foo@REDACTED)4> zlib:gzip(Data).
>>> > <<31,139,8,0,0,0,0,0,0,3,0,15,64,240,191,103,5,115,210,
>>> >  177,147,53,45,250,182,51,32,250,233,...>>
>>> > (foo@REDACTED)5>
>>> >
>>> >
>>> > [progski@REDACTED ~/tmp_code/node_timeout] erl -name bar
>>> > Erlang R14B (erts-5.8.1) [source] [64-bit] [smp:2:2] [rq:2]
>>> > [async-threads:0] [hipe] [kernel-poll:false]
>>> >
>>> > Eshell V5.8.1  (abort with ^G)
>>> > (bar@REDACTED)1> nodes().
>>> > ['foo@REDACTED']
>>> > (bar@REDACTED)2>
>>> > =ERROR REPORT==== 18-Jan-2011::17:16:10 ===
>>> > ** Node 'foo@REDACTED' not responding **
>>> > ** Removing (timedout) connection **
>>> >
>>> >
>>> > Thanks,
>>> >
>>> > -Ryan
>>>
>>
>>
>