[erlang-questions] linked-in driver blocks net_kernel tick?

Ryan Zezeski <>
Fri Jan 21 02:48:45 CET 2011


So...can anyone explain to me why zlib:gzip/1 is causing the net_kernel tick
to be blocked?  Do linked-in drivers block their scheduler like NIFs?  I'm
really curious about this one :)
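
For anyone hitting the same thing, here is a minimal sketch of one possible
mitigation, assuming the stall comes from zlib:gzip/1 doing all of its work
in a single long driver call.  It feeds the data to zlib's streaming API in
small chunks so that each call into the driver stays short.  The module name
gzip_chunked and the ChunkSize argument are made up for illustration, and
whether this actually avoids the tick timeout is untested.

```erlang
-module(gzip_chunked).
-export([gzip/2]).

%% Hypothetical helper: compress Bin into gzip format while handing zlib
%% at most ChunkSize bytes per driver call.
gzip(Bin, ChunkSize)
  when is_binary(Bin), is_integer(ChunkSize), ChunkSize > 0 ->
    Z = zlib:open(),
    try
        %% WindowBits = 16 + 15 asks zlib to emit a gzip header and trailer.
        ok = zlib:deflateInit(Z, default, deflated, 16 + 15, 8, default),
        Out = deflate_chunks(Z, Bin, ChunkSize, []),
        ok = zlib:deflateEnd(Z),
        iolist_to_binary(Out)
    after
        %% Always release the zlib port, even if a call above fails.
        zlib:close(Z)
    end.

%% Deflate one chunk at a time; the final (possibly empty) chunk is
%% flushed with 'finish' so the gzip trailer gets written.
deflate_chunks(Z, Bin, ChunkSize, Acc) when byte_size(Bin) > ChunkSize ->
    <<Chunk:ChunkSize/binary, Rest/binary>> = Bin,
    Out = zlib:deflate(Z, Chunk),
    deflate_chunks(Z, Rest, ChunkSize, [Acc, Out]);
deflate_chunks(Z, Bin, _ChunkSize, Acc) ->
    [Acc, zlib:deflate(Z, Bin, finish)].
```

As a sanity check, zlib:gunzip(gzip_chunked:gzip(Data, 65536)) should give
back Data.  The chunk size is a guess; smaller chunks mean shorter driver
calls but more calls overall.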

-Ryan

On Tue, Jan 18, 2011 at 6:53 PM, Ryan Zezeski <> wrote:

> Apologies, the example I copied was run on my Mac.
>
> This is what I have on the actual production machine:
>
> Erlang R14A (erts-5.8) [source] [64-bit] [smp:16:16] [rq:16]
> [async-threads:0] [hipe] [kernel-poll:false]
>
> To be certain, I ran the same example (except this time using two physical
> machines) and achieved the same result.  Namely, the 'bar' node claims 'foo'
> is not responding and thus closes the connection.  Whatever this is, I've
> now easily reproduced it on two different OSs, with 2 different Erlang
> versions.
>
> -Ryan
>
> On Tue, Jan 18, 2011 at 6:04 PM, Alain O'Dea <> wrote:
>
>> On 2011-01-18, at 18:54, Ryan Zezeski <> wrote:
>>
>> > Hi everyone,
>> >
>> > Some of you may remember my latest question where I was having weird
>> > node timeout issues that I couldn't explain and I thought it might be
>> > related to the messages I was passing between my nodes.  Well, I
>> > pinpointed the problem to a call to zlib:gzip/1.  At first I was really
>> > surprised by this, as such a harmless line of code surely should have
>> > nothing to do with the ability of my nodes to communicate.  However, as
>> > I dug further I realized gzip was implemented as a linked-in driver and
>> > I remember reading things about how one has to take care with them
>> > because they can trash the VM.  I don't remember reading anything about
>> > them blocking code, and even if they do I fail to see why my
>> > SMP-enabled node (16 cores) would allow this one thread to block the
>> > tick.  It occurred to me that maybe the scheduler responsible for that
>> > process is the one blocked by the driver.  Do processes have scheduler
>> > affinity?  That would make sense, I guess.
>> >
>> > I've "fixed" this problem simply by using a plain port (i.e. run in its
>> > own OS process).  For my purposes, this actually makes more sense in
>> > the majority of the places I was making use of gzip.  Can someone
>> > enlighten me as to exactly what is happening behind the scenes?
>> >
>> > To reproduce I create a random 1.3GB file:
>> >
>> > dd if=/dev/urandom of=rand bs=1048576 count=1365
>> >
>> > Then start two named nodes 'foo' and 'bar', connect them, read in the
>> > file, and then compress said file.  Sometime later (I think around 60+
>> > seconds) the node 'bar' will claim that 'foo' is not responding.
>> >
>> > [ ~/tmp_code/node_timeout] erl -name foo
>> > Erlang R14B (erts-5.8.1) [source] [64-bit] [smp:2:2] [rq:2]
>>
>> Your SMP node seems to be capped at smp:2:2 when it ought to be smp:16.
>> Some resource limit may be holding back the system.  That said, zlib
>> should never cause this issue.
>>
>> > [async-threads:0] [hipe] [kernel-poll:false]
>> >
>> > Eshell V5.8.1  (abort with ^G)
>> > ()1> net_adm:ping('').
>> > pong
>> > ()2> nodes().
>> > ['']
>> > ()3> {ok,Data} = file:read_file("rand").
>> > {ok,<<103,5,115,210,177,147,53,45,250,182,51,32,250,233,
>> >      39,253,102,61,73,242,18,159,45,185,232,80,33,...>>}
>> > ()4> zlib:gzip(Data).
>> > <<31,139,8,0,0,0,0,0,0,3,0,15,64,240,191,103,5,115,210,
>> >  177,147,53,45,250,182,51,32,250,233,...>>
>> > ()5>
>> >
>> >
>> > [ ~/tmp_code/node_timeout] erl -name bar
>> > Erlang R14B (erts-5.8.1) [source] [64-bit] [smp:2:2] [rq:2]
>> > [async-threads:0] [hipe] [kernel-poll:false]
>> >
>> > Eshell V5.8.1  (abort with ^G)
>> > ()1> nodes().
>> > ['']
>> > ()2>
>> > =ERROR REPORT==== 18-Jan-2011::17:16:10 ===
>> > ** Node '' not responding **
>> > ** Removing (timedout) connection **
>> >
>> >
>> > Thanks,
>> >
>> > -Ryan
>>
>
>
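
The quoted message mentions working around the problem with a plain port
running in its own OS process, but does not show how.  The following is only
a guess at what such a workaround could look like, assuming an external gzip
executable is on the PATH and the input is a file on disk.  The module name
gzip_port and the function gzip_file/1 are hypothetical.

```erlang
-module(gzip_port).
-export([gzip_file/1]).

%% Hypothetical helper: compress the file at Path by running the external
%% gzip program in its own OS process and collecting its stdout.
gzip_file(Path) ->
    case os:find_executable("gzip") of
        false ->
            {error, gzip_not_found};
        Gzip ->
            %% "gzip -c Path" writes the compressed stream to stdout.
            Port = open_port({spawn_executable, Gzip},
                             [{args, ["-c", Path]}, binary, stream,
                              exit_status, use_stdio]),
            collect(Port, [])
    end.

%% Accumulate output chunks until the external program exits.
collect(Port, Acc) ->
    receive
        {Port, {data, Bin}} ->
            collect(Port, [Bin | Acc]);
        {Port, {exit_status, 0}} ->
            {ok, iolist_to_binary(lists:reverse(Acc))};
        {Port, {exit_status, Status}} ->
            {error, {exit_status, Status}}
    end.
```

With the file from the reproduction steps, gzip_port:gzip_file("rand") would
return {ok, Compressed} on success.  The compression itself then happens
outside the VM, so the emulator only handles the output as it arrives.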

