linked-in driver blocks net_kernel tick?

Ryan Zezeski rzezeski@REDACTED
Tue Jan 18 23:24:31 CET 2011


Hi everyone,

Some of you may remember my last question, where I was having weird node
timeout issues that I couldn't explain and thought might be related to
the messages I was passing between my nodes.  Well, I pinpointed the problem
to a call to zlib:gzip/1.  At first I was really surprised by this, as such
a harmless line of code surely should have nothing to do with the ability
for my nodes to communicate.  However, as I dug further I realized gzip is
implemented as a linked-in driver, and I remember reading that one has to
take care with linked-in drivers because they can trash the VM.  I don't
remember reading anything about them blocking other code, and even if they
do, I fail to see why my SMP-enabled node (16 cores) would allow this one
thread to block the tick.  It occurred to me that maybe the scheduler
responsible for that process is the one blocked by the driver.  Do processes
have scheduler affinity?  That would make sense, I guess.

I've "fixed" this problem simply by using a plain port (i.e. run in its own
OS process).  For my purposes, this actually makes more sense in the
majority of the places I was making use of gzip.  Can someone enlighten me
as to exactly what is happening behind the scenes?
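
A minimal sketch of the plain-port approach, for illustration only.  The
module name gzip_port and the function gzip_file/1 are hypothetical, and the
sketch assumes a gzip executable on the PATH and that the data to compress
is already in a file, which sidesteps having to signal end of input over the
port.  The compression then runs in a separate OS process rather than inside
the VM.

```erlang
%% Hypothetical module, for illustration only.
-module(gzip_port).
-export([gzip_file/1]).

%% Compress the file at Path by running an external gzip program as a
%% plain port.  Assumes a gzip executable can be found on the PATH.
gzip_file(Path) ->
    case os:find_executable("gzip") of
        false ->
            {error, gzip_not_found};
        Gzip ->
            %% "-c" makes gzip write the compressed data to stdout,
            %% which arrives here as {Port, {data, Chunk}} messages.
            Port = open_port({spawn_executable, Gzip},
                             [{args, ["-c", Path]},
                              binary, stream, use_stdio, exit_status]),
            collect(Port, [])
    end.

%% Accumulate output chunks until the external process exits.
collect(Port, Acc) ->
    receive
        {Port, {data, Chunk}} ->
            collect(Port, [Chunk | Acc]);
        {Port, {exit_status, 0}} ->
            {ok, iolist_to_binary(lists:reverse(Acc))};
        {Port, {exit_status, Status}} ->
            {error, {gzip_exit_status, Status}}
    end.
```

Calling gzip_port:gzip_file("rand") would return {ok, Compressed} on
success; the calling process only waits in a receive while gzip does the
work.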

To reproduce, I create a random 1.3GB file:

dd if=/dev/urandom of=rand bs=1048576 count=1365

Then start two named nodes 'foo' and 'bar', connect them, read in the file,
and then compress said file.  Some time later (I think around 60+ seconds)
the node 'bar' will claim that 'foo' is not responding.

[progski@REDACTED ~/tmp_code/node_timeout] erl -name foo
Erlang R14B (erts-5.8.1) [source] [64-bit] [smp:2:2] [rq:2]
[async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.8.1  (abort with ^G)
(foo@REDACTED)1> net_adm:ping('bar@REDACTED').
pong
(foo@REDACTED)2> nodes().
['bar@REDACTED']
(foo@REDACTED)3> {ok,Data} = file:read_file("rand").
{ok,<<103,5,115,210,177,147,53,45,250,182,51,32,250,233,
      39,253,102,61,73,242,18,159,45,185,232,80,33,...>>}
(foo@REDACTED)4> zlib:gzip(Data).
<<31,139,8,0,0,0,0,0,0,3,0,15,64,240,191,103,5,115,210,
  177,147,53,45,250,182,51,32,250,233,...>>
(foo@REDACTED)5>


[progski@REDACTED ~/tmp_code/node_timeout] erl -name bar
Erlang R14B (erts-5.8.1) [source] [64-bit] [smp:2:2] [rq:2]
[async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.8.1  (abort with ^G)
(bar@REDACTED)1> nodes().
['foo@REDACTED']
(bar@REDACTED)2>
=ERROR REPORT==== 18-Jan-2011::17:16:10 ===
** Node 'foo@REDACTED' not responding **
** Removing (timedout) connection **


Thanks,

-Ryan