[erlang-questions] linked-in driver blocks net_kernel tick?

Alain O'Dea alain.odea@REDACTED
Wed Jan 19 00:04:03 CET 2011


On 2011-01-18, at 18:54, Ryan Zezeski <rzezeski@REDACTED> wrote:

> Hi everyone,
> 
> Some of you may remember my latest question where I was having weird node
> timeout issues that I couldn't explain and I thought it might be related to
> the messages I was passing between my nodes.  Well, I pinpointed the problem
> to a call to zlib:gzip/1.  At first I was really surprised by this, as such
> a harmless line of code surely should have nothing to do with the ability
> for my nodes to communicate.  However, as I dug further I realized gzip was
> implemented as a linked-in driver and I remember reading things about how
> one has to take care with them because they can trash the VM.  I
> don't remember reading anything about them blocking code, and even if they
> do I fail to see why my SMP enabled node (16 cores) would allow this one
> thread to block the tick.  It occurred to me that maybe the scheduler
> responsible for that process is the one blocked by the driver.  Do processes
> have scheduler affinity?  That would make sense, I guess.
> 
> I've "fixed" this problem simply by using a plain port (i.e. run in its own
> OS process).  For my purposes, this actually makes more sense in the
> majority of the places I was making use of gzip.  Can someone enlighten me
> as to exactly what is happening behind the scenes?
> 
> To reproduce I create a random 1.3GB file:
> 
> dd if=/dev/urandom of=rand bs=1048576 count=1365
> 
> Then start two named nodes 'foo' and 'bar', connect them, read in the file,
> and then compress said file.  Sometime later (I think around 60+ seconds)
> the node 'bar' will claim that 'foo' is not responding.
> 
> [progski@REDACTED ~/tmp_code/node_timeout] erl -name foo
> Erlang R14B (erts-5.8.1) [source] [64-bit] [smp:2:2] [rq:2]

Your SMP node seems to be capped at smp:2:2 when it ought to be smp:16:16 on a 16-core machine.  Some resource limit may be holding the system back.  That said, zlib should never cause this issue.
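
One way to confirm what the emulator detected is to ask it directly. The following is a minimal sketch; the module name sched_info is hypothetical and chosen only for illustration, while the erlang:system_info/1 items it queries are standard.

```erlang
-module(sched_info).
-export([report/0]).

%% Return what the emulator detected and how many scheduler
%% threads it created and currently uses.
report() ->
    [{logical_processors, erlang:system_info(logical_processors)},
     {schedulers,         erlang:system_info(schedulers)},
     {schedulers_online,  erlang:system_info(schedulers_online)}].
```

If logical_processors comes back lower than expected, the limit is probably imposed outside the VM. The scheduler count can also be requested explicitly at startup, for example with erl +S 16:16, although that does not lift an OS-level restriction.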

> [async-threads:0] [hipe] [kernel-poll:false]
> 
> Eshell V5.8.1  (abort with ^G)
> (foo@REDACTED)1> net_adm:ping('bar@REDACTED').
> pong
> (foo@REDACTED)2> nodes().
> ['bar@REDACTED']
> (foo@REDACTED)3> {ok,Data} = file:read_file("rand").
> {ok,<<103,5,115,210,177,147,53,45,250,182,51,32,250,233,
>      39,253,102,61,73,242,18,159,45,185,232,80,33,...>>}
> (foo@REDACTED)4> zlib:gzip(Data).
> <<31,139,8,0,0,0,0,0,0,3,0,15,64,240,191,103,5,115,210,
>  177,147,53,45,250,182,51,32,250,233,...>>
> (foo@REDACTED)5>
> 
> 
> [progski@REDACTED ~/tmp_code/node_timeout] erl -name bar
> Erlang R14B (erts-5.8.1) [source] [64-bit] [smp:2:2] [rq:2]
> [async-threads:0] [hipe] [kernel-poll:false]
> 
> Eshell V5.8.1  (abort with ^G)
> (bar@REDACTED)1> nodes().
> ['foo@REDACTED']
> (bar@REDACTED)2>
> =ERROR REPORT==== 18-Jan-2011::17:16:10 ===
> ** Node 'foo@REDACTED' not responding **
> ** Removing (timedout) connection **
> 
> 
> Thanks,
> 
> -Ryan
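
Besides moving the work to an external port, another option may be to feed the data to zlib in smaller pieces through its streaming API, so that no single driver call runs for very long. The following is a minimal sketch under that assumption. The module name chunked_gzip and the 1 MiB default chunk size are hypothetical choices for illustration, and whether this is enough to keep the net_kernel tick alive in this setup is untested.

```erlang
-module(chunked_gzip).
-export([gzip/1, gzip/2]).

%% Default chunk size (1 MiB); an arbitrary illustrative value.
-define(CHUNK, 1048576).

gzip(Data) ->
    gzip(Data, ?CHUNK).

%% Compress Data to gzip format, handing zlib at most ChunkSize
%% bytes per deflate call.
gzip(Data, ChunkSize)
  when is_binary(Data), is_integer(ChunkSize), ChunkSize > 0 ->
    Z = zlib:open(),
    try
        %% WindowBits = 16 + 15 requests a gzip header and trailer.
        ok = zlib:deflateInit(Z, default, deflated, 16 + 15, 8, default),
        Body = deflate_chunks(Z, Data, ChunkSize, []),
        %% Flush whatever zlib still buffers and write the trailer.
        Last = zlib:deflate(Z, <<>>, finish),
        ok = zlib:deflateEnd(Z),
        iolist_to_binary([Body, Last])
    after
        zlib:close(Z)
    end.

deflate_chunks(_Z, <<>>, _Size, Acc) ->
    lists:reverse(Acc);
deflate_chunks(Z, Data, Size, Acc) when byte_size(Data) > Size ->
    <<Chunk:Size/binary, Rest/binary>> = Data,
    deflate_chunks(Z, Rest, Size, [zlib:deflate(Z, Chunk) | Acc]);
deflate_chunks(Z, Data, _Size, Acc) ->
    lists:reverse([zlib:deflate(Z, Data) | Acc]).
```

In the session above, chunked_gzip:gzip(Data) would take the place of zlib:gzip(Data). The result is a regular gzip stream that zlib:gunzip/1 can read back.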

