[erlang-bugs] gen_tcp:send/2 gets stuck despite send_timeout

Holger Weiß holger@REDACTED
Tue Oct 28 13:46:35 CET 2014


Hi there,

I'm an ejabberd contributor, and we're currently facing the issue that
gen_tcp:send/2 occasionally blocks forever even though a 'send_timeout'
(and 'send_timeout_close') has been specified.¹  This seems to happen
only under rare circumstances, but when it happens, it can crash the VM:
the process stuck in the gen_tcp:send/2 call stops processing its
message queue, which then grows until the available memory is exhausted.
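
For reference, here is a minimal sketch of the kind of call involved. It
is not ejabberd's actual code: the module name, the loopback listener and
the 15000 ms timeout are assumptions made only so the example is
self-contained. With both options set, gen_tcp:send/2 is expected to
return {error, timeout} (and the socket to be closed) instead of blocking
indefinitely.

```erlang
-module(send_timeout_sketch).
-export([run/0]).

%% Hypothetical demo: open a loopback connection, set the two send
%% timeout options, and send a small payload.
run() ->
    Lo = {127,0,0,1},
    %% Port 0 asks the OS for a free ephemeral port.
    {ok, LSock} = gen_tcp:listen(0, [binary, {active, false}, {ip, Lo}]),
    {ok, Port} = inet:port(LSock),
    {ok, Sock} = gen_tcp:connect(Lo, Port,
                                 [binary, {active, false},
                                  {send_timeout, 15000},        % assumed value
                                  {send_timeout_close, true}]),
    {ok, Peer} = gen_tcp:accept(LSock),
    Result = case gen_tcp:send(Sock, <<"ping">>) of
                 ok -> ok;
                 %% Expected outcome when the peer stops reading and the
                 %% timeout expires; the socket is then closed for us.
                 {error, timeout} -> {error, timeout};
                 {error, Reason} -> {error, Reason}
             end,
    _ = [gen_tcp:close(S) || S <- [Sock, Peer, LSock]],
    Result.
```

The same options can also be set later on an existing socket with
inet:setopts/2.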

This *only* seems to happen when epoll(7) is used, i.e. when "+K true"
is specified on Linux.  "+K false" makes the issue go away.
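
To confirm which mode a running node is actually in, something like the
following can be evaluated on it. This is only a sketch; the module and
function names are made up for illustration, while
erlang:system_info(kernel_poll) is the standard call.

```erlang
-module(kpoll_check).
-export([report/0]).

%% Reports whether the emulator is using kernel poll (e.g. epoll on Linux).
report() ->
    case erlang:system_info(kernel_poll) of
        true  -> kernel_poll_enabled;
        false -> kernel_poll_disabled
    end.
```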

Also, it only happens when the TCP socket is no longer usable.  In the
past, an ejabberd process could call gen_tcp:send/2 even though an
earlier call had already returned an error.  Since we changed the
code to fix that, the issue is triggered less frequently; and in those
cases where it still *is* triggered, it's obvious from looking at the
details that the socket got closed more or less at the same time.
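
The idea of never sending again after a failed send can be sketched as
follows. This is an illustration under assumptions, not the actual
ejabberd change: the module name and the {open, Socket} | closed state
representation are hypothetical.

```erlang
-module(send_guard).
-export([send/2]).

%% State is either {open, Socket} or the atom closed (hypothetical).
%% Returns {Result, NewState}.
send(closed, _Data) ->
    %% An earlier send already failed: do not touch the socket again.
    {{error, closed}, closed};
send({open, Sock}, Data) ->
    case gen_tcp:send(Sock, Data) of
        ok ->
            {ok, {open, Sock}};
        {error, _} = Err ->
            %% Remember the failure so later calls are refused up front.
            _ = gen_tcp:close(Sock),
            {Err, closed}
    end.
```

A caller would thread the returned state through its own loop state, so
that once a send has failed, every later attempt short-circuits.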

The problem is that I'm not able to reproduce this myself.  So far,
we've only been made aware of this issue on two servers, both of them
running in production, and it's only easily reproducible on one of them.
That one is running Erlang 17.1 on a Xen instance (I guess I could ask
the admin to update to 17.3).

Without code to reproduce the issue, this is probably non-trivial to
debug :-(  At least there's one live system where the issue is usually
triggered multiple times per day.  Any suggestions on how to proceed?

Thanks, Holger

¹ According to process_info/1, the current function is prim_inet:send/3.
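
A sketch of how such a process might be inspected on a live node, using
process_info/2 to fetch only the relevant items. The module and function
names are hypothetical, and matching on prim_inet:send/3 is an assumption
based on the footnote above (the arity may differ in other OTP releases).

```erlang
-module(stuck_probe).
-export([probe/1, in_send/0]).

%% Returns the current function, queue length and memory use of Pid,
%% or undefined if the process is no longer alive.
probe(Pid) ->
    erlang:process_info(Pid, [current_function, message_queue_len, memory]).

%% Lists all local processes whose current function is prim_inet:send/3.
in_send() ->
    [P || P <- erlang:processes(),
          erlang:process_info(P, current_function)
              =:= {current_function, {prim_inet, send, 3}}].
```

A process that stays in the in_send/0 result across repeated calls while
its message_queue_len keeps growing would match the symptom described.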


