ssl_esock spinning out of control in poll()

Richard Andrews bflatmaj7th@REDACTED
Tue Aug 11 08:18:25 CEST 2009


On Wed, Aug 5, 2009 at 11:38 AM, Richard Andrews<bflatmaj7th@REDACTED> wrote:
> I have a problem with ssl_esock in R12B4 (Linux 32 bit). Symptoms:
>  1. esock consumes 100% CPU usage
>  2. poll() spinning constantly with events=POLLIN|POLLRDNORM,
> revents=POLLIN|POLLRDNORM for the affected SSL fd
>  3. No other syscalls between polls in strace
>  4. netstat shows the TCP Rx queue growing for the socket
>  5. No data messages received at the socket owning erlang process
> (corollary of 3.)
>
> I don't yet have a test case to trigger it but it seems to occur after
> the remote SSL peer sends a moderate sized block of data (eg. 2kB).
>
> Google didn't turn up anything that looked like what I'm seeing and I
> can't find anything in more recent OTP changelogs. Does anyone know of
> this bug and if there is a patch anywhere?

I have analysed this bug. There is a fault in the interaction between
esock_openssl.c and esock.c. The problem is triggered by bad SSL data
over the TCP socket. The trigger is the remote peer behaving badly, but
the local program then fails catastrophically, which is not
acceptable.

The openssl library correctly reports SSL_ERROR_SSL, but there is no
way to propagate this back up to the main loop. A return value < 0 is
taken to be a blocking artefact and is ignored under the assumption
that it will be rectified by a future read.
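
To make the distinction concrete, here is a minimal sketch (not the
esock code) of how the value returned by SSL_get_error() could be
sorted into "retry later" and "give up". The enum read_outcome and the
function classify_ssl_error() are hypothetical names made up for this
example. The SSL_ERROR_* values mirror the macros in <openssl/ssl.h>
and are repeated only so the fragment compiles without the OpenSSL
headers. Treating SSL_ERROR_SYSCALL as fatal is an assumption of the
sketch.

```c
/* Sketch only: not the esock code. The numeric values mirror the
 * SSL_ERROR_* macros in <openssl/ssl.h>; they are repeated here only so
 * the fragment compiles without the OpenSSL headers. */
#ifndef SSL_ERROR_SSL
#define SSL_ERROR_NONE        0
#define SSL_ERROR_SSL         1
#define SSL_ERROR_WANT_READ   2
#define SSL_ERROR_WANT_WRITE  3
#define SSL_ERROR_SYSCALL     5
#define SSL_ERROR_ZERO_RETURN 6
#endif

/* Hypothetical result type, not part of esock or OpenSSL. */
enum read_outcome { READ_OK, READ_RETRY, READ_CLOSED, READ_FATAL };

/* ssl_err is the value SSL_get_error(ssl, ret) would give after SSL_read(). */
enum read_outcome classify_ssl_error(int ssl_err)
{
    switch (ssl_err) {
    case SSL_ERROR_NONE:
        return READ_OK;
    case SSL_ERROR_WANT_READ:
    case SSL_ERROR_WANT_WRITE:
        /* Genuine blocking artefact: a later read may succeed. */
        return READ_RETRY;
    case SSL_ERROR_ZERO_RETURN:
        /* Peer closed the SSL connection cleanly. */
        return READ_CLOSED;
    case SSL_ERROR_SSL:
    case SSL_ERROR_SYSCALL:
    default:
        /* Assumed unrecoverable in this sketch: retrying will not help. */
        return READ_FATAL;
    }
}
```

The point is only that "SSL_read() returned < 0" covers both
READ_RETRY and READ_FATAL, and the caller needs to be told which one
it got.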

In this case there is a fatal SSL error which is unrecoverable. Calls
to SSL_read() return -1 without reading from the fd. The calling code
ignores this and goes around the loop again, calling poll(), which
returns immediately because there is still unread data in the TCP Rx
queue, and so on indefinitely.
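
The spinning itself is ordinary level-triggered poll() behaviour and
can be reproduced without any SSL at all. The sketch below uses a pipe
in place of the TCP socket (an assumption made to keep it
self-contained) and count_ready_polls() is a hypothetical helper name.
It writes one byte, never reads it, and counts how often poll()
reports the fd as readable.

```c
#include <poll.h>
#include <unistd.h>

/* Polls the read end of a pipe `rounds` times without ever reading from
 * it. Returns how many of those polls reported POLLIN, or -1 if the
 * pipe could not be set up. */
int count_ready_polls(int rounds)
{
    int fds[2];
    int ready = 0;

    if (pipe(fds) != 0)
        return -1;
    if (write(fds[1], "x", 1) != 1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    for (int i = 0; i < rounds; i++) {
        struct pollfd pfd = { .fd = fds[0], .events = POLLIN, .revents = 0 };
        /* Timeout 0: return at once. Nothing consumes the pending byte,
         * so the fd stays readable on every pass. */
        if (poll(&pfd, 1, 0) == 1 && (pfd.revents & POLLIN))
            ready++;
    }

    close(fds[0]);
    close(fds[1]);
    return ready;
}
```

Every round reports the fd as ready, which matches the strace output:
as long as nobody drains the queue, poll() never blocks.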

I think what needs to happen is that cp->eof or cp->bp needs to be set
in response to SSL_ERROR_SSL so that the socket can be cleaned up
gracefully. The code comments don't provide enough guidance about
which I should set for this case.
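
Roughly what I have in mind, as a sketch rather than a patch: the
struct below is a hypothetical stand-in for the esock connection, and
only the field names eof and bp come from the real code. Their meaning
here is assumed. mark_fatal() and build_poll_set() are made-up names,
and setting bp rather than eof is an arbitrary choice, since that is
exactly the open question.

```c
#include <poll.h>
#include <stddef.h>

/* Hypothetical stand-in for an esock connection. Only the field names
 * eof and bp are taken from the post; their semantics are assumed. */
struct conn_sketch {
    int fd;
    int eof;  /* assumed: no more data will arrive */
    int bp;   /* assumed: connection is broken */
};

/* Called when the SSL layer reports an unrecoverable error.
 * Choosing bp over eof is arbitrary in this sketch. */
void mark_fatal(struct conn_sketch *cp)
{
    cp->bp = 1;
}

/* Fills pfds with one entry per live connection, skipping any that are
 * flagged for teardown. Returns the number of entries written. */
size_t build_poll_set(const struct conn_sketch *conns, size_t n,
                      struct pollfd *pfds)
{
    size_t used = 0;

    for (size_t i = 0; i < n; i++) {
        if (conns[i].eof || conns[i].bp)
            continue;  /* flagged: do not poll this fd again */
        pfds[used].fd = conns[i].fd;
        pfds[used].events = POLLIN;
        pfds[used].revents = 0;
        used++;
    }
    return used;
}
```

Once the connection is flagged, its fd no longer goes into the poll
set, so the unread data in the Rx queue can no longer wake the loop
and the normal cleanup path can close the socket.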

I'm hoping that someone familiar with the code can help me develop a
patch which works the "right" way. Otherwise I'll just wing it.

--
  Rich

