[erlang-bugs] Race condition in TLS distribution
Magnus Henoch
magnus@REDACTED
Mon Oct 19 20:04:57 CEST 2015
Hi all,
I'm trying to use Erlang distribution over TLS ("-proto_dist
inet_tls"), and I've stumbled upon an interesting race condition.
The kernel supervisor starts the distribution subsystem before it
starts the code server. Therefore, it's possible for another node
to establish a connection to the distribution port while the code
server is not yet running. (Apologies for not providing a recipe
for reproducing this; I could work on that if that would be
useful.)
In that case, the TLS distribution module eventually calls
ssl:ssl_accept/2 on the connection socket. This in turn will
eventually call crypto:supports/0. That's when I got this error:
{error_logger,{{2015,10,19},{15,1,22}},
 supervisor_report,
 [{supervisor,{local,ssl_dist_sup}},
  {errorContext,child_terminated},
  {reason,{undef,[{crypto,supports,[],[]},
                  {tls_record,supported_protocol_versions,1,
                   [{file,"tls_record.erl"},{line,322}]},
                  {tls_record,supported_protocol_versions,0,
                   [{file,"tls_record.erl"},{line,257}]},
                  {ssl,handle_options,1,[{file,"ssl.erl"},{line,617}]},
                  {ssl,ssl_accept,3,[{file,"ssl.erl"},{line,228}]},
                  {ssl_tls_dist_proxy,accept_loop,4,
                   [{file,"ssl_tls_dist_proxy.erl"},{line,152}]}]}},
  {offender,[{pid,<0.22.0>},
             {name,ssl_tls_dist_proxy},
             {mfargs,{ssl_tls_dist_proxy,start_link,[]}},
             {restart_type,permanent},
             {shutdown,4000},
             {child_type,worker}]}]}
(In the actual output it was formatted as one long line by the
kernel's primitive error reporter.)
Why is that function undefined, you ask? That's because the
crypto module has an on_load function, which calls code:priv_dir/1
to figure out where the NIF library is. Since the code server
isn't running yet, code:priv_dir/1 raises an exception, and as I
just learnt from reading the documentation, if an on_load function
raises an exception (or returns anything but 'ok'), the module is
unloaded - and thus we get an 'undef' error.
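
The on_load behaviour can be seen in isolation with a small module.
This is only an illustrative sketch: the module name onload_demo and
the simulated failure are made up for the example, and it does not
reproduce what crypto's own on_load function does.

```erlang
-module(onload_demo).
-export([hello/0]).
-on_load(init/0).

%% Runs when the module is loaded. Raising here stands in for the
%% failing code:priv_dir/1 call described above (hypothetical failure).
init() ->
    erlang:error(simulated_on_load_failure).

%% Never reachable from outside: the failed on_load causes the
%% module to be unloaded again, so callers see 'undef'.
hello() ->
    ok.
```

With the compiled module in the code path, calling
onload_demo:hello() from the shell should fail with an 'undef'
error, even though hello/0 is exported.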
(This will make the ssl_tls_dist_proxy process terminate. Its
supervisor will restart it, but that doesn't help: it has lost its
listening socket, and net_kernel won't ask it to open another one,
rendering the node "alive" but unable to receive connections for
distribution - but that's a separate issue.)
I came up with the attached patch, which waits for the code server
to start before proceeding, and that fixes the problem for me.
What do you think about it? Might there be a better way to solve
this?
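
For readers who cannot fetch the attachment, the general idea of
waiting for the code server could be sketched roughly as below. This
is not the attached patch: the module name, the polling interval and
the retry limit are assumptions made for illustration. It relies only
on the code server being registered under the name code_server.

```erlang
-module(wait_for_code_server).
-export([wait/1]).

%% Poll until a process is registered as code_server, giving up
%% after Retries attempts (hypothetical helper, not from the patch).
wait(0) ->
    {error, timeout};
wait(Retries) when is_integer(Retries), Retries > 0 ->
    case whereis(code_server) of
        undefined ->
            %% Plain receive/after instead of timer:sleep/1, so that
            %% no extra module has to be loaded this early in boot.
            receive after 100 -> ok end,
            wait(Retries - 1);
        Pid when is_pid(Pid) ->
            ok
    end.
```

A caller would run something like wait_for_code_server:wait(50)
before calling into ssl, and handle {error, timeout} as it sees fit.
Note that this only checks that the process is registered, not that
it has finished initialising.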
Regards,
Magnus
-------------- next part --------------
A non-text attachment was scrubbed...
Name: wait-for-code-server.patch
Type: text/x-patch
Size: 2024 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-bugs/attachments/20151019/fb7e3caf/attachment.bin>