[erlang-bugs] Race condition in TLS distribution

Magnus Henoch magnus@REDACTED
Mon Oct 19 20:04:57 CEST 2015


Hi all,

I'm trying to use Erlang distribution over TLS ("-proto_dist 
inet_tls"), and I've stumbled upon an interesting race condition.

The kernel supervisor starts the distribution subsystem before it 
starts the code server.  Therefore, it's possible for another node 
to establish a connection to the distribution port while the code 
server is not yet running.  (Apologies for not providing a recipe 
for reproducing this; I could work on that if that would be 
useful.)

In that case, the TLS distribution module eventually calls 
ssl:ssl_accept/2 on the connection socket.  This in turn will 
eventually call crypto:supports/0.  That's when I got this error:

{error_logger,{{2015,10,19},{15,1,22}},
 supervisor_report,
 [{supervisor,{local,ssl_dist_sup}},
  {errorContext,child_terminated},
  {reason,{undef,[{crypto,supports,[],[]},
                  {tls_record,supported_protocol_versions,1,[{file,"tls_record.erl"},{line,322}]},
                  {tls_record,supported_protocol_versions,0,[{file,"tls_record.erl"},{line,257}]},
                  {ssl,handle_options,1,[{file,"ssl.erl"},{line,617}]},
                  {ssl,ssl_accept,3,[{file,"ssl.erl"},{line,228}]},
                  {ssl_tls_dist_proxy,accept_loop,4,[{file,"ssl_tls_dist_proxy.erl"},{line,152}]}]}},
  {offender,[{pid,<0.22.0>},
             {name,ssl_tls_dist_proxy},
             {mfargs,{ssl_tls_dist_proxy,start_link,[]}},
             {restart_type,permanent},
             {shutdown,4000},
             {child_type,worker}]}]}

(The actual output was formatted as one long line, since it was 
written by the kernel's primitive error reporter.)

Why is that function undefined, you ask?  That's because the 
crypto module has an on_load function, which calls code:priv_dir/1 
to figure out where the NIF library is.  Since the code server 
isn't running yet, code:priv_dir/1 raises an exception.  As I 
just learnt from reading the documentation, if an on_load function 
raises an exception (or returns anything but 'ok'), the module is 
unloaded, so the call that triggered the load fails with 'undef'.
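
As a minimal illustration of that on_load behaviour: the module 
below is made up for this example (onload_demo, init/0 and hello/0 
are hypothetical names), and the failure is simulated with an 
explicit error rather than caused by a missing code server.

```erlang
-module(onload_demo).
-export([hello/0]).
-on_load(init/0).

%% Hypothetical on_load function.  It stands in for crypto's on_load,
%% which calls code:priv_dir/1; here the failure is simulated directly.
init() ->
    erlang:error(simulated_priv_dir_failure).

%% Exported and perfectly valid, but never reachable: the module is
%% unloaded again because init/0 did not return 'ok'.
hello() ->
    ok.
```

On a normally started node, calling onload_demo:hello() should fail 
with 'undef' even though hello/0 is exported, which is the same 
confusing symptom as in the report above.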

(This will make the ssl_tls_dist_proxy process terminate.  Its 
supervisor will restart it, but that doesn't help: it has lost its 
listening socket, and net_kernel won't ask it to open another one, 
rendering the node "alive" but unable to receive connections for 
distribution - but that's a separate issue.)

I came up with the attached patch, which waits for the code server 
to start before proceeding, and that fixes the problem for me. 
What do you think about it?  Might there be a better way to solve 
this?
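
The attachment is not reproduced in this archive.  As a rough 
sketch of the general idea only, assuming the wait is done by 
polling for the registered name code_server (an assumption, not 
necessarily what the patch does; the module name wait_demo and the 
10 ms interval are likewise made up for illustration):

```erlang
-module(wait_demo).
-export([wait_for_code_server/0, wait_for_registered/1]).

%% Block until the code server is registered.  Would be called before
%% anything that may trigger loading of the crypto module.
wait_for_code_server() ->
    wait_for_registered(code_server).

%% Poll until some process (or port) is registered under Name.
%% A bare 'receive ... after' is used for the delay so that no
%% additional module has to be loaded while waiting.
wait_for_registered(Name) when is_atom(Name) ->
    case erlang:whereis(Name) of
        undefined ->
            receive after 10 -> ok end,
            wait_for_registered(Name);
        _PidOrPort ->
            ok
    end.
```

Polling is crude but has no dependencies of its own, which matters 
this early in the boot sequence.  Note that this loop never gives 
up; a real fix might want an upper bound on the wait.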

Regards,
Magnus

-------------- next part --------------
A non-text attachment was scrubbed...
Name: wait-for-code-server.patch
Type: text/x-patch
Size: 2024 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-bugs/attachments/20151019/fb7e3caf/attachment.bin>

