[erlang-bugs] Race condition in TLS distribution

Magnus Henoch magnus@REDACTED
Tue Oct 20 18:51:02 CEST 2015


I just noticed an alternative solution in the example uds_server.erl. It
jumps through a number of hoops to get the priv directory even if the code
server is not running:

https://github.com/erlang/otp/blob/maint/lib/kernel/examples/uds_dist/src/uds_server.erl#L107-L148

Perhaps the crypto module could do something similar in its on_load
function.
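
As a rough illustration of that idea (not the code from uds_server.erl
and not a proposed crypto patch), the fallback could look something like
the sketch below. The module name priv_dir_sketch and the helper
priv_dir/2 are hypothetical. It assumes erl_prim_loader:get_path/0 and
erl_prim_loader:read_file_info/1 behave as they did in OTP at the time
of this thread, and that the module's .beam file sits in an "ebin"
directory next to "priv".

```erlang
-module(priv_dir_sketch).
-export([priv_dir/2]).

%% Hypothetical helper, not an OTP API.
%% App is the application name and Mod is a module known to live in
%% that application's ebin directory, e.g. priv_dir(crypto, crypto).
-spec priv_dir(atom(), module()) -> file:filename() | {error, not_found}.
priv_dir(App, Mod) ->
    try code:priv_dir(App) of
        Dir when is_list(Dir) -> Dir;
        {error, bad_name} -> search_loader_path(Mod)
    catch
        %% The call fails if the code server is not running yet.
        _:_ -> search_loader_path(Mod)
    end.

%% Fall back to the primitive loader, which is available before the
%% code server has started.
search_loader_path(Mod) ->
    {ok, Path} = erl_prim_loader:get_path(),
    Beam = atom_to_list(Mod) ++ ".beam",
    find_in_path(Path, Beam).

find_in_path([], _Beam) ->
    {error, not_found};
find_in_path([Dir | Rest], Beam) ->
    case erl_prim_loader:read_file_info(filename:join(Dir, Beam)) of
        {ok, _Info} ->
            %% Dir is assumed to be ".../App-Vsn/ebin", so the priv
            %% directory is its sibling.
            filename:join(filename:dirname(Dir), "priv");
        error ->
            find_in_path(Rest, Beam)
    end.
```

The search only runs when code:priv_dir/1 fails, so the normal case is
unchanged once the code server is up.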

Regards,
Magnus


On Mon, Oct 19, 2015 at 7:04 PM, Magnus Henoch <magnus@REDACTED>
wrote:

> Hi all,
>
> I'm trying to use Erlang distribution over TLS ("-proto_dist inet_tls"),
> and I've stumbled upon an interesting race condition.
>
> The kernel supervisor starts the distribution subsystem before it starts
> the code server.  Therefore, it's possible for another node to establish a
> connection to the distribution port while the code server is not yet
> running.  (Apologies for not providing a recipe for reproducing this; I
> could work on that if that would be useful.)
>
> In that case, the TLS distribution module eventually calls
> ssl:ssl_accept/2 on the connection socket.  This in turn will eventually
> call crypto:supports/0.  That's when I got this error:
>
> {error_logger,{{2015,10,19},{15,1,22}},
> supervisor_report,
> [{supervisor,{local,ssl_dist_sup}},
>  {errorContext,child_terminated},
>  {reason,{undef,[{crypto,supports,[],[]},
>                  {tls_record,supported_protocol_versions,1,
>                   [{file,"tls_record.erl"},{line,322}]},
>                  {tls_record,supported_protocol_versions,0,
>                   [{file,"tls_record.erl"},{line,257}]},
>                  {ssl,handle_options,1,[{file,"ssl.erl"},{line,617}]},
>                  {ssl,ssl_accept,3,[{file,"ssl.erl"},{line,228}]},
>                  {ssl_tls_dist_proxy,accept_loop,4,
>                   [{file,"ssl_tls_dist_proxy.erl"},{line,152}]}]}},
>  {offender,[{pid,<0.22.0>},
>             {name,ssl_tls_dist_proxy},
>             {mfargs,{ssl_tls_dist_proxy,start_link,[]}},
>             {restart_type,permanent},
>             {shutdown,4000},
>             {child_type,worker}]}]}
>
> (though it was formatted as one long line, using the kernel's primitive
> error reporter.)
>
> Why is that function undefined, you ask?  That's because the crypto module
> has an on_load function, which calls code:priv_dir/1 to figure out where
> the NIF library is.  Since the code server isn't running yet,
> code:priv_dir/1 raises an exception, and as I just learnt from reading the
> documentation, if an on_load function raises an exception (or returns
> anything but 'ok'), the module is unloaded - and thus we get an 'undef'
> error.
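
The on_load behaviour described above can be seen in isolation with a
small module whose on_load function does not return 'ok'. The module
name onload_demo and the reason term are made up for this sketch; the
failing init/0 merely stands in for the exception raised by
code:priv_dir/1.

```erlang
-module(onload_demo).
-export([hello/0]).
-on_load(init/0).

%% Stand-in for an on_load function that cannot find its NIF library.
%% Any return value other than 'ok' makes the load fail.
init() ->
    {error, simulated_failure}.

%% Never reachable: the module is unloaded again when init/0 fails.
hello() ->
    world.
```

After compiling this, calling onload_demo:hello() should fail with an
undef error rather than return 'world', because the module does not
stay loaded.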
>
> (This will make the ssl_tls_dist_proxy process terminate.  Its supervisor
> will restart it, but that doesn't help: it has lost its listening socket,
> and net_kernel won't ask it to open another one, rendering the node "alive"
> but unable to receive connections for distribution - but that's a separate
> issue.)
>
> I came up with the attached patch, which waits for the code server to
> start before proceeding, and that fixes the problem for me. What do you
> think about it?  Might there be a better way to solve this?
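
The attached patch is not preserved in this archive. Purely as an
assumption about what "waiting for the code server" could look like, a
polling loop might be sketched as follows. The module name, the
function wait/1 and the 100 ms interval are all hypothetical; the only
OTP detail relied on is that the code server registers itself under the
name code_server.

```erlang
-module(wait_code_server_sketch).
-export([wait/1]).

%% Hypothetical helper: poll until a process is registered as
%% code_server, pausing 100 ms between attempts.
%% Returns ok once it is registered, or {error, timeout} when the
%% retries are used up.
-spec wait(non_neg_integer()) -> ok | {error, timeout}.
wait(Retries) ->
    case whereis(code_server) of
        Pid when is_pid(Pid) ->
            ok;
        undefined when Retries > 0 ->
            %% Plain receive timeout, so no other module is needed.
            receive after 100 -> ok end,
            wait(Retries - 1);
        undefined ->
            {error, timeout}
    end.
```

A registered name only shows that the process exists, not that it has
finished initialising, so this remains a sketch and not a complete fix.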
>
> Regards,
> Magnus
>
>