[erlang-questions] SSL distribution issues

Mon Jan 16 10:15:02 CET 2012

Hi Paul!

2012/1/14, Paul Guyot <pguyot@REDACTED>:
> Hello,
>
> Is anyone successfully using SSL distribution on production servers?
> http://www.erlang.org/documentation/doc-5.9/lib/ssl-5.0/doc/html/ssl_distribution.html

I do not know of anyone. I have only heard of experimental use of
distribution over the old ssl implementation, and the new one is
just out, so I think you are early adopters.

> While running a couple of nodes works properly on a development machine, we
> have serious issues on a real production cluster.
> Our nodes ping other nodes very early, before our applications are started.

While Erlang distribution over the new ssl is better tested than
distribution over the old ssl implementation ever was, the tests are
still fairly basic, and we plan to implement more tests in the near
future. The main goal for R15 was to be good enough to let us get rid
of the old ssl.

> We observed two serious issues:
> - pinging another node randomly blocks indefinitely, whether the other node
> is pingable or not (e.g. not over SSL or with a different cookie) ;

There is one blocking problem that I know of, which can happen when
a non-ssl node tries to contact an ssl node. If the node name is
fairly short, the first message can be seen as the beginning of a
correct "ssl/tls" packet. The ssl node then waits for more data that
never comes, while the other end waits for the response to its
first message. This can be fixed fairly easily by adding a check
on the value of the first byte for handshake messages. We will
be adding this for the next release. This has, however, not
interfered with the legitimate nodes in our tests.
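
Such a check could be sketched roughly as follows. This is only an
illustration, not the code that will go into ssl: the module name
first_byte_check and the function classify/1 are made up, and it
assumes plain TLS record framing, where a handshake record starts
with content type 22. An SSL 2.0 compatible hello would need its own
clause.

```erlang
-module(first_byte_check).
-export([classify/1]).

%% TLS record content type for handshake messages (22 in the TLS RFCs).
-define(TLS_HANDSHAKE, 22).

%% Hypothetical helper: look only at the first byte of the data
%% received on a new connection and decide whether it can be the
%% start of a TLS handshake record.
classify(<<?TLS_HANDSHAKE, _/binary>>) ->
    tls_handshake;
classify(<<_Other, _/binary>>) ->
    %% Anything else (e.g. a plain distribution handshake) can be
    %% rejected at once instead of waiting for more data.
    not_tls;
classify(<<>>) ->
    need_more_data.
```

With this, a connection whose first byte is not 22 can be closed
immediately instead of leaving both ends waiting.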

> - after a while (after pings timeout), ssl_tls_dist_proxy just crashes.
>
> =ERROR REPORT==== 2012-01-13 16:48:58 ===
> ** Generic server ssl_tls_dist_proxy terminating
> ** Last message in was {connect,IP,25669}	<-- this is another SSL node with the same cookie
> ** When Server state == {state,{#Port<0.284>,#Port<0.285>},
>                                {<0.24.0>,<0.25.0>}}
> ** Reason for termination ==
> ** {{badmatch,{error,badarg}},
>     [{ssl_tls_dist_proxy,handle_call,3,
>                          [{file,"ssl_tls_dist_proxy.erl"},{line,90}]},
>      {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
>
> The relevant code is the following:
>
> handle_call({connect, Ip, Port}, {From, _}, State) ->
>     Me = self(),
>     Pid = spawn_link(fun() -> setup_proxy(Ip, Port, Me) end),
>     receive
> 	{Pid, go_ahead, LPort} ->
> 	    Res = {ok, Socket} = try_connect(LPort),
> 	    ok = gen_tcp:controlling_process(Socket, From),		<---- line 90
> 	    flush_old_controller(From, Socket),
> 	    {reply, Res, State};
> 	{Pid, Error} ->
> 	    {reply, Error, State}
>     end;
>
> The crash happens because From is no longer alive.
>
> For the record, this is master branch and the SSL parameters are the
> following :
>
> 	-proto_dist inet_tls
> 	-ssl_dist_opt
> 		server_certfile /otp_root/ssl/${NODE_NAME}.pem
> 		client_certfile /otp_root/ssl/${NODE_NAME}.pem
> 		server_secure_renegotiate true
> 		client_secure_renegotiate true
> 		server_verify verify_peer
> 		client_verify verify_peer
> 		server_fail_if_no_peer_cert true
> 		server_cacertfile /otp_root/ssl/ca.pem
> 		client_cacertfile /otp_root/ssl/ca.pem
> 		server_depth 2
> 		client_depth 2
>
> Did we miss something obvious?

I am not sure. I (we) will look into this; it is not supposed to just
crash. I have not seen this before.
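
For illustration, the hand-over at line 90 could be made defensive
along the lines of the sketch below. It is not a committed fix: the
module safe_handover and the function hand_over/2 are hypothetical
names, and closing the socket on failure is an assumption about the
desired behaviour. It only shows how an {error, badarg} from
gen_tcp:controlling_process/2 (as in your report, when From is no
longer alive) can be returned instead of crashing on a badmatch.

```erlang
-module(safe_handover).
-export([hand_over/2]).

%% Hypothetical helper: transfer ownership of Socket to Pid.
%% If the transfer fails (for example because Pid has already
%% exited), close the socket and return the error to the caller
%% instead of crashing the calling process with a badmatch.
hand_over(Socket, Pid) ->
    case gen_tcp:controlling_process(Socket, Pid) of
        ok ->
            {ok, Socket};
        {error, _Reason} = Error ->
            gen_tcp:close(Socket),
            Error
    end.
```

In handle_call the result could then be passed on as the reply, so
that the proxy process itself stays alive.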

Regards Ingela Erlang/OTP team - Ericsson AB