[erlang-questions] Errors in SSL handshake (weird client)

Danil Zagoskin z@REDACTED
Fri Apr 18 18:41:34 CEST 2014


>    exception exit: {function_clause,
>                     [{tls_connection,handle_tls_handshake,

A script to reproduce this crash and a patch to handle it gently (without
leaking internal state full of secrets to user code) can be found at
https://gist.github.com/stolen/11053230


2014-04-17 22:04 GMT+04:00 Danil Zagoskin <z@REDACTED>:

> Hi!
>
> Thank you for the patches!
>
> Meanwhile we have deployed suite selection patch and now there is
> something new in crash logs:
> =CRASH REPORT==== 17-Apr-2014::20:32:24 ===
>   crasher:
>     initial call: tls_connection:init/1
>     pid: <0.30376.19>
>     registered_name: []
>     exception exit: {function_clause,
>                      [{tls_connection,handle_tls_handshake,
>                        [#Fun<tls_connection.0.100316242>,abbreviated,
>                         #state{role = server,
>                              %% LOTS OF STUFF SKIPPED %%
>                              connection_states = #connection_states{
>                                       current_read = %% cipher_suite =
> <<0,0>> %%
>                                       pending_read = %% cipher_suite
> = <<0,51>> %%
>                              }
>                              protocol_buffers = #protocol_buffers{
>                                      tls_packets = [],
>                                      tls_handshake_buffer =
> <<92,64,37,228,209,208,235,232,47,69,57,232,90,106, ..... >>
>                                      %% SKIPPED %%},
>                              tls_handshake_history = {
>                                      [
>
>  [20,<<0,0,12>>,<<111,40,244,7,137,224,16,109,197,110,249,152>>],
>
>  [2,<<0,0,70>>,<<3,2,83,80,2,23,175,124,25,138,178,243,166,22,
>
>  150,239,220,170,215,108,54,169,67,79,124,183,
>
>  189,89,35,239,244,87,102,19,32,11,15,20,173,67,
>
>  209,70,180,252,109,228,108,60,203,91,126,77,12,
>
>  231,203,43,221,11,234,147,28,231,91,18,49,228,
>                                                          70,0,51,0>>],
>
>  <<1,0,0,141,3,2,83,80,4,31,206,23,107,223,241,244,
>
>  164,4,198,176,44,155,63,58,157,103,95,44,198,83,
>
>  237,110,165,99,242,190,122,83,32,11,15,20,173,
>
>  67,209,70,180,252,109,228,108,60,203,91,126,77,
>
>  12,231,203,43,221,11,234,147,28,231,91,18,49,
>
>  228,70,0,36,0,51,0,69,0,57,0,136,0,22,0,50,0,68,
>
>  0,56,0,135,0,19,0,102,0,47,0,65,0,53,0,132,0,10,
>
>  0,5,0,4,1,0,0,32,0,9,0,3,2,0,1,0,0,0,21,0,19,0,
>
>  0,16,119,101,98,100,97,118,46,121,97,110,100,
>
>  101,120,46,114,117>>
>                                     ],
>                                     [..........]
>                              },
>                              %% LOTS OF STUFF SKIPPED %%
>                          }],
>                        [{file,"tls_connection.erl"},{line,735}]},
>                       {tls_connection,next_state,4,
>                        [{file,"tls_connection.erl"},{line,458}]},
>
> {gen_fsm,handle_msg,7,[{file,"gen_fsm.erl"},{line,505}]},
>                       {proc_lib,init_p_do_apply,3,
>                        [{file,"proc_lib.erl"},{line,239}]}]}
>       in function  gen_fsm:terminate/7 (gen_fsm.erl, line 622)
>
> Could that client work with other server implementation?
> tls_handshake_buffer seems like a garbage (but I'm not sure), so maybe
> here's no valid client lost.
>
> Should server tolerate such conditions? Should an alert be thrown
> when tls_handshake:get_tls_handshake returns {[], _}?
>
>
> Another question about crashes:
> Maybe tls_connection is a good place for format_status/2? It may be a good
> idea to hide secrets when printing state (otherwise a person which has
> access to error logs seems to be able to restore the private key or
> decipher dumped traffic).
>
>
>
> 2014-04-17 19:04 GMT+04:00 Ingela Andin <ingela.andin@REDACTED>:
>
> Hi!
>>
>> I pushed some more patches to my gitrep on the branches:
>>
>> ia/ssl/recv/OTP-11878  (error handling of  recv on an active socket)
>>
>> ia/ssl/decrypt-alert   (correct decryption alert handling)
>>
>> Regards Ingela Erlang/OTP team - Ericsson AB
>>
>>
>>
>>
>> 2014-04-17 11:19 GMT+02:00 Ingela Andin <ingela.andin@REDACTED>:
>>
>> Hi Danil!
>>>
>>>
>>> 2014-04-17 8:52 GMT+02:00 Danil Zagoskin <z@REDACTED>:
>>>
>>> Hi, Ingela!
>>>>
>>>> The patch works like a charm, thank you! It completely fixes the
>>>> initial problem with buggy client.
>>>>
>>>>
>>> One bug down!
>>>
>>>
>>>
>>>>
>>>> But there are some sad news, too:
>>>>
>>>>
>>> Well reproducible is always good, means we can haunt it down and fix it
>>> :)
>>>
>>>
>>>
>>>> While playing with my dummy server (
>>>> https://github.com/stolen/ssldump/blob/master/src/ssldump.erl#L70) I
>>>> accidentally reproduced badarg in erlang:size once more:
>>>>
>>>> =ERROR REPORT==== 17-Apr-2014::07:34:58 ===
>>>> ** State machine <0.58.0> terminating
>>>> ** Last message in was {tcp,#Port<0.1498>,
>>>>
>>>> [22,3,1,0,173,1,0,0,169,3,3,83,79,75,226,78,81,8,
>>>>
>>>>  77,79,97,154,127,165,92,183,20,239,70,183,158,
>>>>
>>>>  193,222,139,77,116,13,128,12,66,120,63,110,0,0,
>>>>
>>>>  94,0,255,192,36,192,35,192,10,192,9,192,7,192,8,
>>>>
>>>>  192,40,192,39,192,20,192,19,192,17,192,18,192,38,
>>>>
>>>>  192,37,192,42,192,41,192,5,192,4,192,2,192,3,192,
>>>>
>>>>  15,192,14,192,12,192,13,0,61,0,60,0,47,0,5,0,4,0,
>>>>
>>>>  53,0,10,0,103,0,107,0,51,0,57,0,22,0,175,0,174,0,
>>>>
>>>>  141,0,140,0,138,0,139,0,177,0,176,0,44,0,59,1,0,
>>>>
>>>>  0,34,0,10,0,8,0,6,0,23,0,24,0,25,0,11,0,2,1,0,0,
>>>>                              13,0,12,0,10,5,1,4,1,2,1,4,3,2,3]}
>>>> (meaningless state omitted)
>>>> ** Reason for termination =
>>>> ** {badarg,[{erlang,size,
>>>>
>>>> [[22,3,1,0,173,1,0,0,169,3,3,83,79,75,226,78,81,8,77,79,
>>>>
>>>> 97,154,127,165,92,183,20,239,70,183,158,193,222,139,77,
>>>>
>>>> 116,13,128,12,66,120,63,110,0,0,94,0,255,192,36,192,35,
>>>>
>>>> 192,10,192,9,192,7,192,8,192,40,192,39,192,20,192,19,
>>>>
>>>> 192,17,192,18,192,38,192,37,192,42,192,41,192,5,192,4,
>>>>
>>>> 192,2,192,3,192,15,192,14,192,12,192,13,0,61,0,60,0,47,
>>>>
>>>> 0,5,0,4,0,53,0,10,0,103,0,107,0,51,0,57,0,22,0,175,0,
>>>>
>>>> 174,0,141,0,140,0,138,0,139,0,177,0,176,0,44,0,59,1,0,0,
>>>>
>>>> 34,0,10,0,8,0,6,0,23,0,24,0,25,0,11,0,2,1,0,0,13,0,12,0,
>>>>                       10,5,1,4,1,2,1,4,3,2,3]],
>>>>                     []},
>>>>             {tls_record,get_tls_records_aux,2,
>>>>                         [{file,"tls_record.erl"},{line,122}]},
>>>>             {tls_connection,next_tls_record,2,
>>>>                             [{file,"tls_connection.erl"},{line,484}]},
>>>>             {tls_connection,handle_info,3,
>>>>                             [{file,"tls_connection.erl"},{line,307}]},
>>>>              {gen_fsm,handle_msg,7,[{file,"gen_fsm.erl"},{line,503}]},
>>>>
>>>> {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
>>>>
>>>> The client used was simple curl under mac os, and this crash did not
>>>> reproduce.
>>>> As you see, the packet received as list instead of binary is
>>>> client_hello which is the first message ssl process gets.
>>>> Maybe there is some race condition when changing mode on just
>>>> transport_accepted socket.
>>>>
>>>>
>>>>
>>>> Also, the same server when working with the same curl may stall at this
>>>> line:
>>>>   {ok, _} = ssl:recv(Socket, 0, 10000),
>>>> process_info showed that the worker has {ssl,#sslsocket{}, Data} in its
>>>> mailbox.
>>>>
>>>> I've tuned the server to reproduce this stall easier:
>>>> https://github.com/stolen/ssldump/blob/b3eebc2fa85ca0bd8ed8470c400b972aa737bffc/src/ssldump.erl(master may change)
>>>>
>>>> The client is manually invoked curl which is manually interrupted after
>>>> sufficiently large time of inactivity. The console log looks like this:
>>>>
>>>> stolen@REDACTED:~/ssldump$ ./dummyserver.sh 9999 ../cert/pipe.trololo
>>>> Erlang/OTP 17 [erts-6.0] [source-07b8f44] [smp:2:2] [async-threads:10]
>>>> [hipe] [kernel-poll:false]
>>>>
>>>> Eshell V6.0  (abort with ^G)
>>>> 1> Accepting SSL in <0.46.0>
>>>> Socket state: connection, options: {socket_options,list,0,0,0,true}
>>>>
>>>> =ERROR REPORT==== 17-Apr-2014::09:08:01 ===
>>>> Error in process <0.46.0> with exit value:
>>>> {{badmatch,{error,timeout}},[{ssldump,ssl_acceptor,1,[{file,"src/ssldump.erl"},{line,98}]}]}
>>>>
>>>> Accepting SSL in <0.47.0>
>>>> Socket state: connection, options: {socket_options,binary,0,0,0,false}
>>>> Accepting SSL in <0.48.0>
>>>> Socket state: connection, options: {socket_options,binary,0,0,0,false}
>>>> Accepting SSL in <0.51.0>
>>>> Socket state: connection, options: {socket_options,list,0,0,0,true}
>>>>
>>>> =ERROR REPORT==== 17-Apr-2014::09:09:17 ===
>>>> Error in process <0.51.0> with exit value:
>>>> {{badmatch,{error,closed}},[{ssldump,ssl_acceptor,1,[{file,"src/ssldump.erl"},{line,98}]}]}
>>>>
>>>> Accepting SSL in <0.53.0>
>>>> Socket state: connection, options: {socket_options,binary,0,0,0,false}
>>>> Accepting SSL in <0.55.0>
>>>> Socket state: connection, options: {socket_options,binary,0,0,0,false}
>>>> Accepting SSL in <0.57.0>
>>>> Socket state: connection, options: {socket_options,list,0,0,0,true}
>>>>
>>>> =ERROR REPORT==== 17-Apr-2014::09:11:26 ===
>>>> Error in process <0.57.0> with exit value:
>>>> {{badmatch,{error,closed}},[{ssldump,ssl_acceptor,1,[{file,"src/ssldump.erl"},{line,98}]}]}
>>>>
>>>>
>>>> My interpretation:
>>>>   - Given pool size X, the first of each X requests fails.
>>>>   - First of failing requests catches timeout as expected, other ones
>>>> just stall.
>>>>   - Failing requests have emulated options [{mode, list}, {active,
>>>> true}] while working ones have the opposite [{mode, binary}, {active,
>>>> false}].
>>>>
>>>> After couple of further experiments I discovered that only every (X*N +
>>>> 1)th acceptor inherits listening socket options, other ones always have
>>>> [{mode, binary}, {active, false}].
>>>>
>>>>
>>>>
>>> Humm...   I am not sure if this is what happens in your case but the
>>> following  should fix a potential inheritance problem
>>>
>>> diff --git a/lib/ssl/src/ssl.erl b/lib/ssl/src/ssl.erl
>>> index 743753b..7768bbb 100644
>>> --- a/lib/ssl/src/ssl.erl
>>> +++ b/lib/ssl/src/ssl.erl
>>> @@ -187,6 +187,7 @@ transport_accept(#sslsocket{pid = {ListenSocket,
>>>              {error, Reason}
>>>          end;
>>>      {error, Reason} ->
>>> +        ok = ssl_socket:setopts(Transport, ListenSocket, SocketValues),
>>>          {error, Reason}
>>>      end.
>>>
>>>
>>>
>>>
>>>> And, apart from eventual option inheritance, shouldn't ssl:recv on
>>>> active socket return {error, einval} like gen_tcp:recv does?
>>>>
>>>>
>>>>
>>> I think it should.
>>>
>>>
>>> Regards Ingela Erlang/OTP Team Ericssson AB
>>>
>>>
>>>
>>>
>>>
>>>>  2014-04-16 19:36 GMT+04:00 Ingela Andin <ingela.andin@REDACTED>:
>>>>
>>>> Hi!
>>>>>
>>>>> 2014-04-16 15:56 GMT+02:00 Danil Zagoskin <z@REDACTED>:
>>>>>
>>>>> Thank you for the patch. I'll check it.
>>>>>>
>>>>>
>>>>>
>>>>> New patch, better tested ;),  on branch
>>>>> ia/ssl/user-suites-match-negotiated-version at
>>>>> https://github.com/IngelaAndin/otp
>>>>>
>>>>>
>>>>>>
>>>>>> (Style question: is lists:filtermap better than list comprehension?)
>>>>>>
>>>>>
>>>>>
>>>>> I think just a matter of taste!
>>>>>
>>>>>
>>>>>> > #ssl_options{] is an internal representation of the proplists
>>>>>> input list and should not be a visible in the API, records are generally
>>>>>> not a good API choice as they enforce compile time dependencies.
>>>>>>
>>>>>> Yes, I understand that. I wanted to start broken client to see how
>>>>>> server would work with it.
>>>>>> The option to do that could be using sys:replace_state, but as far as
>>>>>> I understand the code, client_hello is sent immediately after starting a
>>>>>> process, so it seems there is no way to send crafted hello and continue
>>>>>> handshake with couple of lines instead of ssl:connect.
>>>>>>
>>>>>> Given that possibility it could be much easier to reproduce server
>>>>>> crashes when working with buggy clients.
>>>>>> For example, How do I write a test for badmatch in tls_record:decode_cipher_text?
>>>>>> The function takes #connection_states{}, and creating that record
>>>>>> relies on #ssl_options{} which cannot be created without opening a
>>>>>> socket and calling gen_fsm:enter_loop.
>>>>>>
>>>>>>
>>>>> Yes  you have a good point.  I do not have a good short answer.
>>>>> Testing error cases especially when the protocol sometimes
>>>>> tries to hide the real error reason due to security issues and
>>>>> shortcutting the API and creating raw protocol command is not trivial.
>>>>>
>>>>> Regards Ingela Erlang/OTP team - Ericsson AB
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> 2014-04-16 17:07 GMT+04:00 Ingela Andin <ingela.andin@REDACTED>:
>>>>>>
>>>>>> Hi!
>>>>>>>
>>>>>>> 2014-04-16 0:07 GMT+02:00 Danil Zagoskin <z@REDACTED>:
>>>>>>>
>>>>>>> Hi, Ingela.
>>>>>>>>
>>>>>>>> I finally had some time for investigating this problem.
>>>>>>>>
>>>>>>>> I didn't see badarg in erlang:size again, but it definitely
>>>>>>>> appeared in 17.0 with simple accept method described as minimal example at
>>>>>>>> http://www.erlang.org/doc/apps/ssl/using_ssl.html (
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> the difference was transport_accept and ssl_accept were run in
>>>>>>>> separate process)
>>>>>>>>
>>>>>>>
>>>>>>> That  should be ok, you are suppose to be able to do that.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Today I met some new crash:
>>>>>>>> ** {{badmatch,{alert,2,20,{"ssl_cipher.erl",215}}},
>>>>>>>>
>>>>>>>> [{tls_record,decode_cipher_text,2,[{file,"tls_record.erl"},{line,157}]},
>>>>>>>>
>>>>>>>>  {tls_connection,next_record,1,[{file,"tls_connection.erl"},{line,503}]},
>>>>>>>>
>>>>>>>>  {tls_connection,next_state,4,[{file,"tls_connection.erl"},{line,475}]},
>>>>>>>>      {gen_fsm,handle_msg,7,[{file,"gen_fsm.erl"},{line,503}]},
>>>>>>>>
>>>>>>>>  {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
>>>>>>>>
>>>>>>>> That's because ssl_record:decipher/3 may return #alert{}, but
>>>>>>>> tls_record:decode_cipher_text/2 isn't expecting that.
>>>>>>>>
>>>>>>>>
>>>>>>> Hum, looks like a bug. We will look into that.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> As for {case_clause,{4}} in ssl_v3:mac_hash/3 I found that
>>>>>>>> precondition for crash is easily reproducible:
>>>>>>>> https://gist.github.com/stolen/10780653
>>>>>>>> Ordinary servers negotiate with sha hash or close socket while ssl
>>>>>>>> in 17.0 negotiates with sha256 hash.
>>>>>>>> I don't know the protocol good enough to perform further states,
>>>>>>>> but I hope this helps.
>>>>>>>>
>>>>>>>>
>>>>>>> Okej, the following patch should take care of that problem.
>>>>>>>
>>>>>>> diff --git a/lib/ssl/src/ssl.erl b/lib/ssl/src/ssl.erl
>>>>>>> index 743753b..093d4af 100644
>>>>>>> --- a/lib/ssl/src/ssl.erl
>>>>>>> +++ b/lib/ssl/src/ssl.erl
>>>>>>> @@ -953,8 +953,8 @@ handle_cipher_option(Value, Version)  when
>>>>>>> is_list(Value) ->
>>>>>>>      error:_->
>>>>>>>          throw({error, {options, {ciphers, Value}}})
>>>>>>>      end.
>>>>>>> -binary_cipher_suites(Version, []) -> %% Defaults to all supported
>>>>>>> suits
>>>>>>> -    ssl_cipher:suites(Version);
>>>>>>> +binary_cipher_suites(Version, []) ->
>>>>>>> +    [];
>>>>>>>  binary_cipher_suites(Version, [{_,_,_,_}| _] = Ciphers0) -> %%
>>>>>>> Backwards compatibility
>>>>>>>      Ciphers = [{KeyExchange, Cipher, Hash} || {KeyExchange, Cipher,
>>>>>>> Hash, _} <- Ciphers0],
>>>>>>>      binary_cipher_suites(Version, Ciphers);
>>>>>>> diff --git a/lib/ssl/src/ssl_handshake.erl
>>>>>>> b/lib/ssl/src/ssl_handshake.erl
>>>>>>> index 1108edc..93d4efd 100644
>>>>>>> --- a/lib/ssl/src/ssl_handshake.erl
>>>>>>> +++ b/lib/ssl/src/ssl_handshake.erl
>>>>>>> @@ -1017,11 +1017,14 @@ decode_suites('3_bytes', Dec) ->
>>>>>>>  %%-------------Cipeher suite handling
>>>>>>> --------------------------------
>>>>>>>
>>>>>>>  available_suites(UserSuites, Version) ->
>>>>>>> +    VersionSupportedSuites = ssl_cipher:suites(Version),
>>>>>>>      case UserSuites of
>>>>>>>      [] ->
>>>>>>> -        ssl_cipher:suites(Version);
>>>>>>> +        VersionSupportedSuites;
>>>>>>>      _ ->
>>>>>>> -        UserSuites
>>>>>>> +        lists:filtermap(fun(Suite) ->
>>>>>>> +                    lists:member(Suite, VersionSupportedSuites)
>>>>>>> +                end, UserSuites)
>>>>>>>      end.
>>>>>>>
>>>>>>>  available_suites(ServerCert, UserSuites, Version, Curve) ->
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Unfortunately, I don't understand how this code should be tested —
>>>>>>>> most of functions require #state{} or #connection_states{} which require
>>>>>>>> #ssl_options{} and other stuff, and at I have found no exported function
>>>>>>>> creating #ssl_options{} without side-effects.
>>>>>>>>
>>>>>>>
>>>>>>> #ssl_options{] is an internal representation of the proplists input
>>>>>>> list and should not be a visible in the API, records are generally not a
>>>>>>> good API choice as they enforce compile time dependencies.
>>>>>>>
>>>>>>>
>>>>>>> Regards Ingela - Erlang/OTP team - Ericsson AB
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> 2014-04-15 16:52 GMT+04:00 Ingela Andin <ingela.andin@REDACTED>:
>>>>>>>>
>>>>>>>> Hello again!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>  sometimes badarg in
>>>>>>>>>> erlang:size([22,3,1,0,158,1,0,0,154,3,1,83,74|...]) at tls_record.erl:122.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Same symptom was recently reported on erlang-bugs and it turned
>>>>>>>>> out to be due to upgrading a gen_tcp socket in active mode. When upgrading
>>>>>>>>> a gen_tcp socket to an ssl socket it must be put in passive mode ({active,
>>>>>>>>> false}) before the client
>>>>>>>>> is allowed to start the handshake.  Normaly if the upgrade is
>>>>>>>>> negotiated this is not a problem for the server to set the option
>>>>>>>>> before signaling to the client to go ahead with  the handshake. If
>>>>>>>>> the upgrade is only performed on the server side instead of calling the ssl
>>>>>>>>> API ( some people may do that due to the previously lack of possibility to
>>>>>>>>> specify ssl options when calling ssl:ssl_accept with an "sslsocket", this
>>>>>>>>> is no longer the case in 17.0) the listen socket needs to be put in passive
>>>>>>>>> mode (listen options are inherited by the acceptsocket) to make sure that
>>>>>>>>> the it will work, otherwhise it will work sometimes and sometime result in
>>>>>>>>> the error above.
>>>>>>>>>
>>>>>>>>> Regards Ingela Erlang/OTP team - Ericsson AB
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Danil Zagoskin | z@REDACTED
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Danil Zagoskin | z@REDACTED
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Danil Zagoskin | z@REDACTED
>>>>
>>>
>>>
>>
>
>
> --
> Danil Zagoskin | z@REDACTED
>



-- 
Danil Zagoskin | z@REDACTED
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20140418/56281663/attachment.htm>


More information about the erlang-questions mailing list