From paul.joseph.davis@REDACTED  Wed Dec  2 03:04:03 2015
From: paul.joseph.davis@REDACTED (Paul Davis)
Date: Tue, 1 Dec 2015 21:04:03 -0500
Subject: [erlang-bugs] Dirty schedulers and '-smp disable'
In-Reply-To: <CACSRAtXw33oELwFR4O4Ppycr3mgozTQOUwAR7Pjmm9JqUK+-Rw@mail.gmail.com>
References: <CACSRAtVk3+41S=8svBzEUTE95WEGy=7h6AYqUj2ghxFU7agw8A@mail.gmail.com>
 <CAO+zUOVGNrJctfPu_Kavy0OBnvuagQ0+FdgdmBhZT6sD--XXnQ@mail.gmail.com>
 <CACSRAtXw33oELwFR4O4Ppycr3mgozTQOUwAR7Pjmm9JqUK+-Rw@mail.gmail.com>
Message-ID: <CAJ_m3YBzPHTU4izuvN74ZWV+U5JLii1CubNHzxm2WQt7bY7DaQ@mail.gmail.com>

I just butted up against this as well. I was testing some code on a
single-core VirtualBox VM and wasn't used to SMP being disabled by default.
I've reproduced the behavior exactly as Knut described.

One thing further that was also icky is that the ErlNifSysInfo struct
has the dirty_scheduler_support flag set to 1 even when dirty
schedulers don't work (due to smp being disabled on single core VMs).
Thus, if you want to be super duper certain you have to check that
smp_support is enabled as well. While not a terrible inconvenience
once you know about it, I definitely managed to spend two hours
figuring it out.
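
A minimal sketch of that double check, assuming the two flags are read
from ErlNifSysInfo via enif_system_info() inside the NIF's load callback.
The helper name dirty_nifs_usable is hypothetical and not part of the
erl_nif API; it only combines the two flags:

```c
#include <stdbool.h>

/* Hypothetical helper (not part of erl_nif): report whether dirty NIFs
 * can be expected to run. The arguments are meant to be copied from the
 * smp_support and dirty_scheduler_support fields of ErlNifSysInfo:
 *
 *     ErlNifSysInfo info;
 *     enif_system_info(&info, sizeof(info));
 *     ok = dirty_nifs_usable(info.smp_support,
 *                            info.dirty_scheduler_support);
 */
bool dirty_nifs_usable(int smp_support, int dirty_scheduler_support)
{
    /* dirty_scheduler_support alone can be 1 on a non-SMP emulator,
     * so both flags have to be set. */
    return smp_support != 0 && dirty_scheduler_support != 0;
}
```

A load callback could return non-zero when the helper reports false, so
that loading the NIF fails loudly instead of a dirty NIF call hanging
later.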


On Wed, Jul 1, 2015 at 4:25 AM, Knut Nesheim <knutin@REDACTED> wrote:
> Yes, your answer makes sense.
>
> Just to clarify, the VM has been built with smp support, but when it
> boots up on a single-core machine it doesn't enable smp because it
> only detects one logical processor. I was able to reproduce the "dirty
> nif stuck" problem with "erl -smp disable".
>
> Knut
>
> On Tue, Jun 30, 2015 at 5:06 PM, Steve Vinoski <vinoski@REDACTED> wrote:
>>
>>
>> On Tue, Jun 30, 2015 at 10:39 AM, Knut Nesheim <knutin@REDACTED> wrote:
>>>
>>> Dear list,
>>>
>>> I ran into unexpected behaviour in the following situation:
>>>
>>>  * OTP 18.0, compiled from the git tag with dirty schedulers enabled
>>>  * NIF with the ERL_NIF_DIRTY_JOB_CPU_BOUND flag
>>>  * Small machine with only one core (AWS t1.micro)
>>>  * The first log line from startup with no explicit flags looks like
>>> this: Erlang/OTP 18 [erts-7.0] [source] [64-bit] [async-threads:10]
>>> [hipe] [kernel-poll:false]
>>>
>>> When I call the NIF, the calling process hangs forever. When I call it
>>> from the shell, I'm unable to interrupt the process (C-g, i 1 does
>>> nothing useful).
>>>
>>> If I explicitly use '-smp enable' as arguments to erl, the NIF runs
>>> fine. In that case the first log line looks like this: Erlang/OTP 18
>>> [erts-7.0] [source] [64-bit] [smp:1:1] [ds:1:1:10] [async-threads:10]
>>> [hipe] [kernel-poll:false]
>>>
>>> This behaviour got me a bit confused, as there is no indication what
>>> is happening except "something somewhere got stuck". It's not a common
>>> case for me, as most machines have multiple cores except tiny cloud
>>> instances or virtual machines.
>>
>>
>> The short answer is that currently, dirty schedulers always require SMP.
>>
>> The longer answer is that configure should raise an error if this
>> configuration is attempted. I can't recall for sure but I think it behaved
>> like this at one point, but a lot changed for Erlang 18 and so perhaps this
>> config check got lost along the way.
>>
>> --steve
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED
> http://erlang.org/mailman/listinfo/erlang-bugs


From vinoski@REDACTED  Wed Dec  2 14:53:03 2015
From: vinoski@REDACTED (Steve Vinoski)
Date: Wed, 2 Dec 2015 08:53:03 -0500
Subject: [erlang-bugs] Dirty schedulers and '-smp disable'
In-Reply-To: <CAJ_m3YBzPHTU4izuvN74ZWV+U5JLii1CubNHzxm2WQt7bY7DaQ@mail.gmail.com>
References: <CACSRAtVk3+41S=8svBzEUTE95WEGy=7h6AYqUj2ghxFU7agw8A@mail.gmail.com>
 <CAO+zUOVGNrJctfPu_Kavy0OBnvuagQ0+FdgdmBhZT6sD--XXnQ@mail.gmail.com>
 <CACSRAtXw33oELwFR4O4Ppycr3mgozTQOUwAR7Pjmm9JqUK+-Rw@mail.gmail.com>
 <CAJ_m3YBzPHTU4izuvN74ZWV+U5JLii1CubNHzxm2WQt7bY7DaQ@mail.gmail.com>
Message-ID: <CAO+zUOVLBRzqCHPS8PwH04NA22CgPFr_8eutQeR092mrqR+OTg@mail.gmail.com>

On Tue, Dec 1, 2015 at 9:04 PM, Paul Davis <paul.joseph.davis@REDACTED>
wrote:

> I just butted up against this as well. Testing some code on a single
> core virtual box vm and wasn't used to smp being disabled by default.
> I've reproduced the behavior exactly as Knut described.
>
> One thing further that was also icky is that the ErlNifSysInfo struct
> has the dirty_scheduler_support flag set to 1 even when dirty
> schedulers don't work (due to smp being disabled on single core VMs).
> Thus, if you want to be super duper certain you have to check that
> smp_support is enabled as well. While not a terrible inconvenience
> once you know about it, I definitely managed to spend two hours
> figuring it out.
>

That too seems like a bug, for now anyway.

Only the OTP team can authoritatively state the plans for dirty schedulers,
but I'm still involved in working on them and my understanding is there's a
push to get them out of experimental status and into regular feature status
for Erlang 19. Part of that push includes an effort to make them work even
if SMP is disabled.

--steve



From paul.joseph.davis@REDACTED  Wed Dec  2 18:49:36 2015
From: paul.joseph.davis@REDACTED (Paul Davis)
Date: Wed, 2 Dec 2015 12:49:36 -0500
Subject: [erlang-bugs] Dirty schedulers and '-smp disable'
In-Reply-To: <CAO+zUOVLBRzqCHPS8PwH04NA22CgPFr_8eutQeR092mrqR+OTg@mail.gmail.com>
References: <CACSRAtVk3+41S=8svBzEUTE95WEGy=7h6AYqUj2ghxFU7agw8A@mail.gmail.com>
 <CAO+zUOVGNrJctfPu_Kavy0OBnvuagQ0+FdgdmBhZT6sD--XXnQ@mail.gmail.com>
 <CACSRAtXw33oELwFR4O4Ppycr3mgozTQOUwAR7Pjmm9JqUK+-Rw@mail.gmail.com>
 <CAJ_m3YBzPHTU4izuvN74ZWV+U5JLii1CubNHzxm2WQt7bY7DaQ@mail.gmail.com>
 <CAO+zUOVLBRzqCHPS8PwH04NA22CgPFr_8eutQeR092mrqR+OTg@mail.gmail.com>
Message-ID: <CAJ_m3YBhXHXbn=ZVNGWKrKi_i_9r=JP5L-QD2aymVBmMy0V0SQ@mail.gmail.com>

That's fair, although I don't mind that they don't work on non-SMP
VMs. It was just that they failed in a non-obvious manner. For
instance, an error when loading a NIF that specifies a dirty scheduler
in an ErlNifFunc or when passing a dirty scheduler flag to
enif_schedule_nif would've probably been enough to point out the
issue.



From kenji@REDACTED  Thu Dec 17 04:10:39 2015
From: kenji@REDACTED (Kenji Rikitake)
Date: Thu, 17 Dec 2015 12:10:39 +0900
Subject: [erlang-bugs] OTP 18.2 HiPE fix on FreeBSD 10.2
Message-ID: <20151217031039.GA38951@k2r.org>

https://github.com/erlang/otp/pull/925

OTP 18.2 on FreeBSD 10.2-STABLE does not compile with HiPE enabled.
18.1.5 worked OK, so I guess the recent change for musl libc affected it.
The pull request above includes a quick workaround, and I need FreeBSD
people to further test the HiPE functionality. (Any good test cases?)

Regards,
Kenji Rikitake


From kenji@REDACTED  Thu Dec 17 13:12:29 2015
From: kenji@REDACTED (Kenji Rikitake)
Date: Thu, 17 Dec 2015 21:12:29 +0900
Subject: [erlang-bugs] OTP 18.2 HiPE fix on FreeBSD 10.2
In-Reply-To: <20151217031039.GA38951@k2r.org>
References: <20151217031039.GA38951@k2r.org>
Message-ID: <CANWtcNixq29mi-8BqcacgWRFHR4DsA+efjqsE6jLG3FYzCVJ9w@mail.gmail.com>

I mixed up patches for 18.2 and master branches. Here's the fixed one for
18.2:

https://github.com/erlang/otp/pull/926

Kenji


From kenji@REDACTED  Fri Dec 18 04:57:18 2015
From: kenji@REDACTED (Kenji Rikitake)
Date: Fri, 18 Dec 2015 12:57:18 +0900
Subject: [erlang-bugs] OTP 18.2 HiPE fix on FreeBSD 10.2
In-Reply-To: <1450360334.676610.470067657.034227CC@webmail.messagingengine.com>
References: <20151217031039.GA38951@k2r.org>
 <CANWtcNixq29mi-8BqcacgWRFHR4DsA+efjqsE6jLG3FYzCVJ9w@mail.gmail.com>
 <1450360334.676610.470067657.034227CC@webmail.messagingengine.com>
Message-ID: <CANWtcNi3Eab3HgziXMiOEKFJU_-Dmvuj9Xv=n8-USvA_7cL5xg@mail.gmail.com>

Kawano-san: very much appreciated.

I've tested with --enable-hipe --enable-fp-exceptions --enable-native-libs
and so far the BEAM with HiPE seems to be working.

I have to check out the following issues:
* Is the sigaction() handling really OK on FreeBSD?
* Is the dlsym() handling really OK on FreeBSD?
Maybe I need more input from FreeBSD people.

For those who want to test a tentative Port, check here:
https://github.com/jj1bdx/erlang-freebsd-port/tree/18.2-20151218
though I'm sure Jimmy Olgeni, the maintainer of FreeBSD Erlang Ports,
will supersede mine shortly.

Regards,
Kenji Rikitake



From tatsuya@REDACTED  Thu Dec 17 15:05:09 2015
From: tatsuya@REDACTED (Tatsuya Kawano)
Date: Thu, 17 Dec 2015 22:05:09 +0800
Subject: [erlang-bugs] OTP 18.2 HiPE fix on FreeBSD 10.2
In-Reply-To: <CANWtcNixq29mi-8BqcacgWRFHR4DsA+efjqsE6jLG3FYzCVJ9w@mail.gmail.com>
References: <20151217031039.GA38951@k2r.org>
 <CANWtcNixq29mi-8BqcacgWRFHR4DsA+efjqsE6jLG3FYzCVJ9w@mail.gmail.com>
Message-ID: <1450361109.678781.470088585.009E7C7E@webmail.messagingengine.com>

Hi Kenji,

On Thu, Dec 17, 2015, at 08:12 PM CST, Kenji Rikitake wrote:
>> The following includes a quick workaround and I need FreeBSD people
>> to further test the HiPE functionalities. (Any good test cases?)
> ...
> I mixed up patches for 18.2 and master branches. Here's the fixed one for 18.2:
> 
> https://github.com/erlang/otp/pull/926

Thank you for the patch. It worked like a charm; I was able to build OTP
18.2 on FreeBSD 10.2 with HiPE enabled.

So far, I have only tested it against boundary bear
<https://github.com/boundary/bear>, which has HiPE enabled by default.
It passed all eunit cases.

--------------------------------------------------
/home/tatsuya% freebsd-version
10.2-RELEASE-p8

/home/tatsuya% cat .kerlrc 
KERL_CONFIGURE_OPTIONS="--enable-hipe --enable-smp-support
--enable-threads --enable-kernel-poll"

/home/tatsuya% kerl build git https://github.com/jj1bdx/otp.git \
    jj1bdx-18.2-freebsd-hipe-fix-2 18.2_hipe_pr926
Checking Erlang/OTP git repository from
https://github.com/jj1bdx/otp.git...
Building Erlang/OTP 18.2_hipe_pr926 from git, please wait...
Erlang/OTP 18.2_hipe_pr926 from git has been successfully built

/home/tatsuya% kerl install 18.2_hipe_pr926 ~/erlang/18.2_hipe_pr926
Installing Erlang/OTP git (18.2_hipe_pr926) in
/home/tatsuya/erlang/18.2_hipe_pr926...
You can activate this installation running the following command:
. /home/tatsuya/erlang/18.2_hipe_pr926/activate
Later on, you can leave the installation typing:
kerl_deactivate

/home/tatsuya% . /home/tatsuya/erlang/18.2_hipe_pr926/activate
/home/tatsuya% erl
Erlang/OTP 18 [erts-7.2] [source-e616e04] [64-bit] [smp:8:8]
[async-threads:10] [hipe] [kernel-poll:false]

Eshell V7.2  (abort with ^G)
1>
User switch command
 --> q

/home/tatsuya% cd workhub/dev/hibari/hibari/lib/bear/
/home/tatsuya/workhub/dev/hibari/hibari/lib/bear% grep native src/*
src/bear.erl:-compile([native]).

/home/tatsuya/workhub/dev/hibari/hibari/lib/bear% ./rebar clean compile
eunit
==> bear (clean)
==> bear (compile)
Compiled src/bear.erl
==> bear (eunit)
Compiled test/bear_test.erl
Compiled src/bear.erl
  All 47 tests passed.
Cover analysis:
/usr/home/tatsuya/workhub/dev/hibari/hibari/lib/bear/.eunit/index.html
--------------------------------------------------

Thanks,
Tatsuya Kawano


From isreal-erlang-bugs-at-erlang.org@REDACTED  Fri Dec 18 15:19:45 2015
From: isreal-erlang-bugs-at-erlang.org@REDACTED (David Buckley)
Date: Fri, 18 Dec 2015 14:19:45 +0000
Subject: [erlang-bugs] NIF .so reload issues
Message-ID: <20151218141945.GA3897@cirno.bucko.me.uk>

Hi! I was playing with writing a NIF, and found I couldn't reload.

I'm doing the sort-of accepted thing of loading the nif in an on_load
function, though if I just execute the function just after load, I get
the same behaviour, so I don't think that's at issue.

Basically, what seems to be the case is that while erlang will
re-initialise my nif code (with 'upgrade'), it won't load a /new/
version of the nif code unless I completely purge the (erlang) code from
the runtime, forcing erlang to recheck the module. I'm guessing erlang
is caching the nif. Changing the compiled (.so) filename each time fixes
the problem.


Example code here:

https://gist.github.com/bucko909/a3b5099c74bf267e65db

test_reload_post_purge and test_reload_post_reload_complete_purge work
fine (erts-7.1), but the other three don't reload the .so file as I
would expect.


Is this fixable, or must I manually add a purge() in my init() function
before load_nif? (And why does that work? Because at that point there's
no evidence that the new module will have a load_nif, so the old dlopen
can be discarded?)

Seems like in general if the .so file has changed and a module is
reloaded, the user probably wants the new .so file, too! It's at least
worth adding a note to the docs (or a new return value?) if it's an evil
dlopen restriction.

-- 
David Buckley


From sverker.eriksson@REDACTED  Fri Dec 18 17:41:24 2015
From: sverker.eriksson@REDACTED (Sverker Eriksson)
Date: Fri, 18 Dec 2015 17:41:24 +0100
Subject: [erlang-bugs] NIF .so reload issues
In-Reply-To: <20151218141945.GA3897@cirno.bucko.me.uk>
References: <20151218141945.GA3897@cirno.bucko.me.uk>
Message-ID: <56743734.3040509@ericsson.com>

Hi David,

Yes, this is a dlopen restriction and also an ambiguity as I've heard
different behaviour reported depending on OS.

My Linux man page for dlopen says "If the same library is loaded again
with dlopen(), the same file handle is returned". But it does not
specify what "the same" actually means.

The Erlang VM has to keep the old .so file loaded until the module is
safely purged [*], as there may exist Erlang processes still lingering
in the old code. Trying to execute unloaded native code does not behave
well.

When you call load_nif with the same library name (as the
not yet purged one), dlopen thinks it's "the same" library
and just returns the same handle again.

What to do?

Rename the .so library, give it a version number. Or maybe
putting it in a different directory will work (?).
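
One way to follow the renaming advice is to derive the library path
from a version number, for example in a small build helper. The sketch
below is only an illustration: versioned_nif_path and the "_v<N>" suffix
are assumptions, not an OTP convention.

```c
#include <stdio.h>

/* Hypothetical helper: write "<base>_v<version>" into buf, for example
 * "priv/my_nif_v3". Returns 0 on success, -1 if buf is too small or
 * formatting fails. */
int versioned_nif_path(char *buf, size_t buflen,
                       const char *base, unsigned version)
{
    int n = snprintf(buf, buflen, "%s_v%u", base, version);

    /* snprintf returns the length it wanted to write; anything that
     * does not fit (including the terminating NUL) is a failure. */
    if (n < 0 || (size_t)n >= buflen)
        return -1;
    return 0;
}
```

The path is built without a file extension because erlang:load_nif/2
takes the library path minus the OS-dependent extension. Each new build
then gets a name that dlopen has not seen before.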

Add something about this problem to the erl_nif docs. Yes, that would
be nice.

I'm hesitant to recommend purging in on_load. The on_load feature
is still experimental and we have some known problems with bad
behaviour, especially in the error cases when on_load fails.
To fix that we may have to limit what you are allowed
to do in on_load and code purging might be such a limitation.


[*] Purging may actually not be enough. If the NIF library has created
resource objects with a destructor callback, it will not be unloaded until
the last resource object has been garbage collected.

/Sverker, Erlang/OTP




From isreal-erlang-bugs-at-erlang.org@REDACTED  Fri Dec 18 18:07:33 2015
From: isreal-erlang-bugs-at-erlang.org@REDACTED (David Buckley)
Date: Fri, 18 Dec 2015 17:07:33 +0000
Subject: [erlang-bugs] NIF .so reload issues
In-Reply-To: <56743734.3040509@ericsson.com>
References: <20151218141945.GA3897@cirno.bucko.me.uk>
 <56743734.3040509@ericsson.com>
Message-ID: <20151218170733.GA10347@cirno.bucko.me.uk>

On Fri, Dec 18, 2015 at 05:41:24PM +0100, Sverker Eriksson wrote:
> [*] Purging may actually not be enough. If the NIF library has created
> resource objects with a destructor callback, it will not be unloaded until
> the last resource object has been garbage collected.

Hmmm, I was going to create resources!

I guess for development I'll add a hack that just creates a link to the
file with a temporary name before loading it, so that a new handle to it
is created each time. There /is/ a secret RTLD_PRIVATE flag for dlopen
-- that is not apparently supported on any OS mentioned on the first
page of google -- to get a private instance.

For production, versioning the library code ought to be fine. Most
system libraries already contain version numbers in the filename, and I
suppose this is part of why. It's only reloading for rapid development
that is causing pain here!

Is the old dlopen bound to the old (Erlang) code? That is, if I
instigate this hack, and leak resources somehow while reloading often,
will I have problems reloading the module, cause processes to be
violently uprooted as with purge, or simply leak dlopen handles until I
clean up?

Is there any chance of purge/soft_purge being extended to cover nif
resources?

-- 
David Buckley


From sverker.eriksson@REDACTED  Fri Dec 18 18:55:27 2015
From: sverker.eriksson@REDACTED (Sverker Eriksson)
Date: Fri, 18 Dec 2015 18:55:27 +0100
Subject: [erlang-bugs] NIF .so reload issues
In-Reply-To: <20151218170733.GA10347@cirno.bucko.me.uk>
References: <20151218141945.GA3897@cirno.bucko.me.uk>
 <56743734.3040509@ericsson.com> <20151218170733.GA10347@cirno.bucko.me.uk>
Message-ID: <5674488F.2080809@ericsson.com>


On 12/18/2015 06:07 PM, David Buckley wrote:
> Hmmm, I was going to create resources!
>
> I guess for development I'll add a hack that just creates a link to the
> file with a temporary name before loading it, so that a new handle to it
> is created each time.
I'm not sure dlopen is fooled by a link. You may need to make a real copy.

> There /is/ a secret RTLD_PRIVATE flag for dlopen
> -- that is not apparently supported on any OS mentioned on the first
> page of google -- to get a private instance.
>
> For production, versioning the library code ought to be fine. Most
> system libraries already contain version numbers in the filename, and I
> suppose this is part of why. It's only reloading for rapid development
> that is causing pain here!
>
> Is the old dlopen bound to the old (Erlang) code?
Yes. A NIF library is like an extension of the Erlang code
that loaded it.

> That is, if I
> instigate this hack, and leak resources somehow while reloading often,
> will I have problems reloading the module, cause processes to be
> violently uprooted as with purge, or simply leak dlopen handles until I
> clean up?
If you leak resources then you will also leak the loaded libraries
that contain the destructor functions of those resources.

However, there is a way for your upgraded NIF library to take over
ownership of existing resources by passing the ERL_NIF_RT_TAKEOVER
flag to enif_open_resource_type(). By doing that, the destructor in your
new library will be called instead and the old library can be unloaded
when the module is purged. Your new library versions must of course
be data compatible and know how to handle old resources.

>
> Is there any chance of purge/soft_purge being extended to cover nif
> resources?
>
Oh, that's a good question. Why don't we do that already?
I have to think about that.


/Sverker, Erlang/OTP


From isreal-erlang-bugs-at-erlang.org@REDACTED  Sun Dec 20 20:48:42 2015
From: isreal-erlang-bugs-at-erlang.org@REDACTED (David Buckley)
Date: Sun, 20 Dec 2015 19:48:42 +0000
Subject: [erlang-bugs] NIF resources are not checked on module unload
Message-ID: <20151220194841.GA18879@cirno.bucko.me.uk>

While playing with implementing a NIF, I found some segfaults, and I
eventually got it down to the test case here:

https://gist.github.com/bucko909/a841c716ede6d3903a13

It looks like it's down to my not re-registering the resource on upgrade
(presumably the handle goes stale, is garbage collected, and eventually
it corrupts memory causing segfaults in unrelated emulator code).

I fell into this trap by using code from
https://github.com/davisp/nif-examples -- which I've sent a pull request
to fix.

I fixed my problem by adding enif_open_resource_type to the upgrade function
once I'd clocked my error, so under normal and correct use, I think the
emulator is doing OK.

However, it looks like if I /don't/ reopen it, it's not properly
deleted, and the documentation seems to leave open the possibility of
doing just this ("Existing resource objects, of a module that is
upgraded, must either be deleted or taken over by the new NIF library").
References to resources with the old handle remain uncleaned. Even if I
completely destroy the old module, so that unload is called, these stale
resources persist until a garbage collection. They actually survive
/many/ purge/load cycles in my example code before being garbage
collected and segfaulting the emulator.


Ideas, based on my interpretation of the bug:

If there are lingering resources, which are not TAKEOVER-ed in the
upgrade function, and have a dtor, this should cause an immediate
emulator panic. I can't think of any other behaviour which is safe here.
If they don't have a dtor, it seems safe to keep them around, but their
resource handle needs to be kept alive until they are all destroyed. It
ought to be impossible to create new resources using the old handle, at
least when there is a dtor defined (can a 'dead' flag be set?).

Knowing this behaviour, an application author writing an upgrade
function for this NIF library might at least attempt to destroy all of
his objects when making such an upgrade, in order to have the emulator
survive!


Another approach is to require an /explicit/ delete of old resources,
perhaps simply a call to "enif_delete_unused_resources" or an iteration
of "enif_delete_resource" over "enif_list_resources", and have this call
fail where the old resources are still allocated. Perhaps the library
author could force a purge or panic the emulator themselves at this
point. The emulator should panic if a resource is neither deleted nor
reopened with TAKEOVER.

-- 
David Buckley


From ulf@REDACTED  Wed Dec 23 18:38:10 2015
From: ulf@REDACTED (Ulf Wiger)
Date: Wed, 23 Dec 2015 18:38:10 +0100
Subject: [erlang-bugs] SSL handshake crash
Message-ID: <CADPRLo_vd1dcgYD1RGNtCdQJVdgRHfSe1EuJO0B1h+tbMv3HAg@mail.gmail.com>

Hmm... I sent this to erlang-bugs, but it didn't seem to get through.

When connecting some Android software to an Erlang node using TLS, we
sometimes (about 1 in 3 or 4 times) get the following errors:

2015-12-22 15:31:00.772 [error] <0.210.0> gen_fsm <0.210.0> in state hello
terminated with reason: no function clause matching
ssl_handshake:update_handshake_history(undefined,
<<1,0,0,175,3,1,86,121,221,42,209,19,198,53,3,42,92,9,16,158,197,5,169,29,247,96,14,32,123,176,...>>)
line 450

15:31:00.783<dlink_tls_conn/327>dlink_tls_conn:terminate(): Reason:
{{function_clause,[{ssl_handshake,update_handshake_history,[undefined,<<1,0,0,175,3,1,86,121,221,42,209,19,198,53,3,42,92,9,16,158,197,5,169,29,247,96,14,32,123,176,109,210,170,150,204,23,32,228,0,0,70,0,4,0,5,0,47,0,53,192,2,192,4,192,5,192,12,192,14,192,15,192,7,192,9,192,10,192,17,192,19,192,20,0,51,0,57,0,50,0,56,0,10,192,3,192,13,192,8,192,18,0,22,0,19,0,9,0,21,0,18,0,3,0,8,0,20,0,17,0,255,1,0,0,64,0,11,0,4,3,0,1,2,0,10,0,52,0,50,0,14,0,13,0,25,0,11,0,12,0,24,0,9,0,10,0,22,0,23,0,8,0,6,0,7,0,20,0,21,0,4,0,5,0,18,0,19,0,1,0,2,0,3,0,15,0,16,0,17>>],[{file,"ssl_handshake.erl"},{line,450}]},{tls_connection,'-next_state/4-fun-0-',3,[{file,"tls_connection.erl"},{line,458}]},{tls_connection,next_state,4,[{file,"tls_connection.erl"},{line,467}]},{gen_fsm,handle_msg,7,[{file,"gen_fsm.erl"},{line,518}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]},{gen_fsm,sync_send_all_state_event,[<0.210.0>,{start,infinity},infinity]}}

2015-12-22 15:31:00.784 [error] <0.210.0> CRASH REPORT Process <0.210.0>
with 0 neighbours exited with reason: no function clause matching
ssl_handshake:update_handshake_history(undefined,
<<1,0,0,175,3,1,86,121,221,42,209,19,198,53,3,42,92,9,16,158,197,5,169,29,247,96,14,32,123,176,...>>)
line 450 in gen_fsm:terminate/7 line 626

2015-12-22 15:31:00.785 [error] <0.209.0> gen_server <0.209.0> terminated
with reason:
{{function_clause,[{ssl_handshake,update_handshake_history,[undefined,<<1,0,0,175,3,1,86,121,221,42,209,19,198,53,3,42,92,9,16,158,197,5,169,29,247,96,14,32,123,176,109,210,170,150,204,23,32,228,0,0,70,0,4,0,5,0,47,0,53,192,2,192,4,192,5,192,12,192,14,192,15,192,7,192,9,192,10,192,17,192,19,192,20,0,51,0,57,0,50,0,56,0,10,192,3,192,13,192,8,192,18,0,22,0,19,0,9,0,21,0,18,0,3,0,8,0,20,0,17,0,255,1,0,0,64,0,11,0,4,3,0,1,2,0,10,0,52,0,50,0,14,0,13,0,25,0,11,0,12,0,24,0,9,0,10,0,22,0,23,0,8,0,...>>],...},...]},...}
in gen_fsm:sync_send_all_state_event/3 line 257

2015-12-22 15:31:00.786 [error] <0.209.0> CRASH REPORT Process <0.209.0>
with 0 neighbours exited with reason:
{{function_clause,[{ssl_handshake,update_handshake_history,[undefined,<<1,0,0,175,3,1,86,121,221,42,209,19,198,53,3,42,92,9,16,158,197,5,169,29,247,96,14,32,123,176,109,210,170,150,204,23,32,228,0,0,70,0,4,0,5,0,47,0,53,192,2,192,4,192,5,192,12,192,14,192,15,192,7,192,9,192,10,192,17,192,19,192,20,0,51,0,57,0,50,0,56,0,10,192,3,192,13,192,8,192,18,0,22,0,19,0,9,0,21,0,18,0,3,0,8,0,20,0,17,0,255,1,0,0,64,0,11,0,4,3,0,1,2,0,10,0,52,0,50,0,14,0,13,0,25,0,11,0,12,0,24,0,9,0,10,0,22,0,23,0,8,0,...>>],...},...]},...}
in gen_server:terminate/7 line 826

2015-12-22 15:31:00.787 [error] <0.109.0> Supervisor tls_connection_sup had
child undefined started with {tls_connection,start_link,undefined} at
<0.210.0> exit with reason no function clause
matching ssl_handshake:update_handshake_history(undefined,
<<1,0,0,175,3,1,86,121,221,42,209,19,198,53,3,42,92,9,16,158,197,5,169,29,247,96,14,32,123,176,...>>)
line 450 in context child_terminated


We run Erlang/OTP 18 [erts-7.2] with ssl-7.2, and the Erlang side has
the following options:

[{verify,verify_peer},
{certfile,"/home/.../device_cert.crt"},
{keyfile,"/home/.../device_key.pem"},
{cacertfile,"/home/.../root_cert.crt"},
{verify_fun,{#Fun<dlink_tls_conn.65.24728257>,{'RSAPublicKey',...}}},
{partial_chain,#Fun<dlink_tls_conn.64.24728257>}]

Basically, the verify_fun validates a self-signed cert
https://github.com/PDXostc/rvi_core/blob/develop/components/dlink_tls/src/dlink_tls_conn.erl#L393

and the partial_chain fun most likely does much less than it should
https://github.com/PDXostc/rvi_core/blob/develop/components/dlink_tls/src/dlink_tls_conn.erl#L421

On the Android side, we're using Android 4.4.2 (API 19).

It feels like a timing-related problem on the Erlang side.

Let me know if you need more information.

BR,
Ulf W

From z@REDACTED  Thu Dec 24 10:22:24 2015
From: z@REDACTED (Danil Zagoskin)
Date: Thu, 24 Dec 2015 12:22:24 +0300
Subject: [erlang-bugs] SSL handshake crash
In-Reply-To: <CADPRLo_vd1dcgYD1RGNtCdQJVdgRHfSe1EuJO0B1h+tbMv3HAg@mail.gmail.com>
References: <CADPRLo_vd1dcgYD1RGNtCdQJVdgRHfSe1EuJO0B1h+tbMv3HAg@mail.gmail.com>
Message-ID: <CAJ6dJEj6rZFiF4=dq34ugASneKVTuo_ssQHOU-Us2hmcAZCAnw@mail.gmail.com>

Hi!

I have the same issue, though less often.
It seems to appear only when upgrading a plain socket to TLS (XMPP STARTTLS
in my case).

Possibly it's some kind of race condition when the client sends its TLS hello
before the server calls ssl_accept(). Maybe an active/passive socket mode
issue.

If you control the client code, could you add some sleep before starttls
and check if that fixes the issue?

-- 
Danil Zagoskin | z@REDACTED

From Ingela.Anderton.Andin@REDACTED  Fri Dec 25 12:54:58 2015
From: Ingela.Anderton.Andin@REDACTED (Ingela Anderton Andin)
Date: Fri, 25 Dec 2015 11:54:58 +0000
Subject: [erlang-bugs] SSL handshake crash
In-Reply-To: <CAJ6dJEj6rZFiF4=dq34ugASneKVTuo_ssQHOU-Us2hmcAZCAnw@mail.gmail.com>
References: <CADPRLo_vd1dcgYD1RGNtCdQJVdgRHfSe1EuJO0B1h+tbMv3HAg@mail.gmail.com>,
 <CAJ6dJEj6rZFiF4=dq34ugASneKVTuo_ssQHOU-Us2hmcAZCAnw@mail.gmail.com>
Message-ID: <B3CF142BE0AC334585C9F2076B18F8D2532C1B4E@ESESSMB205.ericsson.se>

Hi!


From the ssl User's Guide:


 "Ensure active is set to false before trying to upgrade a connection to an SSL connection, otherwise SSL handshake messages can be delivered to the wrong process."
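
As a rough sketch of what that advice looks like in code: the module and
function names below are hypothetical, and ssl:ssl_accept/3 is the
OTP 18 call discussed in this thread (later OTP releases provide
ssl:handshake/3 for the same purpose).

```erlang
-module(starttls_sketch).
-export([ensure_passive/1, upgrade/3]).

%% Put a connected gen_tcp socket into passive mode and confirm it, so
%% that incoming bytes stay in the socket instead of being delivered as
%% messages to the controlling process.
ensure_passive(TcpSocket) ->
    ok = inet:setopts(TcpSocket, [{active, false}]),
    {ok, [{active, false}]} = inet:getopts(TcpSocket, [active]),
    ok.

%% Server-side upgrade of a plain TCP socket to TLS.
%% TlsOpts is the usual ssl option list (certfile, keyfile, ...).
upgrade(TcpSocket, TlsOpts, Timeout) ->
    ok = ensure_passive(TcpSocket),
    %% OTP 18 API; returns {ok, SslSocket} | {error, Reason}.
    ssl:ssl_accept(TcpSocket, TlsOpts, Timeout).
```

Passive mode has to be in effect before the client is told to start
TLS. If the socket was active at any earlier point, a client hello may
already be sitting in the controlling process's mailbox, and this sketch
does not handle that case.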


Regards Ingela Erlang/OTP team - Ericsson AB


From ulf@REDACTED  Fri Dec 25 21:40:09 2015
From: ulf@REDACTED (Ulf Wiger)
Date: Fri, 25 Dec 2015 21:40:09 +0100
Subject: [erlang-bugs] SSL handshake crash
In-Reply-To: <B3CF142BE0AC334585C9F2076B18F8D2532C1B4E@ESESSMB205.ericsson.se>
References: <CADPRLo_vd1dcgYD1RGNtCdQJVdgRHfSe1EuJO0B1h+tbMv3HAg@mail.gmail.com>
 <CAJ6dJEj6rZFiF4=dq34ugASneKVTuo_ssQHOU-Us2hmcAZCAnw@mail.gmail.com>
 <B3CF142BE0AC334585C9F2076B18F8D2532C1B4E@ESESSMB205.ericsson.se>
Message-ID: <CADPRLo-YmQofubWL6KiGfnm3NBYcQamfgJG=hs3Nc+GTtBDSFg@mail.gmail.com>

Hi Ingela,

'active' should be set to false:

https://github.com/PDXostc/rvi_core/blob/develop/components/dlink_tls/src/dlink_tls_conn.erl#L346

BR,
Ulf W


From ulf@REDACTED  Sat Dec 26 21:18:42 2015
From: ulf@REDACTED (Ulf Wiger)
Date: Sat, 26 Dec 2015 21:18:42 +0100
Subject: [erlang-bugs] SSL handshake crash
In-Reply-To: <CADPRLo-YmQofubWL6KiGfnm3NBYcQamfgJG=hs3Nc+GTtBDSFg@mail.gmail.com>
References: <CADPRLo_vd1dcgYD1RGNtCdQJVdgRHfSe1EuJO0B1h+tbMv3HAg@mail.gmail.com>
 <CAJ6dJEj6rZFiF4=dq34ugASneKVTuo_ssQHOU-Us2hmcAZCAnw@mail.gmail.com>
 <B3CF142BE0AC334585C9F2076B18F8D2532C1B4E@ESESSMB205.ericsson.se>
 <CADPRLo-YmQofubWL6KiGfnm3NBYcQamfgJG=hs3Nc+GTtBDSFg@mail.gmail.com>
Message-ID: <CADPRLo_HoC7SykVukwpA=qJUY_046poLvaxv6snG0fpBnoD2JQ@mail.gmail.com>

To clarify, as far as I can tell, the code in question does set 'active' to
false.

BR,
Ulf W
