From paul.joseph.davis@REDACTED Wed Dec 2 03:04:03 2015
From: paul.joseph.davis@REDACTED (Paul Davis)
Date: Tue, 1 Dec 2015 21:04:03 -0500
Subject: [erlang-bugs] Dirty schedulers and '-smp disable'
In-Reply-To:
References:
Message-ID:

I just butted up against this as well. Testing some code on a single
core virtual box vm and wasn't used to smp being disabled by default.
I've reproduced the behavior exactly as Knut described.

One thing further that was also icky is that the ErlNifSysInfo struct
has the dirty_scheduler_support flag set to 1 even when dirty
schedulers don't work (due to smp being disabled on single core VMs).
Thus, if you want to be super duper certain you have to check that
smp_support is enabled as well. While not a terrible inconvenience
once you know about it, I definitely managed to spend two hours
figuring it out.

On Wed, Jul 1, 2015 at 4:25 AM, Knut Nesheim wrote:
> Yes, your answer makes sense.
>
> Just to clarify, the VM has been built with smp support, but when it
> boots up on a single-core machine it doesn't enable smp because it
> only detects one logical processor. I was able to reproduce the "dirty
> nif stuck" problem with "erl -smp disable".
>
> Knut
>
> On Tue, Jun 30, 2015 at 5:06 PM, Steve Vinoski wrote:
>>
>>
>> On Tue, Jun 30, 2015 at 10:39 AM, Knut Nesheim wrote:
>>>
>>> Dear list,
>>>
>>> I ran into unexpected behaviour in the following situation:
>>>
>>> * OTP 18.0, compiled from the git tag with dirty schedulers enabled
>>> * NIF with the ERL_NIF_DIRTY_JOB_CPU_BOUND flag
>>> * Small machine with only one core (AWS t1.micro)
>>> * The first log line from startup with no explicit flags looks like
>>> this: Erlang/OTP 18 [erts-7.0] [source] [64-bit] [async-threads:10]
>>> [hipe] [kernel-poll:false]
>>>
>>> When I call the NIF, the calling process hangs forever. When I call it
>>> from the shell, I'm unable to interrupt the process (C-g, i 1 does
>>> nothing useful).
>>>
>>> If I explicitly use '-smp enable' as arguments to erl, the NIF runs
>>> fine. In that case the first log line looks like this: Erlang/OTP 18
>>> [erts-7.0] [source] [64-bit] [smp:1:1] [ds:1:1:10] [async-threads:10]
>>> [hipe] [kernel-poll:false]
>>>
>>> This behaviour got me a bit confused, as there is no indication what
>>> is happening except "something somewhere got stuck". It's not a common
>>> case for me, as most machines have multiple cores except tiny cloud
>>> instances or virtual machines.
>>
>>
>> The short answer is that currently, dirty schedulers always require SMP.
>>
>> The longer answer is that configure should raise an error if this
>> configuration is attempted. I can't recall for sure but I think it behaved
>> like this at one point, but a lot changed for Erlang 18 and so perhaps this
>> config check got lost along the way.
>>
>> --steve
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED
> http://erlang.org/mailman/listinfo/erlang-bugs

From vinoski@REDACTED Wed Dec 2 14:53:03 2015
From: vinoski@REDACTED (Steve Vinoski)
Date: Wed, 2 Dec 2015 08:53:03 -0500
Subject: [erlang-bugs] Dirty schedulers and '-smp disable'
In-Reply-To:
References:
Message-ID:

On Tue, Dec 1, 2015 at 9:04 PM, Paul Davis wrote:

> I just butted up against this as well. Testing some code on a single
> core virtual box vm and wasn't used to smp being disabled by default.
> I've reproduced the behavior exactly as Knut described.
>
> One thing further that was also icky is that the ErlNifSysInfo struct
> has the dirty_scheduler_support flag set to 1 even when dirty
> schedulers don't work (due to smp being disabled on single core VMs).
> Thus, if you want to be super duper certain you have to check that
> smp_support is enabled as well. While not a terrible inconvenience
> once you know about it, I definitely managed to spend two hours
> figuring it out.
>

That too seems like a bug, for now anyway.
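The double check Paul describes could be sketched in C roughly as follows. This is a minimal sketch and not code from the thread: `sys_info_flags` and `dirty_schedulers_usable` are hypothetical names, and the struct is only a stand-in for the two `ErlNifSysInfo` fields named above (`smp_support` and `dirty_scheduler_support`) so that the snippet compiles without `erl_nif.h`. In a real NIF the two values would presumably be copied from the struct filled in by `enif_system_info()`.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two ErlNifSysInfo fields discussed above. */
struct sys_info_flags {
    int smp_support;             /* non-zero when the runtime system has SMP support */
    int dirty_scheduler_support; /* non-zero when dirty schedulers were compiled in */
};

/*
 * Per the report in this thread (OTP 18), dirty_scheduler_support alone is
 * not enough: treat dirty schedulers as usable only when both flags are set.
 */
static bool dirty_schedulers_usable(const struct sys_info_flags *info)
{
    return info->dirty_scheduler_support != 0 && info->smp_support != 0;
}
```

A NIF library could run such a check in its load callback and fall back to a non-dirty code path, or refuse to load, instead of letting the calling process hang.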

Only the OTP team can authoritatively state the plans for dirty schedulers,
but I'm still involved in working on them and my understanding is there's a
push to get them out of experimental status and into regular feature status
for Erlang 19. Part of that push includes an effort to make them work even
if SMP is disabled.

--steve

From paul.joseph.davis@REDACTED Wed Dec 2 18:49:36 2015
From: paul.joseph.davis@REDACTED (Paul Davis)
Date: Wed, 2 Dec 2015 12:49:36 -0500
Subject: [erlang-bugs] Dirty schedulers and '-smp disable'
In-Reply-To:
References:
Message-ID:

That's fair. Although I don't care that they don't work on non-SMP
VMs. It was just that they failed in a non-obvious manner. For
instance, an error when loading a NIF that specifies a dirty scheduler
in an ErlNifFunc or when passing a dirty scheduler flag to
enif_schedule_nif would've probably been enough to point out the
issue.

From kenji@REDACTED Thu Dec 17 04:10:39 2015
From: kenji@REDACTED (Kenji Rikitake)
Date: Thu, 17 Dec 2015 12:10:39 +0900
Subject: [erlang-bugs] OTP 18.2 HiPE fix on FreeBSD 10.2
Message-ID: <20151217031039.GA38951@k2r.org>

https://github.com/erlang/otp/pull/925

OTP 18.2 on FreeBSD 10.2-STABLE does not compile with HiPE enabled.
18.1.5 worked ok, so I guess the recent change for musl libc affected.
The following includes a quick workaround and I need FreeBSD people
to further test the HiPE functionalities. (Any good test cases?)

Regards,
Kenji Rikitake

From kenji@REDACTED Thu Dec 17 13:12:29 2015
From: kenji@REDACTED (Kenji Rikitake)
Date: Thu, 17 Dec 2015 21:12:29 +0900
Subject: [erlang-bugs] OTP 18.2 HiPE fix on FreeBSD 10.2
In-Reply-To: <20151217031039.GA38951@k2r.org>
References: <20151217031039.GA38951@k2r.org>
Message-ID:

I mixed up patches for 18.2 and master branches.
Here's the fixed one for 18.2:

https://github.com/erlang/otp/pull/926

Kenji

From kenji@REDACTED Fri Dec 18 04:57:18 2015
From: kenji@REDACTED (Kenji Rikitake)
Date: Fri, 18 Dec 2015 12:57:18 +0900
Subject: [erlang-bugs] OTP 18.2 HiPE fix on FreeBSD 10.2
In-Reply-To: <1450360334.676610.470067657.034227CC@webmail.messagingengine.com>
References: <20151217031039.GA38951@k2r.org> <1450360334.676610.470067657.034227CC@webmail.messagingengine.com>
Message-ID:

Kawano-san: very much appreciated.

I've tested with --enable-hipe --enable-fp-exceptions --enable-native-libs
and so far the BEAM with HiPE seems to be working.

I have to check out the following issues:

* Is the sigaction() handling really OK on FreeBSD?
* Is the dlsym() handling really OK on FreeBSD?

Maybe I need more input from FreeBSD people.

For those who want to test a tentative Port, check here:
https://github.com/jj1bdx/erlang-freebsd-port/tree/18.2-20151218
though I'm sure Jimmy Olgeni, the maintainer of FreeBSD Erlang Ports,
will override mine in a short period.

Regards,
Kenji Rikitake

From tatsuya@REDACTED Thu Dec 17 15:05:09 2015
From: tatsuya@REDACTED (Tatsuya Kawano)
Date: Thu, 17 Dec 2015 22:05:09 +0800
Subject: [erlang-bugs] OTP 18.2 HiPE fix on FreeBSD 10.2
In-Reply-To:
References: <20151217031039.GA38951@k2r.org>
Message-ID: <1450361109.678781.470088585.009E7C7E@webmail.messagingengine.com>

Hi Kenji,

On Thu, Dec 17, 2015, at 08:12 PM CST, Kenji Rikitake wrote:
>> The following includes a quick workaround and I need FreeBSD people
>> to further test the HiPE functionalities. (Any good test cases?)
> ...
> I mixed up patches for 18.2 and master branches. Here's the fixed one for 18.2:
>
> https://github.com/erlang/otp/pull/926

Thank you for the patch. It worked like a charm; I was able to build OTP
18.2 on FreeBSD 10.2 with HiPE enabled.

So far, I have only tested it against boundary bear, which has HiPE
enabled by default. It passed all eunit cases.

--------------------------------------------------
/home/tatsuya% freebsd-version
10.2-RELEASE-p8

/home/tatsuya% cat .kerlrc
KERL_CONFIGURE_OPTIONS="--enable-hipe --enable-smp-support --enable-threads --enable-kernel-poll"

/home/tatsuya% kerl build git https://github.com/jj1bdx/otp.git \
    jj1bdx-18.2-freebsd-hipe-fix-2 18.2_hipe_pr926
Checking Erlang/OTP git repository from https://github.com/jj1bdx/otp.git...
Building Erlang/OTP 18.2_hipe_pr926 from git, please wait...
Erlang/OTP 18.2_hipe_pr926 from git has been successfully built

/home/tatsuya% kerl install 18.2_hipe_pr926 ~/erlang/18.2_hipe_pr926
Installing Erlang/OTP git (18.2_hipe_pr926) in /home/tatsuya/erlang/18.2_hipe_pr926...
You can activate this installation running the following command:
. /home/tatsuya/erlang/18.2_hipe_pr926/activate
Later on, you can leave the installation typing:
kerl_deactivate

/home/tatsuya% . /home/tatsuya/erlang/18.2_hipe_pr926/activate
/home/tatsuya% erl
Erlang/OTP 18 [erts-7.2] [source-e616e04] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V7.2 (abort with ^G)
1>
User switch command
 --> q

/home/tatsuya% cd workhub/dev/hibari/hibari/lib/bear/
/home/tatsuya/workhub/dev/hibari/hibari/lib/bear% grep native src/*
src/bear.erl:-compile([native]).

/home/tatsuya/workhub/dev/hibari/hibari/lib/bear% ./rebar clean compile eunit
==> bear (clean)
==> bear (compile)
Compiled src/bear.erl
==> bear (eunit)
Compiled test/bear_test.erl
Compiled src/bear.erl
All 47 tests passed.
Cover analysis: /usr/home/tatsuya/workhub/dev/hibari/hibari/lib/bear/.eunit/index.html
--------------------------------------------------

Thanks,
Tatsuya Kawano

From isreal-erlang-bugs-at-erlang.org@REDACTED Fri Dec 18 15:19:45 2015
From: isreal-erlang-bugs-at-erlang.org@REDACTED (David Buckley)
Date: Fri, 18 Dec 2015 14:19:45 +0000
Subject: [erlang-bugs] NIF .so reload issues
Message-ID: <20151218141945.GA3897@cirno.bucko.me.uk>

Hi! I was playing with writing a NIF, and found I couldn't reload.

I'm doing the sort-of accepted thing of loading the nif in an on_load
function, though if I just execute the function just after load, I get
the same behaviour, so I don't think that's at issue.

Basically, what seems to be the case is that while erlang will
re-initialise my nif code (with 'upgrade'), it won't load a /new/
version of the nif code unless I completely purge the (erlang) code from
the runtime, forcing erlang to recheck the module. I'm guessing erlang
is caching the nif. Changing the compiled (.so) filename each time fixes
the problem.

Example code here:

https://gist.github.com/bucko909/a3b5099c74bf267e65db

test_reload_post_purge and test_reload_post_reload_complete_purge work
fine (erts-7.1), but the other three don't reload the .so file as I
would expect.

Is this fixable, or must I manually add a purge() in my init() function
before load_nif? (And why does that work? Because at that point there's
no evidence that the new module will have a load_nif, so the old dlopen
can be discarded?)

Seems like in general if the .so file has changed and a module is
reloaded, the user probably wants the new .so file, too! It's at least
worth adding a note to the docs (or a new return value?) if it's an evil
dlopen restriction.

--
David Buckley

From sverker.eriksson@REDACTED Fri Dec 18 17:41:24 2015
From: sverker.eriksson@REDACTED (Sverker Eriksson)
Date: Fri, 18 Dec 2015 17:41:24 +0100
Subject: [erlang-bugs] NIF .so reload issues
In-Reply-To: <20151218141945.GA3897@cirno.bucko.me.uk>
References: <20151218141945.GA3897@cirno.bucko.me.uk>
Message-ID: <56743734.3040509@ericsson.com>

Hi David,

Yes, this is a dlopen restriction and also an ambiguity as I've heard
different behaviour reported depending on OS.

My Linux man page for dlopen says "If the same library is loaded again
with dlopen(), the same file handle is returned". But it does not specify
what "the same" actually means.

The Erlang VM has to keep the old .so file loaded until the module is
safely purged [*] as there may exist Erlang processes still lingering in
the old code. Trying to execute unloaded native code does not behave well.

When you call load_nif with the same library name (as the
not yet purged one), dlopen thinks it's "the same" library
and just returns the same handle again.

What to do?

Rename the .so library, give it a version number. Or maybe
putting it in a different directory will work (?).

Add something about this problem to the erl_nif docs. Yes that would be
nice.

I'm hesitant to recommend purging in on_load. The on_load feature
is still experimental and we have some known problems with bad
behaviour, especially in the error cases when on_load fails.
To fix that we may have to limit what you are allowed
to do in on_load and code purging might be such a limitation.

[*] Purging may actually not be enough. If the NIF library has created
resource objects with a destructor callback, it will not be unloaded until
the last resource object has been garbage collected.

/Sverker, Erlang/OTP

From isreal-erlang-bugs-at-erlang.org@REDACTED Fri Dec 18 18:07:33 2015
From: isreal-erlang-bugs-at-erlang.org@REDACTED (David Buckley)
Date: Fri, 18 Dec 2015 17:07:33 +0000
Subject: [erlang-bugs] NIF .so reload issues
In-Reply-To: <56743734.3040509@ericsson.com>
References: <20151218141945.GA3897@cirno.bucko.me.uk> <56743734.3040509@ericsson.com>
Message-ID: <20151218170733.GA10347@cirno.bucko.me.uk>

On Fri, Dec 18, 2015 at 05:41:24PM +0100, Sverker Eriksson wrote:
> [*] Purging may actually not be enough. If the NIF library has created
> resource objects with a destructor callback, it will not be unloaded until
> the last resource object has been garbage collected.

Hmmm, I was going to create resources!

I guess for development I'll add a hack that just creates a link to the
file with a temporary name before loading it, so that a new handle to it
is created each time. There /is/ a secret RTLD_PRIVATE flag for dlopen
-- that is not apparently supported on any OS mentioned on the first
page of google -- to get a private instance.

For production, versioning the library code ought to be fine. Most
system libraries already contain version numbers in the filename, and I
suppose this is part of why. It's only reloading for rapid development
that is causing pain here!

Is the old dlopen bound to the old (Erlang) code? That is, if I
instigate this hack, and leak resources somehow while reloading often,
will I have problems reloading the module, cause processes to be
violently uprooted as with purge, or simply leak dlopen handles until I
clean up?

Is there any chance of purge/soft_purge being extended to cover nif
resources?

--
David Buckley

From sverker.eriksson@REDACTED Fri Dec 18 18:55:27 2015
From: sverker.eriksson@REDACTED (Sverker Eriksson)
Date: Fri, 18 Dec 2015 18:55:27 +0100
Subject: [erlang-bugs] NIF .so reload issues
In-Reply-To: <20151218170733.GA10347@cirno.bucko.me.uk>
References: <20151218141945.GA3897@cirno.bucko.me.uk> <56743734.3040509@ericsson.com> <20151218170733.GA10347@cirno.bucko.me.uk>
Message-ID: <5674488F.2080809@ericsson.com>

On 12/18/2015 06:07 PM, David Buckley wrote:
> Hmmm, I was going to create resources!
>
> I guess for development I'll add a hack that just creates a link to the
> file with a temporary name before loading it, so that a new handle to it
> is created each time.

I'm not sure dlopen is fooled by a link. You may need to make a real copy.

> There /is/ a secret RTLD_PRIVATE flag for dlopen
> -- that is not apparently supported on any OS mentioned on the first
> page of google -- to get a private instance.
>
> For production, versioning the library code ought to be fine. Most
> system libraries already contain version numbers in the filename, and I
> suppose this is part of why. It's only reloading for rapid development
> that is causing pain here!
>
> Is the old dlopen bound to the old (Erlang) code?

Yes. A NIF library is like an extension of the Erlang code that loaded it.

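Sverker's suggestion to make a real copy rather than a link could look roughly like this in C. It is only a sketch of the development-time hack under discussion: `copy_file` is a hypothetical helper, the destination name is left to the caller (for example the original name with a counter or timestamp appended), and whether a given platform's dlopen then treats the copy as a distinct library is exactly the OS-dependent ambiguity mentioned above.

```c
#include <stdio.h>

/* Copy src to dst byte for byte. Returns 0 on success, -1 on failure. */
static int copy_file(const char *src, const char *dst)
{
    FILE *in = fopen(src, "rb");
    if (in == NULL)
        return -1;

    FILE *out = fopen(dst, "wb");
    if (out == NULL) {
        fclose(in);
        return -1;
    }

    char buf[4096];
    size_t n;
    int status = 0;

    /* Stream the file in fixed-size chunks until EOF or an error. */
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) {
            status = -1;
            break;
        }
    }
    if (ferror(in))
        status = -1;

    fclose(in);
    /* fclose flushes buffered data, so its result matters for the copy. */
    if (fclose(out) != 0)
        status = -1;
    return status;
}
```

The Erlang side would then pass the path of the fresh copy to load_nif, so that each reload during development refers to a file name the loader has not seen before.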
> That is, if I
> instigate this hack, and leak resources somehow while reloading often,
> will I have problems reloading the module, cause processes to be
> violently uprooted as with purge, or simply leak dlopen handles until I
> clean up?

If you leak resources then you will also leak the loaded libraries
that contain the destructor functions of those resources.

However, there is a way for your upgraded NIF library to take over
ownership of existing resources by passing the ERL_NIF_RT_TAKEOVER flag
to enif_open_resource_type(). By doing that, the destructor in your new
library will be called instead and the old library can be unloaded when
the module is purged. Your new library versions must of course be data
compatible and know how to handle old resources.

>
> Is there any chance of purge/soft_purge being extended to cover nif
> resources?
>

Oh, that's a good question. Why don't we do that already?
I have to think about that.

/Sverker, Erlang/OTP

From isreal-erlang-bugs-at-erlang.org@REDACTED Sun Dec 20 20:48:42 2015
From: isreal-erlang-bugs-at-erlang.org@REDACTED (David Buckley)
Date: Sun, 20 Dec 2015 19:48:42 +0000
Subject: [erlang-bugs] NIF resources are not checked on module unload
Message-ID: <20151220194841.GA18879@cirno.bucko.me.uk>

While playing with implementing a NIF, I found some segfaults, and I
eventually got it down to the test case here:

https://gist.github.com/bucko909/a841c716ede6d3903a13

It looks like it's down to my not re-registering the resource on upgrade
(presumably the handle goes stale, is garbage collected, and eventually
it corrupts memory causing segfaults in unrelated emulator code).

I fell into this trap by using code from
https://github.com/davisp/nif-examples -- which I've sent a pull request
to fix.

I fixed my problem by adding enif_open_resource_type to the upgrade
function once I'd clocked my error, so under normal and correct use, I
think the emulator is doing OK.

However, it looks like if I /don't/ reopen it, it's not properly
deleted, and the documentation seems to leave open the possibility of
doing just this ("Existing resource objects, of a module that is
upgraded, must either be deleted or taken over by the new NIF library").
References to resources with the old handle remain uncleaned. Even if I
completely destroy the old module, so that unload is called, these stale
resources persist until a garbage collection. They actually survive
/many/ purge/load cycles in my example code before being garbage
collected and segfaulting the emulator.

Ideas, based on my interpretation of the bug:

If there are lingering resources, which are not TAKEOVER-ed in the
upgrade function, and have a dtor, this should cause an immediate
emulator panic. I can't think of any other behaviour which is safe here.

If they don't have a dtor, it seems safe to keep them around, but their
resource handle needs to be kept alive until they are all destroyed. It
ought to be impossible to create new resources using the old handle, at
least when there is a dtor defined (can a 'dead' flag be set?).

Knowing this behaviour, an application author writing an upgrade
function for this NIF library might at least attempt to destroy all of
his objects when making such an upgrade, in order to have the emulator
survive!

Another approach is to require an /explicit/ delete of old resources,
perhaps simply a call to "enif_delete_unused_resources" or an iteration
of "enif_delete_resource" over "enif_list_resources", and have this call
fail where the old resources are still allocated. Perhaps the library
author could force a purge or panic the emulator themselves at this
point.

The emulator should panic if a resource is neither deleted nor reopened
with TAKEOVER.

--
David Buckley

From ulf@REDACTED Wed Dec 23 18:38:10 2015
From: ulf@REDACTED (Ulf Wiger)
Date: Wed, 23 Dec 2015 18:38:10 +0100
Subject: [erlang-bugs] SSL handshake crash
Message-ID:

Hmm?
I sent this to erlang-bugs, but it didn't seem to get through.

When connecting some Android software to an Erlang node using TLS, we sometimes (about 1 in 3 or 4 times) get the following errors:

2015-12-22 15:31:00.772 [error] <0.210.0> gen_fsm <0.210.0> in state hello terminated with reason: no function clause matching ssl_handshake:update_handshake_history(undefined, <<1,0,0,175,3,1,86,121,221,42,209,19,198,53,3,42,92,9,16,158,197,5,169,29,247,96,14,32,123,176,...>>) line 450

15:31:00.783dlink_tls_conn:terminate(): Reason: {{function_clause,[{ssl_handshake,update_handshake_history,[undefined,<<1,0,0,175,3,1,86,121,221,42,209,19,198,53,3,42,92,9,16,158,197,5,169,29,247,96,14,32,123,176,109,210,170,150,204,23,32,228,0,0,70,0,4,0,5,0,47,0,53,192,2,192,4,192,5,192,12,192,14,192,15,192,7,192,9,192,10,192,17,192,19,192,20,0,51,0,57,0,50,0,56,0,10,192,3,192,13,192,8,192,18,0,22,0,19,0,9,0,21,0,18,0,3,0,8,0,20,0,17,0,255,1,0,0,64,0,11,0,4,3,0,1,2,0,10,0,52,0,50,0,14,0,13,0,25,0,11,0,12,0,24,0,9,0,10,0,22,0,23,0,8,0,6,0,7,0,20,0,21,0,4,0,5,0,18,0,19,0,1,0,2,0,3,0,15,0,16,0,17>>],[{file,"ssl_handshake.erl"},{line,450}]},{tls_connection,'-next_state/4-fun-0-',3,[{file,"tls_connection.erl"},{line,458}]},{tls_connection,next_state,4,[{file,"tls_connection.erl"},{line,467}]},{gen_fsm,handle_msg,7,[{file,"gen_fsm.erl"},{line,518}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]},{gen_fsm,sync_send_all_state_event,[<0.210.0>,{start,infinity},infinity]}}

2015-12-22 15:31:00.784 [error] <0.210.0> CRASH REPORT Process <0.210.0> with 0 neighbours exited with reason: no function clause matching ssl_handshake:update_handshake_history(undefined, <<1,0,0,175,3,1,86,121,221,42,209,19,198,53,3,42,92,9,16,158,197,5,169,29,247,96,14,32,123,176,...>>) line 450 in gen_fsm:terminate/7 line 626

2015-12-22 15:31:00.785 [error] <0.209.0> gen_server <0.209.0> terminated with reason: {{function_clause,[{ssl_handshake,update_handshake_history,[undefined,<<1,0,0,175,3,1,86,121,221,42,209,19,198,53,3,42,92,9,16,158,197,5,169,29,247,96,14,32,123,176,109,210,170,150,204,23,32,228,0,0,70,0,4,0,5,0,47,0,53,192,2,192,4,192,5,192,12,192,14,192,15,192,7,192,9,192,10,192,17,192,19,192,20,0,51,0,57,0,50,0,56,0,10,192,3,192,13,192,8,192,18,0,22,0,19,0,9,0,21,0,18,0,3,0,8,0,20,0,17,0,255,1,0,0,64,0,11,0,4,3,0,1,2,0,10,0,52,0,50,0,14,0,13,0,25,0,11,0,12,0,24,0,9,0,10,0,22,0,23,0,8,0,...>>],...},...]},...} in gen_fsm:sync_send_all_state_event/3 line 257

2015-12-22 15:31:00.786 [error] <0.209.0> CRASH REPORT Process <0.209.0> with 0 neighbours exited with reason: {{function_clause,[{ssl_handshake,update_handshake_history,[undefined,<<1,0,0,175,3,1,86,121,221,42,209,19,198,53,3,42,92,9,16,158,197,5,169,29,247,96,14,32,123,176,109,210,170,150,204,23,32,228,0,0,70,0,4,0,5,0,47,0,53,192,2,192,4,192,5,192,12,192,14,192,15,192,7,192,9,192,10,192,17,192,19,192,20,0,51,0,57,0,50,0,56,0,10,192,3,192,13,192,8,192,18,0,22,0,19,0,9,0,21,0,18,0,3,0,8,0,20,0,17,0,255,1,0,0,64,0,11,0,4,3,0,1,2,0,10,0,52,0,50,0,14,0,13,0,25,0,11,0,12,0,24,0,9,0,10,0,22,0,23,0,8,0,...>>],...},...]},...} in gen_server:terminate/7 line 826

2015-12-22 15:31:00.787 [error] <0.109.0> Supervisor tls_connection_sup had child undefined started with {tls_connection,start_link,undefined} at <0.210.0> exit with reason no function clause matching ssl_handshake:update_handshake_history(undefined, <<1,0,0,175,3,1,86,121,221,42,209,19,198,53,3,42,92,9,16,158,197,5,169,29,247,96,14,32,123,176,...>>) line 450 in context child_terminated

We run OTP Erlang/OTP 18 [erts-7.2] with ssl-7.2, and the erlang side has the following options:

[{verify,verify_peer},
 {certfile,"/home/.../device_cert.crt"},
 {keyfile,"/home/.../device_key.pem"},
 {cacertfile,"/home/.../root_cert.crt"},
 {verify_fun,{#Fun,{'RSAPublicKey',...}}},
 {partial_chain,#Fun}]

Basically, the verify_fun validates a self-signed cert
https://github.com/PDXostc/rvi_core/blob/develop/components/dlink_tls/src/dlink_tls_conn.erl#L393

and the partial_chain fun most likely does much less than it should

https://github.com/PDXostc/rvi_core/blob/develop/components/dlink_tls/src/dlink_tls_conn.erl#L421

On the Android side, we're using Android 4.4.2 (API 19).

It feels like a timing-related problem on the erlang side.

Let me know if you need more information.

BR,
Ulf W

From z@REDACTED Thu Dec 24 10:22:24 2015
From: z@REDACTED (Danil Zagoskin)
Date: Thu, 24 Dec 2015 12:22:24 +0300
Subject: [erlang-bugs] SSL handshake crash
In-Reply-To:
References:
Message-ID:

Hi!

I have the same issue, but not so often.
It seems to appear only when upgrading plain socket to TLS (XMPP starttls in my case).

Possibly it's some kind of race condition when client sends TLS hello before server does ssl_accept(). Maybe some active/passive socket mode issue.

If you control the client code, could you add some sleep before starttls and check if that fixes the issue?

--
Danil Zagoskin | z@REDACTED

From Ingela.Anderton.Andin@REDACTED Fri Dec 25 12:54:58 2015
From: Ingela.Anderton.Andin@REDACTED (Ingela Anderton Andin)
Date: Fri, 25 Dec 2015 11:54:58 +0000
Subject: [erlang-bugs] SSL handshake crash
In-Reply-To:
References: ,
Message-ID:

Hi!

From ssl users guide

"Ensure active is set to false before trying to upgrade a connection to an SSL connection, otherwise SSL handshake messages can be delivered to the wrong process."

Regards Ingela Erlang/OTP team - Ericsson AB

From ulf@REDACTED Fri Dec 25 21:40:09 2015
From: ulf@REDACTED (Ulf Wiger)
Date: Fri, 25 Dec 2015 21:40:09 +0100
Subject: [erlang-bugs] SSL handshake crash
In-Reply-To:
References:
Message-ID:

Hi Ingela,

'active' should be set to false:

https://github.com/PDXostc/rvi_core/blob/develop/components/dlink_tls/src/dlink_tls_conn.erl#L346

BR,
Ulf W

From ulf@REDACTED Sat Dec 26 21:18:42 2015
From: ulf@REDACTED (Ulf Wiger)
Date: Sat, 26 Dec 2015 21:18:42 +0100
Subject: [erlang-bugs] SSL handshake crash
In-Reply-To:
References:
Message-ID:

To clarify, as far as I can tell, the code in question does set 'active' to false.

BR,
Ulf W
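
For readers following the thread, the users guide advice that Ingela quotes can be sketched roughly as below. This is only an illustration under assumptions, not the code from rvi_core and not a confirmed fix for the crash (Ulf reports that 'active' is already false in his case). The module name tls_upgrade_sketch and the function accept_then_upgrade/2 are hypothetical; SslOpts stands for whatever certfile/keyfile/cacertfile list the server normally passes. The calls used (gen_tcp:listen/2, gen_tcp:accept/1, inet:setopts/2, ssl:ssl_accept/2) are the OTP 18 era API that the thread refers to.

```erlang
-module(tls_upgrade_sketch).
-export([accept_then_upgrade/2]).

%% Hypothetical helper: accept one plain TCP connection on Port and
%% upgrade it to TLS with the given ssl options.
accept_then_upgrade(Port, SslOpts) ->
    {ok, _Started} = application:ensure_all_started(ssl),
    %% Listen in passive mode so the accepted socket starts out passive too.
    {ok, Listen} = gen_tcp:listen(Port, [binary,
                                         {reuseaddr, true},
                                         {active, false}]),
    {ok, Socket} = gen_tcp:accept(Listen),
    ok = gen_tcp:close(Listen),
    %% Make the requirement explicit: with {active, false} the client's
    %% TLS hello stays in the socket buffer instead of being delivered
    %% as a message to whichever process currently owns the socket.
    ok = inet:setopts(Socket, [{active, false}]),
    %% Any plaintext negotiation (for example an XMPP starttls exchange)
    %% would happen here using gen_tcp:recv/2 and gen_tcp:send/2.
    %% Server-side handshake on the already connected TCP socket.
    case ssl:ssl_accept(Socket, SslOpts) of
        {ok, TlsSocket} ->
            {ok, TlsSocket};
        {error, Reason} ->
            {error, Reason}
    end.
```

A call would look like tls_upgrade_sketch:accept_then_upgrade(9999, [{certfile, "cert.pem"}, {keyfile, "key.pem"}]), where the port and file names are placeholders. Later OTP releases expose the same upgrade through ssl:handshake/2, so the exact function name depends on the release in use.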