[erlang-patches] release_handler_1 improvements

Wed Sep 14 12:14:46 CEST 2011

Hi Joe - you can just sit back and relax :) Your patch will go it's way
through pu, then opu and finally into the development branch for R14B04.
Regards
/siri

2011/9/14 Joe Williams <joe@REDACTED>

>  Siri,
>
> Anything else need to be done to graduate/merge this patch?
>
> Thanks.
> -Joe
>
>
> --
> Name: Joseph A. Williams
> Email: joe@REDACTED
> Blog: http://www.joeandmotorboat.com/
> Twitter: http://twitter.com/williamsjoe
>
> On Tuesday, September 6, 2011 at 8:51 AM, Joe Williams wrote:
>
>  Great, let me know if any other changes are needed.
>
> Thanks!
>
> -Joe
>
>
> --
> Name: Joseph A. Williams
> Email: joe@REDACTED
> Blog: http://www.joeandmotorboat.com/
> Twitter: http://twitter.com/williamsjoe
>
> On Tuesday, September 6, 2011 at 12:45 AM, Siri Hansen wrote:
>
> Hi Joe - looks good now! :)
> Thanks for your effort!
> /siri
>
> 2011/9/5 Joe Williams <joe@REDACTED>
>
>  Siri,
>
> Please refetch and let me know if the test works for you now.
>
> Thanks!
>
> -Joe
>
>
> --
> Name: Joseph A. Williams
> Email: joe@REDACTED
> Blog: http://www.joeandmotorboat.com/
> Twitter: http://twitter.com/williamsjoe
>
> On Monday, September 5, 2011 at 2:32 AM, Siri Hansen wrote:
>
> Hi Joe - had a look at your changes... and it turned out to be quite
> interesting...
>
> First, I saw that you hadn't updated the test suite with the new exit
> reason - but when I executed the test it did not fail... :o
>
> It turned out that you catched the call to ?t:fail/1 - which actually only
> does an exit... I assume you meant "case catch" and not "catch case"...
> However, catching this is not at all necessary here, since it will be
> catched by the rpc:call anyway - it will return {badrpc, {'EXIT',
> {suspended_supervisor, _}}} .
>
> Also, two other comments on the test case:
>
> * when the call to get_supervised_procs return something unexpected -
> please print the return value in the test log (using e.g. ct:log/2) in order
> to make trouble shooting easier
> * stopping the node could be moved outside of the test case, in order to
> avoid hanging nodes if the test fails. This you can do by adding something
> like
>
> supervisor_which_children_timeout(cleanup, Conf) ->
>     stop_node(node_name(supervisor_which_children_timeout)).
>
> This will then be called during end_per_testcase. I see that there are
> other test cases that don't do this, and I will clean up that before the
> next release.
>
> Regards
> /siri
>
>
> 2011/9/1 Joe Williams <joe@REDACTED>
>
>  Seems good to me, I went ahead and made the changes.
>
> Please refetch and let me know if there is anything else.
>
> -Joe
>
>
> --
> Name: Joseph A. Williams
> Email: joe@REDACTED
> Blog: http://www.joeandmotorboat.com/
> Twitter: http://twitter.com/williamsjoe
>
> On Thursday, September 1, 2011 at 12:59 AM, Siri Hansen wrote:
>
> Thanks Joe! This looks very good. I have two minor comments only:
>
> 1) Maybe you could change the first line of the commit message to say
>
>     General improvements to release_handler_1:get_supervised_procs
>
> or
>
>     General improvements to get_supervised_procs in release_handler_1
>
> 2) And when I looked at the tests it occured to me that it might be a good
> idea to change the error reasons to something more descriptive than
> 'release_handler_error'. What do you think? I know that there will be error
> messages in the shell, but I find it better if even the returned error
> reason gives a hint about what failed. My suggestions for the three calls to
> error/1 respectively (please change if you have better ideas):
>
> suspended_supervisor
> {which_children_failed,Other}
> {get_modules_failed,Other}
>
> Please let me know what you think about this.
>
> Regards
> /siri
>
> 2011/9/1 Joe Williams <joe@REDACTED>
>
>  Siri,
>
> I have added tests and squashed my commits, rewriting the commit message
> while I was at it. Please refetch and let me know if this is up to your
> standards for graduation.
>
> -Joe
>
>
> --
> Name: Joseph A. Williams
> Email: joe@REDACTED
> Blog: http://www.joeandmotorboat.com/
> Twitter: http://twitter.com/williamsjoe
>
> On Tuesday, August 30, 2011 at 11:41 PM, Siri Hansen wrote:
>
> Hi Joe - Yes, you can put your application inside
> release_handler_SUITE_data/lib. And hopefully you can use the
> functions create_and_install_fake_first_release
> and create_fake_upgrade_release - you can see examples of that in
> release_handler_SUITE.erl. Please let me know if you need any more advice on
> the testing.
> Regards
> /siri
>
>
> 2011/8/30 Joe Williams <joe@REDACTED>
>
>  Siri,
>
> I could use some advice on how best to test this code. Currently I have a
> little dummy application with the proper supervision tree to trigger this
> upgrade case. Here are the steps I use to reproduce:
>
> https://gist.github.com/da109fb6939ef7aac031
>
> Note that this is using my topic branch so you can see the additional
> logging.
>
> I noticed that there is a release_handler_SUITE_data directory in the sasl
> tests, should I just add another directory specifically for this and store
> the dummy application there for testing?
>
> Thanks.
>
> -Joe
>
>
> --
> Name: Joseph A. Williams
> Email: joe@REDACTED
> Blog: http://www.joeandmotorboat.com/
> Twitter: http://twitter.com/williamsjoe
>
> On Tuesday, August 30, 2011 at 1:44 AM, Siri Hansen wrote:
>
> Great Joe - this is much better :)
> Could you please add some tests for this also?
> Thanks
> /siri
>
> 2011/8/29 Joe Williams <joe@REDACTED>
>
> Siri,
>
> Please fetch this branch again.
>
> I have added errors where I had my functions returning empty lists. I
> believe this should bubble up to release_handler causing a restart similar
> to the timeout behavior we currently have.
>
> -Joe
>
>
> --
> Name: Joseph A. Williams
> Email: joe@REDACTED
> Blog: http://www.joeandmotorboat.com/
> Twitter: http://twitter.com/williamsjoe
>
> On Monday, August 29, 2011 at 9:35 AM, Joe Williams wrote:
>
>  Siri,
>
> In case #2 the node would be in an "unpacked" state but perhaps that isn't
> possible since the upgrade may be partially installed already. I'll work on
> implementing #1 and reply back soon.
>
> -Joe
>
>
> --
> Name: Joseph A. Williams
> Email: joe@REDACTED
> Blog: http://www.joeandmotorboat.com/
> Twitter: http://twitter.com/williamsjoe
>
> On Monday, August 29, 2011 at 7:31 AM, Siri Hansen wrote:
>
> Hi Joe -
> I think I would prefer solution 1), although that's probably mostly because
> I don't really understand solution 2)... What do you mean by "stop the
> upgrade from completing"? in which state would the node be after this?
> /siri
>
> 2011/8/26 Joe Williams <joe@REDACTED>
>
>  Siri,
>
> That sounds correct, with the current patch there is that risk. In my case
> I would see the error message post-upgrade and restart things as needed but
> I certainly see your point. The VM restarting is a brutal but idiomatic way
> to deal with this issue, let it fail :).
>
> I think there are two possibilities here, 1) continue with the restart
> behavior but make sure we print error messages before we do or 2) print
> error messages but stop the upgrade from completing if we catch the bad
> case. Thoughts?
>
> -Joe
>
> --
> Name: Joseph A. Williams
> Email: joe@REDACTED
> Blog: http://www.joeandmotorboat.com/
> Twitter: http://twitter.com/williamsjoe
>
> On Friday, August 26, 2011 at 1:08 AM, Siri Hansen wrote:
>
> Hi again, Joe!
>
> Sorry for being so slow - but I still don't really understand :(
> My concerns are really about whether or not we should allow the upgrade to
> be performed in this case. For sure I think we should
>
> 1) avoid the timeout, and
> 2) let the user know what the problem is
>
> but is it correct to let the upgrade pass after this? Is it not an error
> situation?
>
>  It seems to me that we risk getting into a situation where we believe
> that the system is upgraded, but in fact there could be branches of the
> supervisor tree where process have not had the chance to run their
> code_change functions. I mean - even if we print the error report, there is
> no guarantee that it is really detected unless the operation actually fails.
>
> Please correct me if I completely misunderstood the situation.
>
> Regards
> /siri
>
>
> 2011/8/25 Joe Williams <joe@REDACTED>
>
>  Siri,
>
> I ran into two issues that this patch addresses. Check out the commit
> message at
> https://github.com/joewilliams/otp/commit/9c3a53789326cdd929f1c3b4525716b1c0abfe87 for
> the details. In both cases I found that in production an error in the logs
> was preferable to the restart of the VM since both are easily fixable with a
> small application change or in the case of the suspended supervisor using a
> different app up. Also see this comment in release_handler_1 regarding the
> supervisor,
> https://github.com/erlang/otp/blob/dev/lib/sasl/src/release_handler_1.erl#L454 which
> suggests this corner case is known by at least a few people. Currently there
> is no way to know *why* your VM just restarted after the upgrade in either
> case.
>
> Let me know if you have any other questions.
>
> -Joe
>
>
> --
> Name: Joseph A. Williams
> Email: joe@REDACTED
> Blog: http://www.joeandmotorboat.com/
> Twitter: http://twitter.com/williamsjoe
>
> On Thursday, August 25, 2011 at 6:35 AM, Siri Hansen wrote:
>
> Hi again, Joe!
>
> Could you please explain a bit about the situation where you discovered
> this problem? I agree that the timeout and VM restart is not very good, and
> it makes sense to check if the supervisor is suspended. But I'm not really
> sure if it is correct to allow the upgrade to continue when this error
> occurs. Even if an error message is printed, I guess it could be quite easy
> to miss this fact... and the question is if that would be a problem or not?
> Why is the supervisor suspended in the first place?
>
> Regards
> /siri
>
>
> 2011/8/25 Siri Hansen <erlangsiri@REDACTED>
>
>     Hi Joe - I've just started looking at this. Do you think it would be
> possible to add a test case for it?
> Regards
> /siri
>
>
> 2011/8/24 Joe Williams <joe@REDACTED>
>
>  Anything I can do regarding this patch? I have happily been running it in
> production since I submitted it to the list in June.
>
> -Joe
>
>
> --
> Name: Joseph A. Williams
> Email: joe@REDACTED
> Blog: http://www.joeandmotorboat.com/
> Twitter: http://twitter.com/williamsjoe
>
> On Wednesday, July 6, 2011 at 3:43 PM, Joe Williams wrote:
>
>  Anything I can do to help this patch graduate?
>
> Thanks!
>
> -Joe
>
>
> --
> Name: Joseph A. Williams
> Email: joe@REDACTED
> Blog: http://www.joeandmotorboat.com/
> Twitter: http://twitter.com/williamsjoe
>
> On Tuesday, June 14, 2011 at 12:26 PM, Joe Williams wrote:
>
>  Updated this branch, please refetch.
>
> git fetch git://github.com/joewilliams/otp.git release_handler_1
>
>
> --
> Name: Joseph A. Williams
> Email: joe@REDACTED
> Blog: http://www.joeandmotorboat.com/
> Twitter: http://twitter.com/williamsjoe
>
> On Friday, June 10, 2011 at 8:52 AM, Joe Williams wrote:
>
>  Great, thanks!
>
>
> --
> Name: Joseph A. Williams
> Email: joe@REDACTED
> Blog: http://www.joeandmotorboat.com/
> Twitter: http://twitter.com/williamsjoe
>
> On Friday, June 10, 2011 at 8:51 AM, Raimo Niskanen wrote:
>
> On Thu, Jun 09, 2011 at 08:20:51AM -0700, Joe Williams wrote:
>
> Please fetch:
>
> git fetch git://github.com/joewilliams/otp.git release_handler_1
>
> This is a different branch with a better commit message and no white space
> changes.
>
>
> Excellent. I will include your patch in 'pu' after rewording the
> summary line to imperative form.
>
>
>
>
> --
> Name: Joseph A. Williams
> Email: joe@REDACTED
> Blog: http://www.joeandmotorboat.com/
> Twitter: http://twitter.com/williamsjoe
>
>
> On Thursday, June 9, 2011 at 7:44 AM, Joe Williams wrote:
>
> Nothing specific, just wondered if anyone had any thoughts on how I dealt
> with a couple of corner cases in installing releases.
>
> I'll fix things up and get back shortly.
>
> --
> Name: Joseph A. Williams
> Email: joe@REDACTED (mailto:joe@REDACTED <joe@REDACTED>)
> Blog: http://www.joeandmotorboat.com/
> Twitter: http://twitter.com/williamsjoe
>
>
> On Thursday, June 9, 2011 at 12:11 AM, Raimo Niskanen wrote:
>
> On Wed, Jun 08, 2011 at 03:41:37PM -0700, Joe Williams wrote:
>
> Any thoughts/feedback on this patch? I realize it doesn't follow the
> guidelines (https://github.com/erlang/otp/wiki/Submitting-patches) exactly
> and will clean it up soon.
>
>
> Anything in particular? I just got caught up in tideous merge work
> yesterday and missed to include your patch in 'pu', I was about
> to take it now.
>
> But if you have a cleanup I can wait for it...
>
>
>
> --
> Name: Joseph A. Williams
> Email: joe@REDACTED (mailto:joe@REDACTED <joe@REDACTED>)
> Blog: http://www.joeandmotorboat.com/
> Twitter: http://twitter.com/williamsjoe
>
>
> On Tuesday, June 7, 2011 at 2:33 PM, Joe Williams wrote:
>
> git fetch git://github.com/joewilliams/otp.git (
> http://github.com/joewilliams/otp.git) (
> http://github.com/joewilliams/otp.git) release_handler
>
>
> _______________________________________________
> erlang-patches mailing list
> erlang-patches@REDACTED (mailto:erlang-patches@REDACTED<erlang-patches@REDACTED>
> )
> http://erlang.org/mailman/listinfo/erlang-patches
>
>
>
> --
>
> / Raimo Niskanen, Erlang/OTP, Ericsson AB
>
>
> _______________________________________________
> erlang-patches mailing list
> erlang-patches@REDACTED (mailto:erlang-patches@REDACTED<erlang-patches@REDACTED>
> )
> http://erlang.org/mailman/listinfo/erlang-patches
>
>
> --
>
> / Raimo Niskanen, Erlang/OTP, Ericsson AB
>
>
>
>
>
>
> _______________________________________________
> erlang-patches mailing list
> erlang-patches@REDACTED
> http://erlang.org/mailman/listinfo/erlang-patches
>
>
>
>
>
>
>
>
>  _______________________________________________
> erlang-patches mailing list
> erlang-patches@REDACTED
> http://erlang.org/mailman/listinfo/erlang-patches
>
>
>
>
>
>
>
>
>
>
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-patches/attachments/20110914/62ab0e71/attachment.htm>