[erlang-questions] Architectural quandaries

Tue Sep 16 12:14:09 CEST 2014

Hi,

On Mon, Sep 15, 2014 at 10:00 PM, Jay Nelson <jay@REDACTED> wrote:
> David N. Welton wrote:
>
>> So you would advocate putting everything in one Erlang 'application'
>> in order to take advantage of the restart capabilities such as
>> rest_for_one?  I was actually moving to break things up into separate
>> applications with different git trees and everything, so that things
>> could be developed in a more independent way: for instance, the report
>> generation software gets its own application, separate from the
>> hardware control software.

> Don’t confuse development with operations. If you like, you can make
> separate applications in separate github repos. That may make it easier
> to test each component. I would do that and have separate PropEr
> test suites run by common_test for each one (that’s my current style).
>
> It means managing separate repos, but if the components are generally
> useful, it makes it convenient for others.
>
> In operations I would have one application that uses included_applications and
> starts the root supervisor of each of your other components in the correct
> dependency and startup sequence. Especially if you are writing all the
> components, you will be intimately familiar with the dependencies and
> start up behaviour of each one. Of course, this overlord application is the
> real application you started talking about and would be a separate repo
> of its own.

Aha!  I had missed included_applications, and indeed, that looks like
a potentially good way of having both the separate applications as
well as the supervision tree.

It seems that not everyone is in favor of these:
http://learnyousomeerlang.com/the-count-of-applications#included-applications
- and I can see that more tightly coupling things is potentially
problematic.  Realistically though, a lot of our code won't be used
without all the other things present either.
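
For the record, a minimal sketch of how I understand the
included_applications arrangement Jay describes.  All of the names
here (overlord, hardware_control, report_gen and their *_sup modules)
are made up for illustration, and the restart intensity is arbitrary.
Included applications are loaded but not started by the application
controller, so the top application's supervisor has to start their
root supervisors itself:

```erlang
%% Hypothetical overlord.app.src for the top-level application:
%%
%% {application, overlord,
%%  [{description, "Owns the whole supervision tree"},
%%   {vsn, "0.1.0"},
%%   {registered, [overlord_sup]},
%%   {applications, [kernel, stdlib]},
%%   {included_applications, [hardware_control, report_gen]},
%%   {mod, {overlord_app, []}}]}.

-module(overlord_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% rest_for_one: when a child terminates, the children started
    %% after it are terminated and restarted along with it.
    %% At most 5 restarts in 10 seconds (arbitrary numbers).
    SupFlags = {rest_for_one, 5, 10},
    %% Start order is dependency order: hardware first, reports second.
    %% Both *_sup modules are assumed to exist in the included
    %% applications and to export start_link/0.
    Children =
        [{hardware_control_sup,
          {hardware_control_sup, start_link, []},
          permanent, infinity, supervisor, [hardware_control_sup]},
         {report_gen_sup,
          {report_gen_sup, start_link, []},
          permanent, infinity, supervisor, [report_gen_sup]}],
    {ok, {SupFlags, Children}}.
```

The separate repos stay separate; only this top-level application
knows about the startup order.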

> If you have several applications, rather than using included_applications,
> you will have the possibility of a component failure which is undetected
> and will not restart without manually restarting or writing your own code
> to monitor and manage them.

>> I was starting to think along the lines of a centralized system for
>> monitoring some of these applications...
>
> Hmm. I prefer to use the OTP tools that are present, and use them to my
> benefit to avoid such circumstances. Splitting into independent applications
> defeats all the restart facilities of OTP, unless you use heart and make
> them all permanent applications and are willing to wait for VM restarts
> when things start to go sideways…

Yes, that's part of what I'm after: how to keep things within OTP as
much as possible.

After thinking things through some, though, and after Fred Hébert
kindly took the time to discuss some of this with me on #erlang, I
have come to the conclusion that:

OTP alone is not up to the task - there has to be some kind of extra
layer or extra logic in there to deal with systems that might not be
functioning.

Perhaps this provokes a reaction in the reader along the lines of "he
has a firm grasp of the obvious", but after drinking the OTP Kool-Aid,
going outside it feels like "I wonder what I'm doing wrong or what I'm
missing - they must have something for this, right?".

Take, for instance, the hardware in our system - it shouldn't fail,
and the system will not work as advertised if it does. *However*,
sooner or later, it probably will fail somehow, and the system needs
to stay up to aid the user in running diagnostics.  Simply including
the hardware in the supervision tree leads to things gradually falling
over in an unacceptable way.

Fred talks about these concepts some here:
http://ferd.ca/it-s-about-the-guarantees.html

To my way of thinking, it really seems like there should be something
more out there in Erlang land for these situations; something that
intermediate people like myself can easily find and make use of and
feel confident we're doing the right thing.

* Better documentation, at least.  I think "the database for a web
site" provides a great example.  The web site should not fall over
when the DB becomes unavailable.  The docs should include code.  We hear
plenty about letting it crash, but there's a significant number of use
cases where no, it's actually more complex than that.

* Some kind of gen_transient_service that gathers up the best
practices and is a "good enough" solution in many cases.  This would
help for the "low level" case of a specific resource.  It could come
with a couple of strategies, and perhaps be pluggable in order to
include more, such as exponential backoff.  A lot of this code
has to look pretty similar: have the connection status in the state,
return errors if it's not connected, have a fast init as well as a
callback that attempts the connection, and then whatever strategy to
handle errors with the connection.  Wrapping it up in a library seems
possible even if it doesn't cover every corner case out there.

* Perhaps some kind of application manager.  I'm actually thinking of
writing code along these lines, as the above is too specific in our
case (I think, at least).  Our hardware management stuff has a variety
of programs that it takes care of, and having client portions of our
code know about all of them is probably not a good idea.  I'd rather
just have the hardware application go down and have our application
manager alert the user, and keep track of what's running: "the
hardware system is up, but the report generation system is down".  I'm
still trying to work out in my head whether this is a good idea,
though; perhaps the gen_transient_service thing is better.
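
To make the gen_transient_service idea a bit more concrete, here is a
rough sketch written as a plain gen_server.  It is only an
illustration under some assumptions: the module name
transient_service, the ConnectFun argument and the delay constants
are all invented, and detecting a connection that drops later on is
left out.  It shows the pieces listed above: connection status in the
state, errors returned while disconnected, a fast init, and a
reconnect attempt with exponential backoff:

```erlang
-module(transient_service).
-behaviour(gen_server).

-export([start_link/1, request/2, status/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(MIN_DELAY, 100).    % first retry delay, in ms (arbitrary)
-define(MAX_DELAY, 30000).  % cap on the retry delay, in ms (arbitrary)

-record(state, {connect_fun, conn = undefined, delay = ?MIN_DELAY}).

%% ConnectFun is a fun of arity 0 returning {ok, Conn} | {error, Reason}.
start_link(ConnectFun) when is_function(ConnectFun, 0) ->
    gen_server:start_link(?MODULE, ConnectFun, []).

request(Pid, Req) -> gen_server:call(Pid, {request, Req}).

status(Pid) -> gen_server:call(Pid, status).

init(ConnectFun) ->
    %% Fast init: do not touch the resource here, so an unavailable
    %% resource can never make the supervisor's start-up fail.
    self() ! connect,
    {ok, #state{connect_fun = ConnectFun}}.

handle_call(status, _From, #state{conn = undefined} = S) ->
    {reply, disconnected, S};
handle_call(status, _From, S) ->
    {reply, connected, S};
handle_call({request, _Req}, _From, #state{conn = undefined} = S) ->
    %% Not connected: report it to the caller instead of crashing.
    {reply, {error, not_connected}, S};
handle_call({request, Req}, _From, #state{conn = Conn} = S) ->
    %% Placeholder: a real service would use Conn to perform Req.
    {reply, {ok, {Conn, Req}}, S}.

handle_cast(_Msg, S) ->
    {noreply, S}.

handle_info(connect, #state{connect_fun = F, delay = Delay} = S) ->
    Result = try F() catch _:Reason -> {error, Reason} end,
    case Result of
        {ok, Conn} ->
            %% Connected: remember the handle and reset the backoff.
            {noreply, S#state{conn = Conn, delay = ?MIN_DELAY}};
        _Error ->
            %% Retry later, doubling the delay up to the cap.
            erlang:send_after(Delay, self(), connect),
            {noreply, S#state{delay = min(Delay * 2, ?MAX_DELAY)}}
    end;
handle_info(_Other, S) ->
    {noreply, S}.
```

Clients call transient_service:request/2 and get
{error, not_connected} while the resource is away, so they can
degrade gracefully rather than take the supervision tree down with
them.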

Thoughts?

Thanks again for reading, and apologies if my normally muddied
thoughts are more silted up than usual; I'm a bit short on sleep.

-- 
David N. Welton

http://www.welton.it/davidw/

http://www.dedasys.com/