[erlang-questions] Architectural quandaries

Thu Sep 11 15:35:04 CEST 2014

Hi,

I'm still struggling a bit with some doubts about how to best
architect a complex, real-world Erlang application.  Sorry for the
long and somewhat rambling post, but it helps a bit just to write down
what's going on.

Dramatis personæ:

* A hardware system, which consists of some C++ programs managed by
Erlang.  Without these programs, the machine cannot do the job it was
built for.

* A tablet, which acts as the user interface, and talks with Erlang
via the Erlang protocol and HTTP, where appropriate.

* A Postgres database.

* Various and sundry other subsystems, such as one for generating reports.

* A service application (in the Erlang sense) that can be used to
diagnose problems, run scripts on the underlying Linux system, and so
on.

Anyone who's been around Erlang any length of time knows that the idea
is to "let it crash!", though it may strike fear into the hearts of
managers listening in on the conversation.

Clearly though, the answer has to be more complex than that.  For
instance, in our case, if the hardware has gone haywire, one approach
would be to keep trying to kick off the programs managing it, and as
that repeatedly fails, propagate the error up the supervision tree,
until the whole node falls over and is restarted by the heartbeat
script.  By the way: we don't care too much about having tons of
uptime or high availability or that sort of thing.

Perhaps, when we let it crash, if the code to communicate with the
tablet hasn't already been torn down, the system will manage to leave
a hurried, scribbled note to the tablet explaining that it's in the
process of crashing and will restart.  At this point though, the
operator at the tablet is going to keep getting these "CRASHING,
SORRY, BYE" messages and can only interact with the system for a
limited time before it all falls down again (which it is likely to do
if the problem persists).  That sounds bound to be frustrating.

So, clearly the hardware system should be isolated, so that its
thrashing around with an unforeseen problem will at least leave the
system available to do things like print reports for data already
generated, and interact with the database to retrieve stored data,
even if in its degraded state, it can no longer use the hardware to
acquire new data.  Importantly, the end user in front of the tablet
might be able to push some buttons to run diagnostics in order to help
determine the problem.
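
A minimal sketch of that kind of isolation inside a single supervision
tree.  The module names (node_sup, db_sup, report_sup, tablet_sup,
hw_sup) are hypothetical placeholders, not names from the actual
system, and the restart limits are arbitrary.  The hardware subtree is
a 'temporary' child, so its death is neither restarted by this
supervisor nor counted against its restart intensity:

```erlang
-module(node_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% Give up (and crash upwards) after more than 5 restarts in 10 seconds.
    SupFlags = {one_for_one, 5, 10},
    Children =
        [%% Hypothetical subsystems that should always be restarted.
         {db_sup, {db_sup, start_link, []},
          permanent, infinity, supervisor, [db_sup]},
         {report_sup, {report_sup, start_link, []},
          permanent, infinity, supervisor, [report_sup]},
         {tablet_sup, {tablet_sup, start_link, []},
          permanent, infinity, supervisor, [tablet_sup]},
         %% Hypothetical hardware subtree: 'temporary' means this
         %% supervisor never restarts it, so a hardware meltdown
         %% cannot exhaust the restart intensity above.
         {hw_sup, {hw_sup, start_link, []},
          temporary, infinity, supervisor, [hw_sup]}],
    {ok, {SupFlags, Children}}.
```

The price is that something else has to bring the hardware back.  A
temporary child's spec is deleted when it terminates, so that would
mean calling supervisor:start_child/2 with the full child spec again
rather than supervisor:restart_child/2.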

I've read some of "JLOUIS' Ramblings" on error kernels:
http://jlouisramblings.blogspot.it/2010/11/on-erlang-state-and-crashes.html
- a good read, but a bit on the philosophical side.  I need to
translate the ideas into working Erlang code.  Several things come to
mind:

* The hardware control Erlang application, which is currently
'permanent', is switched to 'temporary', and some *other* code tries
to restart it when it crashes, and reports on that to the tablet
interface.  It would be necessary to write this code, because it's not
something provided by OTP.  I get a little bit of a feeling,
considering this idea, that "if it's not in OTP, maybe it's not a good
idea".

* Somewhere in the hardware code's supervision tree, we put a
supervisor that has MaxR and MaxT values like 100000000, 1 (speaking
of which, wouldn't it be nicer to just have an atom like
always_restart_never_crash?), and the interface code keeps tabs on
what's happening and can act accordingly.

* We could employ, in some way, another node.  I'm less enthusiastic
about this, because we don't need any kind of high availability, and
because it would most likely complicate things more than we need.  One
of the above solutions seems like it ought to be workable.
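
For what it's worth, the first idea can be sketched in a few dozen
lines.  Everything named here is an assumption made for illustration:
the application name hw_control, its registered top supervisor
hw_control_sup, the 5 second retry delay, and notify_tablet/1, which
just logs where the real system would talk to the tablet.  The
watchdog starts the application as 'temporary', monitors its top
supervisor, and tries again after a pause when it goes down:

```erlang
-module(hw_watchdog).
-behaviour(gen_server).

-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

-define(APP, hw_control).      % hypothetical application name
-define(TOP, hw_control_sup).  % hypothetical registered top supervisor
-define(RETRY_MS, 5000).       % arbitrary pause between attempts

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
    self() ! start_hw,
    {ok, 0}.                   % state: number of times the hardware went down

handle_info(start_hw, Downs) ->
    %% 'temporary': if this application dies, the other applications
    %% and the node keep running.
    case application:start(?APP, temporary) of
        ok ->
            watch();
        {error, {already_started, ?APP}} ->
            watch();
        {error, Reason} ->
            notify_tablet({hw_start_failed, Reason}),
            retry_later()
    end,
    {noreply, Downs};
handle_info({'DOWN', _Ref, process, _Who, Reason}, Downs) ->
    notify_tablet({hw_down, Reason, Downs + 1}),
    retry_later(),
    {noreply, Downs + 1};
handle_info(_Other, Downs) ->
    {noreply, Downs}.

handle_call(_Request, _From, Downs) -> {reply, {downs, Downs}, Downs}.
handle_cast(_Msg, Downs) -> {noreply, Downs}.
terminate(_Reason, _Downs) -> ok.
code_change(_OldVsn, Downs, _Extra) -> {ok, Downs}.

%% If ?TOP is not running, a 'DOWN' with reason noproc arrives at once.
watch() ->
    _Ref = erlang:monitor(process, ?TOP),
    ok.

retry_later() ->
    _Timer = erlang:send_after(?RETRY_MS, self(), start_hw),
    ok.

%% Placeholder: the real system would push this to the tablet interface.
notify_tablet(Event) ->
    error_logger:warning_msg("hardware watchdog: ~p~n", [Event]).
```

The watchdog itself would live under a supervisor outside the
hardware application, next to the tablet interface code.  The retry
delay is what keeps a persistent hardware fault from turning into a
tight restart loop, which is also the main difference from simply
setting MaxR to a huge number.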

To anyone who's made it to the end: a free beer if you ever happen by
Padova, Italy.

Thanks!
-- 
David N. Welton

http://www.welton.it/davidw/

http://www.dedasys.com/