[erlang-questions] Architectural quandaries

Jay Nelson jay@REDACTED
Thu Sep 11 16:28:52 CEST 2014


David Welton wrote:
> I'm still struggling a bit with some doubts about how to best
> architect a complex, real-world Erlang application.

"Let it crash" serves a purpose when you have complex code, but somewhere
at the base you want some sense of stability, especially in a loosely coupled
environment.

I would focus on a simple core that starts up and connects things. This core
should maintain a very reliable low-level signal between server and tablet
reporting which of a few states it is in: started, connecting, disconnecting,
running, etc.

When all else is going haywire, this core needs to at least report its state to
the tablet so you can give feedback to the user. For continuity, some data
should be cached on the tablet after connecting / running, so that intermittent
failures can be smoothed over in the user’s eyes. Just because the components
may be currently unstable or reconnecting doesn’t mean the user experience
has to be.

Recently I have been confronted with the excess complexity of our systems
and have been sketching out new approaches to bring systems up and down
in a more controlled manner (and dealing with multi-minute startups in a better
way). You should consider sketching out a supervisor hierarchy along the lines
of:

1) Supervise the root's children with rest_for_one
     - this guarantees startup ordering
     - allows later children to make assumptions about earlier ones

2) Start fast core things early
     - included base libraries with no real initialization
     - required ets tables (you might want to cache events sent to / received from the tablet)

3) Start an FSM to control startup
     - send messages to it to start / stop children of the other supervisors

4) Start the rest of the supervisors with no children
     - if possible one_for_one under a single last rest_for_one supervisor
     - add start_child / stop_child functions and call these from the FSM

Root (rest_for_one)
    - libs
    - pre-allocated resources
    - FSM
    - Services (one_for_one)
         - Service 1..N

5) Use a gen_event style mechanism to trigger the #3 FSM
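Steps 1, 2 and 4 might be sketched roughly as follows. This is only an
illustration under assumptions: the names core_sup, services_sup and
tablet_events are made up, one module serves as both supervisors, and a
small process owning an ets table stands in for the "pre-allocated
resources".

```erlang
-module(core_sup).
-behaviour(supervisor).

-export([start_link/0, start_service/2, stop_service/1]).
-export([init/1, start_table_owner/0, table_owner_init/0]).

%% All names here (core_sup, services_sup, tablet_events) are hypothetical.

start_link() ->
    supervisor:start_link({local, core_sup}, ?MODULE, root).

%% Step 4: the FSM would call this to bring one service up.
start_service(Id, {M, F, A}) ->
    Spec = #{id => Id, start => {M, F, A}, restart => permanent},
    supervisor:start_child(services_sup, Spec).

%% Step 4: take one service down without touching its peers.
stop_service(Id) ->
    case supervisor:terminate_child(services_sup, Id) of
        ok -> supervisor:delete_child(services_sup, Id);
        {error, _} = Err -> Err
    end.

%% Step 1: rest_for_one root, so children start in order and a crash
%% restarts only the children listed after the one that failed.
init(root) ->
    Children = [
        %% Step 2: fast core resources first.
        #{id => table_owner, start => {?MODULE, start_table_owner, []}},
        %% Step 3: the startup FSM child would be listed here.
        %% Step 4: services supervisor starts empty.
        #{id => services_sup,
          start => {supervisor, start_link,
                    [{local, services_sup}, ?MODULE, services]},
          type => supervisor}
    ],
    {ok, {#{strategy => rest_for_one, intensity => 3, period => 10},
          Children}};
init(services) ->
    {ok, {#{strategy => one_for_one, intensity => 10, period => 10}, []}}.

%% A process whose only job is to own the ets table.
start_table_owner() ->
    proc_lib:start_link(?MODULE, table_owner_init, []).

table_owner_init() ->
    tablet_events = ets:new(tablet_events,
                            [named_table, public, ordered_set]),
    proc_lib:init_ack({ok, self()}),
    receive stop -> ok end.
```

Because services_sup starts with no children, the root comes up quickly
and every later start or stop is an explicit call.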

Spawning deep supervisor trees and using start_phases makes for a slow
startup and a worse situation on recovery (start_phases won’t run on recovery).
Put all start coordination in the FSM and allow for various startup orderings
by allowing service peers rather than strict hierarchies. That way one
service can go offline without degrading other services.
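A minimal sketch of such a coordinating FSM, assuming gen_statem (the
modern replacement for gen_fsm) and a made-up event vocabulary of go,
{up, Id} and {down, Id}. The module name startup_fsm, the dependency map
and the StartFun callback are all assumptions for illustration; StartFun
would typically wrap a supervisor's start_child call.

```erlang
-module(startup_fsm).
-behaviour(gen_statem).

-export([start_link/2, notify/1, status/0]).
-export([init/1, callback_mode/0, handle_event/4]).

%% Deps     :: #{ServiceId => [ServiceIds that must be up first]}
%% StartFun :: fun((ServiceId) -> any()), supplied by the caller.
start_link(Deps, StartFun) ->
    gen_statem:start_link({local, ?MODULE}, ?MODULE, {Deps, StartFun}, []).

%% Fire-and-forget events: go | {up, Id} | {down, Id}.
notify(Event) -> gen_statem:cast(?MODULE, Event).

status() -> gen_statem:call(?MODULE, status).

callback_mode() -> handle_event_function.

%% Nothing is started in init/1; startup begins on the go event.
init({Deps, StartFun}) ->
    {ok, starting,
     #{deps => Deps, start => StartFun, up => [], requested => []}}.

handle_event(cast, go, _State, Data) ->
    {keep_state, start_ready(Data)};
handle_event(cast, {up, Id}, _State, Data = #{up := Up, deps := Deps}) ->
    Up1 = lists:usort([Id | Up]),
    Data1 = start_ready(Data#{up := Up1}),
    Next = case maps:keys(Deps) -- Up1 of
               [] -> running;
               _ -> starting
           end,
    {next_state, Next, Data1};
handle_event(cast, {down, Id}, _State,
             Data = #{up := Up, requested := Req}) ->
    %% Peers stay up; only this service is forgotten, so a later
    %% go event can request it again.
    {next_state, starting,
     Data#{up := Up -- [Id], requested := Req -- [Id]}};
handle_event({call, From}, status, State, #{up := Up}) ->
    {keep_state_and_data, [{reply, From, {State, Up}}]}.

%% Request every service whose dependencies are all up and which
%% has not been requested yet.
start_ready(Data = #{deps := Deps, start := StartFun,
                     up := Up, requested := Req}) ->
    Ready = [Id || {Id, Needs} <- maps:to_list(Deps),
                   not lists:member(Id, Req),
                   Needs -- Up =:= []],
    lists:foreach(StartFun, Ready),
    Data#{requested := Req ++ Ready}.
```

With Deps = #{db => [], cache => [db], api => [db, cache]}, a go event
requests db, {up, db} requests cache, and {up, cache} requests api.
Services are peers in a flat map, so one going down does not stop the
others.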

Use a protocol like UBF between server and tablet so you have a way of
wringing out all the protocol issues. Focus on just getting the server rock
steady with admin control (up/down services independently) and robust
pinging of the tablet with buffered events on both sides. Look at SSE-style
data exchange for lightweight continuous comms that start up and shut down
quickly, and use this as a base prior to and below the UBF (which is for more
complicated transactions of the application). The base comms should be
one-way broadcast in both directions because you can’t know if the other
side is listening at all times.
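One way to picture the buffering on each side is a bounded buffer of
sequence-numbered events. This is a sketch under assumptions, not part of
UBF or SSE: the module name event_buffer, the sequence numbers and the
"last seen" replay are all made up here, on the guess that each side
mentions the last sequence number it saw in its own broadcasts.

```erlang
-module(event_buffer).
-export([new/1, push/2, since/2]).

%% A bounded buffer of {Seq, Event} pairs. When full, the oldest
%% event is dropped so a silent peer cannot grow it without limit.
new(Max) when is_integer(Max), Max > 0 ->
    #{max => Max, next => 1, q => queue:new()}.

%% Record an event that is being broadcast.
push(Event, B = #{max := Max, next := Seq, q := Q0}) ->
    Q1 = queue:in({Seq, Event}, Q0),
    Q2 = case queue:len(Q1) > Max of
             true -> queue:drop(Q1);
             false -> Q1
         end,
    B#{next := Seq + 1, q := Q2}.

%% Events still buffered that the peer has not seen, given the last
%% sequence number it reported (0 if it has seen nothing).
since(LastSeen, #{q := Q}) ->
    [E || E = {Seq, _} <- queue:to_list(Q), Seq > LastSeen].
```

When the other side reappears, since/2 gives the events to rebroadcast;
anything older than the buffer limit is simply gone.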

Simulate hollow services with receive after Estimated_Compute_Time -> ok end
and focus on the connection behaviour before you write any application
logic.
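A hollow service along these lines could be as small as the following
sketch. The module name hollow_service, the message shapes and the 5
second client timeout are assumptions for illustration only.

```erlang
-module(hollow_service).
-export([start_link/2, request/2]).

%% Start a stand-in service that only pretends to compute.
%% ComputeMs is the estimated compute time in milliseconds.
start_link(Name, ComputeMs) ->
    Pid = spawn_link(fun() -> loop(Name, ComputeMs) end),
    {ok, Pid}.

%% Synchronous request with a client-side timeout.
request(Pid, Req) ->
    Ref = make_ref(),
    Pid ! {request, self(), Ref, Req},
    receive
        {Ref, Reply} -> Reply
    after 5000 ->
        {error, timeout}
    end.

loop(Name, ComputeMs) ->
    receive
        {request, From, Ref, Req} ->
            %% No application logic, just the estimated delay.
            receive after ComputeMs -> ok end,
            From ! {Ref, {ok, {Name, Req}}},
            loop(Name, ComputeMs)
    end.
```

Since start_link/2 returns {ok, Pid}, a stand-in like this can be placed
under a supervisor in place of the real service while the connection
behaviour is being worked out.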

Sounds like fun!

jay



