[erlang-questions] Distributed Erlang Architecture

Thu May 21 17:11:07 CEST 2015

On Thu, May 21, 2015 at 9:48 AM, Chris Clark <boozelclark@REDACTED> wrote:
> Hi Garrett
>
> Thanks very much for the information; that does make a lot of sense.
>
> I am trying to wrap my head around how large scale Erlang systems are
> structured in practice but I see that it will depend on the application and
> is only something worth thinking about once/if it becomes a real problem.
>
> What initially had me thinking about it was availability rather than scale.
> A lot of what I have read and seen in videos suggests avoiding hot code
> loading where possible and rather having multiple independent app instances
> that can just be swapped, one at a time with new instances (provided the app
> allows it).

On that point, if you can rid your application of state, it's a
trivial problem - you can run multiple instances of the service
behind a router. The router becomes your single point of failure, but
its role is presumably much simpler than your application's, so in
theory it's less likely to be down.

Your best bet is to build failover logic into clients directly and let
them handle outages and re-routing.
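
As a rough sketch of what client-side failover could look like,
assuming the client reaches the service over distributed Erlang: try
each node in a list and move on when one doesn't answer. The module
name, the node list argument and the 5 second timeout are all made up
for illustration.

```erlang
-module(failover_client).
-export([call/4]).

%% Hypothetical helper: try Nodes in order and return the first
%% successful result. Any {badrpc, _} (node down, timeout, or a crash
%% in the called function) is treated as "try the next node".
call([], _Mod, _Fun, _Args) ->
    {error, no_nodes_available};
call([Node | Rest], Mod, Fun, Args) ->
    case rpc:call(Node, Mod, Fun, Args, 5000) of
        {badrpc, _Reason} ->
            call(Rest, Mod, Fun, Args);
        Result ->
            {ok, Result}
    end.
```

A caller would use it along the lines of
failover_client:call(['svc@host1', 'svc@host2'], my_service, handle, [Req]),
where the node names and my_service:handle/1 are placeholders. Note
that retrying on another node is only safe if the request can be
repeated without side effects.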

If you have state to manage you have a harder problem - I can't say
anything specific other than it's hard.

To start though, I'd build your app without concern for availability
and get it doing something useful. To handle outages, focus on a fast
restore strategy to get your server back and running as quickly as
possible. If you focus on that problem you should be able to get a
server restored in under a minute. For many, many applications, a
minute of downtime in the event of a server failure is outstanding.

The advantage of this approach is that it's generally acceptable and
avoids very complicated routing/replication/failover strategies needed
in faster recovery scenarios.

If you absolutely can't tolerate that sort of outage you'll need to go
down the hard road of building a more complex system. But you'll still
need to understand the specifics of your app and what it needs to run,
so I'd get it working first, then deal with advanced recovery
scenarios.

All that punting said, OTP has some recovery features that ostensibly
can be used for near-real-time service recovery, but I don't have any
experience with them. Others can weigh in.
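
One such feature is OTP's distributed applications, where the kernel
application restarts an app on another node if the node running it
goes down. Going by the OTP documentation, and untested by me, the
configuration looks roughly like this. The app name and node names
are placeholders.

```erlang
%% sys.config - same file on every participating node.
%% myapp, a@host1, b@host2 and c@host3 are placeholder names.
[{kernel,
  [%% Run myapp on a@host1; if that node goes down, wait 5000 ms and
   %% then restart it on b@host2 or c@host3 (equal priority).
   {distributed, [{myapp, 5000, ['a@host1', {'b@host2', 'c@host3'}]}]},
   %% At boot, wait up to 30000 ms for the other nodes to come up.
   {sync_nodes_optional, ['b@host2', 'c@host3']},
   {sync_nodes_timeout, 30000}
  ]}
].
```

Each node is started with the same config (erl -sname a -config sys,
and so on) and application:start(myapp) is called on all of them; the
app only actually runs on one node at a time. None of this helps with
replicating state, which is still up to the application.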

Garrett