[erlang-questions] Help with design of distributed fault-tolerant systems

Thu Oct 8 12:46:23 CEST 2015

Hi,

Thanks for your email. Very helpful and I'm going to give you a lengthy
reply in return.

> * What is your state? Do you really/why do you need it always
> available/consistent?
This is a good question and got me thinking on how best to reply. While
thinking I looked at the problem in a different way.

Basically we have the following data:
* Generated data and customer data. Stored in a key-value database (mnesia
or riak). Should ideally be consistent but we are fine with riak's eventual
consistency here. If we have acknowledged that we have received data for a
particular device it must be there when the device requests it (which
usually happens at most seconds later). Some customer data can be lost
without implications whereas loss of other data would lead to loss of
service to a particular device.

* Transaction Logs. Changes to database data are stored in transaction logs
and replicated to external systems. We must capture all data but it doesn't
have to be always available as long as no data is lost. This is not a
real-time sync and larger delays are tolerated depending on use case.

* Configuration data. A current configuration which applies at a specific
time. This data changes periodically (by timed tasks) and by operational
input. Changes perhaps 10 times daily. Must be available, can handle short
inconsistencies (i.e. a node using data that is a few minutes stale is OK).

> A very significant number of reliable* distributed applications do not
> need to consistently share state. Only that 1%** does, and that's
> difficult, but usually an application is in the 99%-pool. Maybe your
> application is there too?
Yes, for a number of our sub-components (perhaps all) I think this would be
the case.

> If you told us a bit more about the application you're building, you
> would very likely receive more to-the-point and helpful responses. :-)

Well, you asked for it ;)

The system generates and imports data from various sources which it serves
to millions of devices. The data is requested by the clients and served
through HTTP. It has to be highly available with low latency and handle a
fair bit of concurrency (hence Erlang).

The generated data is also replicated to other systems, both for
multi-data-center replication and as export to independent systems.

It is all hosted and operated by our customers. I think this is important
to mention because from an operations point of view the system needs to be
very simple and all fault-tolerance should ideally be automatic, which also
means it is hard to use separate products for each and every need. For
example we cannot afford a tech-stack like: Postgres + Riak + RabbitMQ +
Redis + HAProxy or similar, as the operational overhead is too much. We do
support either riak or mnesia as a database and that is about it. Then our
product needs to take care of the rest.

The system contains a number of sub-systems, and I guess I've struggled to
some degree with multi-node on each and every one of them. It is so easy to
start with a gen_server for prototyping but moving to multi-node from there
has been hard.

*) Process register. Each client gets its own process which is held in a
process register. It started as a gen_server, then moved to ets for better
performance and then to mnesia to get distribution.
I'd really like to implement this with consistent hashing, keeping the
primary and secondary node list in some shared state. Sounds a lot like
riak_core to me. It is fine for a process to crash and die but we need to
avoid the data inconsistencies that could happen if two processes for the
same client were started on different nodes (during a net-split, say) and
then one is persisted, meaning the other's data is lost. In this case it is
better to lose both.
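
To make the consistent-hashing idea concrete, here is a minimal sketch, in
Python for brevity rather than Erlang. It is not riak_core and not what we
run today; HashRing, owners, the vnodes parameter and the node names are all
hypothetical. It only shows how a client id could map to a primary and a
secondary node, assuming every node agrees on the member list.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (a 160-bit integer from SHA-1)."""
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        # Each node is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def owners(self, key, n=2):
        """Return up to n distinct nodes for key: primary first, then fallbacks."""
        distinct = len({node for _, node in self._ring})
        n = min(n, distinct)
        # Walk clockwise from the key's position, collecting distinct nodes.
        i = bisect.bisect_right(self._points, _hash(key))
        found = []
        while len(found) < n:
            node = self._ring[i % len(self._ring)][1]
            if node not in found:
                found.append(node)
            i += 1
        return found
```

For example, HashRing(["a@host1", "b@host2", "c@host3"]).owners("device-42")
returns two distinct node names, and removing the third node leaves that
answer unchanged. Note that this only decides placement; it does not by
itself prevent two processes from being started during a net-split.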

*) Periodic tasks. We have a fair number of background tasks that need to
be run. Half of these need to be global, i.e. they should run on an interval
but ideally on only one node. Currently each task is completely independent
and runs in its own process. For global tasks we simply run them on one
node and disable them on the others. Not ideal from a fault-tolerance point
of view. If we hosted it ourselves I could be fine with such a solution but
we need automated failover. I've played around with gen_leader and
locks_leader to have a distributed task list but don't know if this is the
right approach.
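
One alternative to full leader election is a per-task lease: a node runs a
global task only while it holds an unexpired lease for it, so another node
takes over automatically once the lease lapses. The sketch below is Python
for illustration only, and it swaps in leases rather than showing gen_leader
or locks_leader. LeaseTable and acquire are hypothetical names, and the
in-memory dict stands in for a shared store with atomic updates, which is an
assumption and the hard part in practice.

```python
class LeaseTable:
    """In-memory stand-in for a shared store with atomic updates (assumed)."""

    def __init__(self):
        self._leases = {}  # task name -> (holder node, expiry time)

    def acquire(self, task, node, now, ttl):
        """Grant or renew the lease if it is free, expired, or already ours.

        Returns True when `node` may run `task` until `now + ttl`.
        """
        holder, expires_at = self._leases.get(task, (None, 0.0))
        if holder is None or holder == node or expires_at <= now:
            self._leases[task] = (node, now + ttl)
            return True
        return False
```

Each node would call acquire before every run of a global task and skip the
run when it returns False. The time is passed in explicitly to keep the
sketch easy to test; clock differences between nodes are ignored here.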

*) Routing. I have a gen_server router which redirects requests to the
specific data source. The gen_server process should run on every node and
must share state. The state doesn't change very often, but it is dynamic
and is dictated by an external system (which tells us to change it once in
a while).

*) Caching. Costly computations are cached in a process per "Id". A result
is cached on the node where it is requested; however, it can be requested
on multiple nodes, and when a state change is triggered (which can happen
on any node) all nodes must be notified. Currently we use pg2 process groups
and send the new state to each process. Of course this is a bit ad hoc, and
if a message doesn't get there the processes will be out of sync.
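
One way to make a lost notification recoverable is to attach a version to
every state change and to compare versions periodically. The following is a
minimal Python sketch of that idea, not our pg2 code. CacheEntry, apply and
repair are hypothetical names, and it assumes versions are assigned so that
a higher number always means newer state (for example by the database).

```python
class CacheEntry:
    """One node's cached copy of the state for a single Id (hypothetical)."""

    def __init__(self):
        self.version = 0
        self.value = None

    def apply(self, version, value):
        """Apply an update only if it is newer; return True when applied."""
        if version > self.version:
            self.version, self.value = version, value
            return True
        return False


def repair(entries):
    """Anti-entropy pass: bring every copy up to the newest version seen."""
    if not entries:
        return
    newest = max(entries, key=lambda e: e.version)
    for entry in entries:
        entry.apply(newest.version, newest.value)
```

With this, duplicated or reordered notifications are harmless, and a copy
that missed an update stays stale only until the next repair pass.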

*) Message queues or transaction logs. Data must be replicated to
distributed systems (not via distributed Erlang though), which means we
need to store some amount of transactions. Data is immutable and only ever
appended. Transactions can happen on any node. This also started as a
gen_server but moved into mnesia for distribution. I've been looking for
OTP applications that already provide a persistent distributed message
queue but haven't found any (and as mentioned an external MQ is probably not
doable). Here I thought I needed strict ordering, but I don't really, as
long as I can guarantee that all the data eventually arrives at its
destination.
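
The "everything eventually arrives" requirement can be sketched as an
append-only log with a per-destination acknowledgement position, giving
at-least-once delivery. This is an illustrative Python sketch, not our
mnesia implementation. OutboundLog and its methods are hypothetical names,
persistence is left out, and it assumes each node keeps its own log, since
no ordering across nodes is needed.

```python
class OutboundLog:
    """Append-only log with one acknowledged position per destination."""

    def __init__(self):
        self._entries = []  # never modified in place, only appended to
        self._acked = {}    # destination -> number of entries acknowledged

    def append(self, entry):
        """Store an entry and return the log position just after it."""
        self._entries.append(entry)
        return len(self._entries)

    def pending(self, dest):
        """Entries that `dest` has not acknowledged yet, oldest first."""
        return self._entries[self._acked.get(dest, 0):]

    def ack(self, dest, upto):
        """Record that `dest` has received the first `upto` entries."""
        upto = min(upto, len(self._entries))
        # Acknowledgements never move backwards, so a late ack is harmless.
        self._acked[dest] = max(self._acked.get(dest, 0), upto)
```

A sender keeps resending pending(dest) until the destination acknowledges
it, so an entry may arrive more than once and the destination has to
tolerate duplicates.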

Again, we do have sort of working solutions for the above but I feel they
are inconsistent and not robust enough. It is hard to fit all pieces above
into a coherent system.

>Often it's about making small compromises in the
>system you're building (which, turns out, don't matter for the users)
Fully agree.

Hopefully this gives you an idea of what the system needs to do.

Cheers,
Martin

On Thu, Oct 8, 2015 at 8:23 PM, Motiejus Jakštys <desired.mta@REDACTED>
wrote:

> On Thu, Oct 8, 2015 at 3:53 AM, Martin Karlsson
> <martin@REDACTED> wrote:
> > I struggle a lot with how to design erlang systems. Everything is easy
> and
> > very powerful as long as you stay on one node. Supervision tree, and
> > processes and all that.
> >
> > However, to be fault tolerant you need at least three servers and here is
> > where my problem comes in. All of a sudden the nice design is not so nice
> > any longer.
> >
> > gen_server is all about state. And if you want to be fault-tolerant this
> > state must somehow be shared, or at least it is my assumption that it
> has to
> > be shared. If not I'd be happy to hear about alternative approaches.
>
> A very significant number of reliable* distributed applications do not
> need to consistently share state. Only that 1%** does, and that's
> difficult, but usually an application is in the 99%-pool. Maybe your
> application is there too?
>
> Think about:
> * What is your state? Do you really/why do you need it always
> available/consistent?
> * How do you handle updates to the state? Often it's possible to push
> the state consistency problem away from your service -- e.g. the
> client (multi-homing or sending full batches) or somewhere downstream.
>
> If you told us a bit more about the application you're building, you
> would very likely receive more to-the-point and helpful responses. :-)
>
> Also, a book with The Right Questions (not necessarily for Erlang)
> would be interesting. Often it's about making small compromises in the
> system you're building (which, turns out, don't matter for the users)
> for simplicity of the design and implementation (e.g. making it
> non-shared-state).
>
> [*]: that can handle failure of any single server.
> [**]: number made up of course, but my feeling is that it's really short.
>