[erlang-questions] Help with design of distributed fault-tolerant systems

Thu Oct 8 03:53:03 CEST 2015

I struggle a lot with how to design Erlang systems. Everything is easy and
very powerful as long as you stay on one node: supervision trees,
processes and all that.

However, to be fault tolerant you need at least three servers and here is
where my problem comes in. All of a sudden the nice design is not so nice
any longer.

gen_server is all about state. And if you want to be fault-tolerant this
state must somehow be shared, or at least it is my assumption that it has
to be shared. If not I'd be happy to hear about alternative approaches.

If state needs to be shared I only see two alternatives:

1) Push state to the database. To me this is an anti-pattern. All of a
sudden I don't need gen_servers or supervisors or anything, because the
state is fetched from the database anyway. So basically, by pushing
everything to the database, I don't need Erlang either. Pushing to the
database can therefore not be the solution.
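
To make option 1 concrete, here is a minimal sketch of what I mean. The
module name db_counter and the counter itself are hypothetical, and dets
is only a stand-in for "the database" (a real setup would use a store
shared between nodes). Note that the gen_server keeps nothing but the
table handle, so it is just a thin wrapper around the store:

```erlang
-module(db_counter).
-behaviour(gen_server).

-export([start_link/1, increment/0, value/0]).
-export([init/1, handle_call/3, handle_cast/2, terminate/2]).

%% Hypothetical example: a counter whose only state lives in a dets table.
%% File is the path of the dets file (an assumption for this sketch).
start_link(File) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, File, []).

increment() -> gen_server:call(?MODULE, increment).
value() -> gen_server:call(?MODULE, value).

init(File) ->
    process_flag(trap_exit, true),   % so terminate/2 runs on shutdown
    {ok, Tab} = dets:open_file(?MODULE, [{file, File}]),
    {ok, Tab}.

%% Every request reads from and writes to the store; nothing is cached.
handle_call(increment, _From, Tab) ->
    New = read(Tab) + 1,
    ok = dets:insert(Tab, {count, New}),
    {reply, New, Tab};
handle_call(value, _From, Tab) ->
    {reply, read(Tab), Tab}.

handle_cast(_Msg, Tab) -> {noreply, Tab}.

terminate(_Reason, Tab) -> dets:close(Tab).

%% Returns the stored count, or 0 if nothing has been stored yet.
read(Tab) ->
    case dets:lookup(Tab, count) of
        [{count, N}] -> N;
        [] -> 0
    end.
```

A restarted db_counter picks up where the old one left off, but only
because the process itself never owned the state in the first place.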

2) Implement some distributed protocols to solve these problems.
Distribution, however, is not trivial, and it is something you want to
rely on robust libraries for. As we know, Erlang/OTP doesn't provide any
except application failover, which people seem not to recommend. I've found

* riak_core, which I find a little bit too coupled with Riak to be optimal.
I've played around with it and it sort of makes your entire system revolve
around riak_core.
* gen_leader, which people in general seem very suspicious of
* locks_leader, which may be an improvement on gen_leader, but I don't know
how production-ready it is
* a couple of Raft implementations. Very new, and I haven't tried them out

and I guess one of the above libraries can be used to distribute the state
of every gen_server that needs it.
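
For reference, the built-in application failover I mentioned under 2) is
configured through kernel parameters. The fragment below is a sketch of a
sys.config for the first of three nodes; the application name myapp, the
node names and the timeouts are all made up for illustration:

```erlang
%% sys.config for node a@host1 (hypothetical names and timeouts).
[{kernel,
  [%% myapp runs on a@host1; if that node goes down it is restarted on
   %% b@host2 or c@host3 (equal priority) after a 5000 ms wait.
   {distributed, [{myapp, 5000, ['a@host1', {'b@host2', 'c@host3'}]}]},
   %% Nodes that must be up before this node finishes starting.
   {sync_nodes_mandatory, ['b@host2', 'c@host3']},
   {sync_nodes_timeout, 30000}]}].
```

This only moves where the application runs. It does nothing for the
gen_server state, which is lost on failover unless it is kept somewhere
else.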

I have a feeling I am sort of blinded by traditional design and can't
see how best to leverage Erlang and OTP. Perhaps I am being too strict in
my requirements, and the system doesn't actually have to be always
consistent and always running. I've had a few ideas on how to implement
an ad-hoc, bug-ridden version of distribution that may solve my problems,
but it doesn't feel right.

Any insight (reading material, open source software) into how to design
distributed, fault-tolerant systems with Erlang/OTP is welcome.

Cheers,
Martin