<div dir="ltr"><div>I struggle a lot with how to design Erlang systems. Everything is easy and very powerful as long as you stay on one node: supervision trees, processes and all that. <br><br>However, to be fault-tolerant you need at least three servers, and this is where my problem comes in. All of a sudden the nice design is not so nice any longer.<br><br>gen_server is all about state. If you want to be fault-tolerant, this state must somehow be shared, or at least that is my assumption. If it is wrong, I'd be happy to hear about alternative approaches.<br><br>If state needs to be shared, I only see two alternatives:<br><br>1) Push state to the database. To me this is an anti-pattern. All of a sudden I don't need gen_servers or supervisors or anything else, because the state is fetched from the database anyway. So by pushing everything to the database I basically don't need Erlang either. Pushing to the database therefore cannot be the solution. <br><br>2) Implement some distributed protocols to solve these problems. Distribution, however, is not trivial, and it is something you want to rely on robust libraries to do. As we know, Erlang/OTP doesn't provide any except application failover, which people seem not to recommend. I've found:<br><br>* riak_core, which I find a little too coupled with riak to be optimal. I've played around with it, and it sort of makes your entire system revolve around riak_core.<br>* gen_leader, which people in general seem very suspicious of<br></div><div>* locks_leader, which may be an improvement on gen_leader, but I don't know how production-ready it is<br></div><div>* a couple of Raft implementations. They are very new and I haven't tried them out<br></div><div><br>and I guess one of the above libraries can be used to distribute the state of every gen_server that needs it.<br></div><br><div>I have a feeling I am sort of blinded by traditional design and can't quite see how best to leverage Erlang and OTP. 
Perhaps I am being too strict in my requirements, and the system doesn't actually have to be always consistent and always running. I've had a few ideas on how to implement an ad-hoc, bug-ridden version of distribution that may solve my problems, but it doesn't feel right.<br><br></div><div>Any insight (reading material, open source software) into how to design distributed, fault-tolerant systems with Erlang/OTP is welcome. <br><br></div><div>Cheers,<br></div><div>Martin<br></div><div><br><br></div></div>