[erlang-questions] "How to build reliable system" literature

Tue Aug 7 10:29:14 CEST 2007

Hello, all!

We are building a distributed system (using Java, sorry :-)), and the
system is expected to have the following properties:
1. It consists of several (>=5) widely distributed servers (500-1500 km
apart), each of them a mirror of the others (in the future the servers
will probably become small clusters).
2. It is expected to have a throughput of at least 4000-5000
messages/sec (messages are generally short, such as new coordinates of
something, power on/off, etc.).
3. It is expected to be as reliable as possible; this includes
guaranteed delivery, handling failover without losing active sessions,
adding/removing servers on the fly, and so on.

The "guaranteed delivery" buzzword in our case means a message protocol
that sends messages with acknowledgements, and this is not a very happy
solution: problems arise with growing queues of pending
messages/acknowledgements and with performance. Yes, the throughput is
not terribly huge, but it is still considerable.
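
To make the queue-growth problem concrete, here is a minimal sketch of
the sender side of a send-with-acknowledgement protocol. It is not the
protocol described in this post, only an illustration of one common way
to keep the pending queue bounded: sequence numbers, cumulative
acknowledgements, and a cap on unacknowledged messages. The names
AckedSender, WindowFull, transport and max_pending are all hypothetical.

```python
from collections import OrderedDict


class WindowFull(Exception):
    """Raised when too many messages are still waiting for an acknowledgement."""


class AckedSender:
    """Sender side of a send-with-acknowledgement scheme (illustrative only)."""

    def __init__(self, transport, max_pending=1000):
        self.transport = transport      # hypothetical callable: transport(seq, payload)
        self.max_pending = max_pending  # hypothetical cap on unacknowledged messages
        self.next_seq = 0
        self.pending = OrderedDict()    # seq -> payload, oldest first

    def send(self, payload):
        # Refuse new work instead of letting the pending queue grow without bound.
        if len(self.pending) >= self.max_pending:
            raise WindowFull("receiver is not acknowledging fast enough")
        seq = self.next_seq
        self.next_seq += 1
        self.pending[seq] = payload
        self.transport(seq, payload)
        return seq

    def on_ack(self, acked_seq):
        # Cumulative ack: everything up to and including acked_seq was received,
        # so a single acknowledgement can release many pending messages.
        while self.pending and next(iter(self.pending)) <= acked_seq:
            self.pending.popitem(last=False)

    def resend_pending(self):
        # Called on a timeout or after reconnecting: replay what is still unacknowledged.
        for seq, payload in self.pending.items():
            self.transport(seq, payload)
```

With a cumulative scheme like this the receiver does not have to
acknowledge every message individually, which reduces acknowledgement
traffic. The sketch assumes the receiver discards duplicates by sequence
number, since resend_pending may deliver a message more than once.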

Mirroring also causes a number of problems with merging replicas (though
on this topic I've found the "Optimistic replication" paper by Y. Saito
and M. Shapiro).

So, I wonder what "The Ideal System" of this type looks like, if we
assume that communication between servers is not reliable.

Hence, my question (I suspect Erlang developers are familiar with the
problems above) is where to find good books and papers on these topics:
1. Finding the optimal protocol to provide the best reliability within a
given performance bound.
2. Replication (maybe better literature exists than what I have found).
3. Performance bottlenecks, architectural problems and other
difficulties in distributed systems.
4. Personal experience of others who have built such systems.
5. Anything else you consider closely related to the task.

Thanks everyone in advance,
Nick
---
%1% Yes, I've read the "Four-fold Increase in Productivity and Quality"
%2% Yes, I've read Joe's thesis, but want more :-)