[erlang-questions] gen_server clustering strategy

Wed Feb 2 21:27:37 CET 2011

Thanks for all the great feedback!

To clarify some points:

 * We need to be able to send messages from every process to any other
process, so we could use some form of registry. Using the global
module would get us going, but not take us all the way (which is
something we can worry about later).

 * We have considered writing our own process registry, based on ets
or mnesia or redis or whatever to perfectly support our use case. We
are however interested in standing on the shoulders of giants,
especially if we can use a library like gproc, which has been
extensively tested and verified(at least without distribution).

 * We won't hit 10 million users anytime soon, however 2 million daily
users four months after marketing starts marketing is realistic. Also,
our system cannot break when doubling that number. This means we need
to be able to handle at least 5000 requests per second. The numbers
are from one of our other products, so we know pretty well what to
expect.

 * We would like to add nodes without disturbing the processes already
running, so consistent hashing must be used if hashing

Our worry with the cluster of identical nodes, the "mesh" described
earlier, is that there will be too many messages passing around. As we
add nodes, it will only get worse and we might start hitting the
limits of the Erlang distribution itself.

Our worry with the cluster that includes "proxy" nodes, is that they
must maintain the state of the cluster in order to route accordingly.
At least they must know where to send the requests using some form of
hashing. To this end, building on top of riak_core might be a very
good option. However, we feel that riak_core is pretty complex to
understand fully and would prefer a simpler solution.

As Matt suggested in his second post, we can make the system simpler
by not keeping a global registry of userid to process mapping, but
only keep a mapping of userid to the node the process is running on.
The node can then itself store the userid to process mapping using
gproc or something similar locally.

If riak is running on 60 nodes, then the question for us is how much
message passing is going on. Our concern is that when we reach 10 or
20 nodes, our architecture will just not cut it anymore as we will
increase the message passing.

Also, what kind of overhead can we expect if we monitor processes on
other nodes? Our understanding is that the information regarding pids
that die, will be sent to any node in the cluster anyway, so actually
monitoring and receiving the info won't make that much of a difference
as it is already there.

Thanks for all the great feedback!

Regards
Knut

On Wed, Feb 2, 2011 at 8:17 PM, Anthony Molinaro
<anthonym@REDACTED> wrote:
> Riak Core actually does a bunch of this stuff for you
>
> http://blog.basho.com/category/riak-core/
> https://github.com/basho/riak_core
>
> It will manage a partitioned ring sending reads and writes to some number
> of nodes, and is great a library.  We use it as a bridge between a thrift
> server and a custom storage module, and it allows us to scale out as
> necessary.
>
> Also according to riak FAQ, the largest cluster they've run is about 60
> nodes
>
> http://wiki.basho.com/FAQ.html#What-is-the-largest-cluster-currently-running?
>
> so that's at least one example of scaling into double digits.
>
> -Anthony
>
> On Wed, Feb 02, 2011 at 11:40:25AM -0500, Evans, Matthew wrote:
>> My worry is the OP is suggesting using the global name server. This has a few problems
>>
>> 1) That he will soon run out of the atom table space (I think he's making the assumption that gen_servers need to be registered, which is not the case).
>> 2) Sharing process state information between all nodes
>> 3) All all lookups need to go to a central gen_server (the global name-server) running on a single logical core.
>>
>> (I might be wrong about point 3).
>>
>> But actually a hashing function could be a good idea, at least to find the correct node. This could negate the need to share process availability information between host/node boundaries.
>>
>> One could maintain a list of current nodes in the mesh, and then use a hash to find the correct node. Identifier to pid still needs to be mapped, but this can be done by a local ets table (optimized and fast for lookups).
>>
>> For example:
>>
>> process_inbound_request(Message,Identifier) ->
>>     Nodes = nodes()++[node()],
>>     Id = erlang:phash2(Identifier, length(Nodes)),
>>     case lists:nth(Id+1,Nodes) of
>>         node() ->
>>              % On same host, lookup locally to find the correct instance
>>              dispatch_locally(Message.Identifier);
>>          Node ->
>>              dispatch_remotely(Message,Identifier)
>>     end.
>>
>> dispatch_locally(Message,Identifier) ->
>>     case ets:lookup(local_pids, Identifier) ->
>>         [{_,Pid}] ->
>>             gen_server:cast(Pid,{inbound,Message});
>>        _ ->
>>             ok;
>>      end.
>>
>> dispatch_remotely(Message,Identifier) ->
>>     gen_server:cast({local_proxy,Node},{request,Message,Identifier}).
>>
>> local_proxy is a locally registered gen_server on each node that implements the equivalent of the dispatch_locally function.
>>
>> Matt
>>
>> ________________________________________
>> From: James Churchman [jameschurchman@REDACTED]
>> Sent: Wednesday, February 02, 2011 10:37 AM
>> To: Evans, Matthew
>> Cc: Knut Nesheim; erlang-questions@REDACTED
>> Subject: Re: [erlang-questions] gen_server clustering strategy
>>
>> all great questions, most of which i don't know the answers to, but queuing systems can help if a bit of extra latency is ok, and more so if the order does not have to be exact
>>
>> i think at very large scale hashing becomes the only answer, and handling failed nodes becomes tricky, but a basic distributed mnesia table that contains the number of alive nodes and maybe reports on their repeated failures, and then hashes the input (on userid in your case) to the correct node would get you half way there.
>>
>> Also i don't think that 100,000 processes that peak at 300kb is a problem for erlang. Thats only 30GB max. If its stored as binaries, and possibly after a really large message you invoke the garbage collector for that process ( i have no idea if this is a bad idea or not, but can be done easily and should reduce memory consumption) then a single box should be able to handle your needs easily, and 3 no probs at all. My mac pro should be ok :-)
>>
>> i think dedicated hardware is always better than virtualised
>>
>> As for 1000 messages at 300kb, that seems like quite a lot, bordering on what standard gigabit ethernet can handle,  but just benchmark erlang, again one box might be enough... you can get a 32 core amd server for not much these days. Are these requirements realistic tho, or just "if 10 million people use my service on day one that i have put together on a shoe string etc... wishful thinking"
>>
>>
>> On 1 Feb 2011, at 19:10, Evans, Matthew wrote:
>>
>> > Without knowing much more another model would be to use your own ets/mnesia table to map the workers.
>> >
>> > I am assuming the creation of a worker is relatively infrequent (compared to the times you need to find a worker).
>> >
>> > When you create a worker use gen_server:start(?MODULE, [], []).
>> >
>> > This will return {ok,Pid}. You can save the Pid in an mnesia table along with whatever reference you need (doesn't need to be an atom). All nodes in the mesh will get the pid and the reference. Starting it as shown above (with an arity of 3)  means you don't need to use the global service and ensure the names are unique atoms for each server. You can even do start via some application that will spread the gen_servers over different nodes.
>> >
>> > When a request comes in it only need do a lookup to find the correct pid.
>> >
>> > I can't recall as to how bad the process crash messaging is. But what you could do is to monitor a gen_server locally, and monitor each node globally. When a node fails all other nodes can cleanup processes that were registered on that node.
>> >
>> > Matt
>> >
>> > ________________________________________
>> > From: erlang-questions@REDACTED [erlang-questions@REDACTED] On Behalf Of Knut Nesheim [knut.nesheim@REDACTED]
>> > Sent: Tuesday, February 01, 2011 12:25 PM
>> > To: erlang-questions@REDACTED
>> > Subject: [erlang-questions] gen_server clustering strategy
>> >
>> > Hello list,
>> >
>> > We are interested in understanding how the Erlang distribution
>> > mechanism behaves in a fully connected cluster. Our concern is that as
>> > we add nodes, the overhead will become a problem.
>> >
>> > Our application is a bunch of gen_servers containing state, which we
>> > need to run on multiple nodes due to memory usage. A request will come
>> > to our application and it needs to be handled by the correct
>> > gen_server. Every request includes a user id which we can use to map
>> > to the process.
>> >
>> > We have two specific use cases in mind.
>> >
>> > 1) Proxy to workers
>> >
>> > In this scenario we have a proxy (or multiple proxies) accepting
>> > requests at the edge of our cluster. The proxy will ask the correct
>> > server for a response. Either through using the 'global' module or
>> > gproc or something similar, the proxy node will keep track of the
>> > mapping of user ids to process and nodes. The proxy will call the node
>> > using the Erlang distribution protocol.
>> >
>> > 2) Mesh
>> >
>> > In this scenario a request can be handled by any node in the cluster.
>> > If the request cannot be handled by the local node, the correct node
>> > is called on instead. Every node in the cluster needs to keep track of
>> > which id belongs to which process.
>> >
>> >
>> > The numbers:
>> > * We must be able to run 100,000 processes, each may peak at 300kb of memory
>> > * We expect 5000 requests coming in to our application per second at
>> > peak, generating the same number of messages
>> > * Out of those 5000 messages, 1000 has a content that may peak at 300kb
>> > * All the other messages are well under 1kb
>> >
>> > Our concerns:
>> >
>> > * In the mesh, will the traffic between the nodes become a problem?
>> > Lets say we have 4 nodes, if the requests are evenly divided between
>> > the nodes the probability of hitting the correct node is 25%. With 100
>> > nodes this is 1%. As we add nodes, there will be more chatter. May
>> > this chatter "drown-out" the important messages, like pid down, etc.
>> >
>> > * Will it be more efficient to use the proxy to message every node?
>> > Will the message then always go to the node directly, or may it go
>> > through some other node in the cluster?
>> >
>> > * For the mesh, we need to keep the 'global' or gproc state
>> > replicated. With 'global' we will run into the default limit of atoms.
>> > If we were to increase this limit, what kind of memory usage could we
>> > expect? Are there any other nasty side-effects that we should be aware
>> > of?
>> >
>> > * In the "Erlang/OTP in Action" book it is stated that you may have a
>> > "couple of dozen, but probably not hundreds of nodes." Our
>> > understanding is that this is because of traffic like "this pid just
>> > died", "this node is no longer available", etc.
>> >
>> > * If we were to eliminate messaging between nodes, how many nodes
>> > could we actually run in a fully connected cluster? Will a
>> > high-latency network like ec2 affect this number or just the latency
>> > of messages? What else may affect the size?
>> >
>> >
>> > Regards
>> > Knut
>> >
>> > ________________________________________________________________
>> > erlang-questions (at) erlang.org mailing list.
>> > See http://www.erlang.org/faq.html
>> > To unsubscribe; mailto:erlang-questions-unsubscribe@REDACTED
>> >
>> >
>> > ________________________________________________________________
>> > erlang-questions (at) erlang.org mailing list.
>> > See http://www.erlang.org/faq.html
>> > To unsubscribe; mailto:erlang-questions-unsubscribe@REDACTED
>> >
>>
>>
>> ________________________________________________________________
>> erlang-questions (at) erlang.org mailing list.
>> See http://www.erlang.org/faq.html
>> To unsubscribe; mailto:erlang-questions-unsubscribe@REDACTED
>>
>
> --
> ------------------------------------------------------------------------
> Anthony Molinaro                           <anthonym@REDACTED>
>
> ________________________________________________________________
> erlang-questions (at) erlang.org mailing list.
> See http://www.erlang.org/faq.html
> To unsubscribe; mailto:erlang-questions-unsubscribe@REDACTED
>
>