clusters

Hakan Mattsson <>
Mon Oct 31 11:43:00 CET 2005


Take a look at the "linear hashing" algorithm and its
relative "linear hashing star". With linear hashing you
may add nodes smoothly without rebuilding the entire
database.

We are using that for fragmented tables in Mnesia:

  http://erlang.se/doc/doc-5.4.8/lib/mnesia-4.2.2/doc/html/part_frame.html

When you add a new fragment you only need to split one
of the old fragments, regardless of the total number of
existing fragments.

If you plan to outgrow Mnesia, it might be a good idea
to customize the hash algorithm for fragmented tables
with one that fits your needs:

  http://erlang.se/doc/doc-5.4.8/lib/mnesia-4.2.2/doc/html/application_frame.html

I don't know how the "chord" algorithm works, but if it
can handle addition of new nodes smoothly, it might be a
good candidate for your customized mnesia_frag_hash
implementation.

/Håkan

On Fri, 28 Oct 2005, Renyi Xiong wrote:

RX> Date: Fri, 28 Oct 2005 15:37:53 -0700
RX> From: Renyi Xiong <>
RX> To: 
RX> Cc: , 
RX> Subject: RE: clusters
RX> 
RX> Hello Joe,
RX> 
RX> If I understand correctly, we need to rebuild the whole mnesia
RX> database each time we add a new node pair. Cause the hash key
RX> is dependant on the number of nodes. Is that right?
RX> 
RX> Renyi.
RX> 
RX> 
RX> > From: "Joe Armstrong (AL/EAB)" <>
RX> > To: "Renyi Xiong" <>
RX> > CC: <>
RX> > Subject: RE: clusters
RX> > Date: Mon, 24 Oct 2005 09:59:22 +0200
RX> > 
RX> > Hello Renyi,
RX> > 
RX> > Interesting question - I'll give a short answer (actually why not post
RX> > this to the
RX> > Erlang list - (to join the list follow the instruction in
RX> > http://www.erlang.org/faq.html)
RX> > 
RX> > I've no idea what the windows 2003 clusting service is :-)
RX> > 
RX> > Firstly - let E = # exposed servers. I = # internal servers U = # users
RX> > 
RX> > questions
RX> > 
RX> > 	- is E + I large
RX> > 	- is U  very large (ie outside the mnesia adress space?)
RX> > 	- how many U's/machine do you allocate
RX> > 
RX> > IMHO you can get a long way with a pool of PC's - assume a transaction
RX> > takes
RX> > 50 ms. CPU - then you can do 1,7 M transactions/day. So if we have 1.7 M
RX> > users
RX> > doing one transaction/day then if each needs (say) 10KB data you'd need
RX> > 17G of data.
RX> > 
RX> > ie a low-end PC (1 Gmemory, 2GHz processor, 80 G disk) could easly handle
RX> > (say) 1.5M users
RX> > 
RX> > Now you need at least TWO PC's (fault-tolerence)
RX> > 
RX> > So if you make them in pairs each pair can handle 1.5M users - use a
RX> > replictaed mnesia
RX> > disk/ram table.
RX> > 
RX> > Now you want to scale up ...
RX> > 
RX> > Easy.
RX> > 
RX> > The unit of scaling is the pair I have just described.
RX> > 
RX> > Call these pairs P1, P2, P3, ..... In each pair the machine with the
RX> > lowest IP is the
RX> > primary - the other is the take-over machine.
RX> > 
RX> > Assume a user makes a HTTP request to the primary in ANY pair - all you
RX> > now need to
RX> > do is figure out which of the Pairs P1 .. Pn is "the correct machine" (ie
RX> > the one that stores their data) - then send them an HTTP re-direct to the
RX> > correct machine.
RX> > 
RX> > If the address space is small you can just use a ram-replicated mnesia
RX> > table for the
RX> > redicrection table.
RX> > 
RX> > If it is very large use consistent hashing. Call the IP address of the
RX> > primaries in
RX> > in the pairs Ip1, Ip2, ... Ipn. Assume the user Key is K.
RX> > 
RX> > Compute hash values of Ip1, Ip2, ... K using some hash algorithm. Say
RX> > md5(X) mod 2^32
RX> > 
RX> > Call theses IpH1, IpH2, .... KH - now the data corresponding to key K is
RX> > found on the
RX> > machine with hash IpHk where k is the smallest value in IpHk such that
RX> > IpHk > KH
RX> > 
RX> > (look up the "chord" algorithm for details)
RX> > 
RX> > - here's what I'd do
RX> > 
RX> > Phase A
RX> > 	- build a basic pair of processors (as I have described)
RX> > 	- deploy it (it will take some time to get millions of customers)
RX> > 
RX> > Phase B
RX> > 	- when you get more customers build more pairs
RX> > 	- user mnesia and a ram replicated dispatch table
RX> > 
RX> > Phase C
RX> > 	- when you get outside the addressing limits of mnesia (G users)
RX> > 	- make a layer with consistent hashing to replace the mnesia
RX> > replicated table
RX> > 
RX> > I hope you make it to C
RX> > 
RX> > /Joe
RX> > 
RX> > 
RX> > > -----Original Message-----
RX> > > From: Renyi Xiong [mailto:]
RX> > > Sent: den 22 oktober 2005 04:54
RX> > > To: Joe Armstrong (AL/EAB)
RX> > > Cc: 
RX> > > Subject:
RX> > >
RX> > >
RX> > > Hello Joe,
RX> > >
RX> > > I'm a programmer working for Brian. I have a question for you
RX> > > in terms of
RX> > > concurrent programming.
RX> > >
RX> > > On client side, customers only see fixed number of servers
RX> > > based on IP
RX> > > addresses. My understanding is these exposed servers are
RX> > > listening for
RX> > > client requests, dispatching transactions to internal
RX> > > variable number of
RX> > > ERLANG servers, collecting replies and forwarding them to clients.
RX> > >
RX> > > So one of our jobs here is to write an ERLANG program to
RX> > > implement a kind of
RX> > > clustering service or ERLANG already has such kind of server
RX> > > included?(like
RX> > > WIndows 2003 clustering service?)
RX> > >
RX> > > Thanks,
RX> > > Renyi.


More information about the erlang-questions mailing list