[erlang-questions] Erlang suitability

Sat May 19 09:24:17 CEST 2012

"As mentioned earlier in this thread, 75 servers is a bit much, but
people have done it before."

What I haven't seen mentioned so far: depending on your application,
reimplementing it in Erlang might mean that you no longer need nearly
as many as 75 servers.

-michael turner

On Fri, May 18, 2012 at 9:02 PM, Fred Hebert <mononcqc@REDACTED> wrote:
> Answers inline.
>
>
> On 12-05-18 5:00 AM, Ovid wrote:
>
> Hi there,
>
> We've a system that runs across 75 servers and needs to be highly performant,
> fault-tolerant, scalable and shares persistent data across all 75 servers.
> We're investigating Erlang/Mnesia (which we don't know) because it sounds
> tailor-made for our situation.
>
> As mentioned earlier in this thread, 75 servers is a bit much, but people
> have done it before.
>
>
> We are not using Erlang for our first implementation, but are instead
> hacking together a solution from known technologies including Perl, MySQL
> and Redis. We're considering Erlang for our future work.
>
> We have two primary needs: Each box can bid on an auction and potentially
> spend a tiny amount of money and each of the 75 boxes will receive
> notifications of a small amount of money spent if they win the auction (the
> auction notification will probably not be sent to the box bidding in the
> auction).
>
> Use case 1: If the *total* of all of those small amounts exceeds a daily cap
> or an all-time cap, all 75 boxes must immediately stop bidding in
> auctions. It seems that each box can run a separate Erlang process and write
> out "winning bid" information to an Mnesia database and all boxes can read
> the total amount spent from that to determine if it should stop bidding.
>
> This seems trivial to set up.
>
> It isn't trivial. You have to think about what happens when a box is seen as
> crashing. How strongly consistent do you want things to be? There is always
> a risk that a box didn't crash, but was cut off in a netsplit. You might get
> divergences in budget that will be hard to explain.
>
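To make the netsplit risk concrete, here is a minimal worst-case sketch in Python. It assumes each side of the split keeps bidding against the last total it saw plus its own wins only. The function name `spend_during_split` and all numbers are hypothetical and not taken from the thread.

```python
def spend_during_split(cap, spent_before, partitions, bid_cost):
    """Worst-case combined spend when the cluster splits into `partitions` groups.

    Assumption: every group sees the pre-split total plus only its own
    wins, and every group wins enough auctions to use up what it
    believes is left of the budget.
    """
    remaining = cap - spent_before
    # Each side independently believes it may still spend this much.
    per_partition = (remaining // bid_cost) * bid_cost
    return spent_before + partitions * per_partition


# Cap of 1000, 600 already spent, cluster splits in two:
print(spend_during_split(1000, 600, 2, 1))  # 1400
```

Each side's local view never exceeds the cap, yet the merged total does. That gap is the kind of budget divergence that is hard to explain after the fact.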
> There is also a definite timing issue depending on how your data is being
> observed. For example, you ask permission to bid on an item, but you do not
> get instant feedback; by the time you have sent maybe 5-10 bids, the cap is
> finally reached and broken at once because the delay to the other network
> made you keep on bidding without a final result. How much tolerance do you
> have for this?
>
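The timing issue can be sketched with a small single-bidder simulation in Python. It assumes every bid wins, costs a fixed positive amount, and is only reported back `delay` bids later. The `reserve_in_flight` option, which counts unconfirmed bids against the cap, is one possible mitigation added here for comparison and is not something proposed in the thread. All names are hypothetical.

```python
from collections import deque


def simulate(cap, bid_cost, delay, reserve_in_flight):
    """Return the amount actually spent before the bidder stops.

    Assumptions: every bid wins, `bid_cost` is positive, and a win
    notification arrives only after `delay` further bids were sent.
    """
    confirmed = 0      # spend the bidder has been notified about
    pending = deque()  # costs of bids whose result has not arrived yet
    spent = 0          # ground truth, not visible to the bidder

    while True:
        if len(pending) > delay:
            # The oldest notification finally arrives.
            confirmed += pending.popleft()
        known = confirmed + (sum(pending) if reserve_in_flight else 0)
        if known + bid_cost > cap:
            return spent
        pending.append(bid_cost)
        spent += bid_cost


print(simulate(100, 1, 7, reserve_in_flight=False))  # 107
print(simulate(100, 1, 7, reserve_in_flight=True))   # 100
```

In this model the overshoot equals the number of in-flight bids times the bid cost, which gives a rough way to put a number on the tolerance question above. With 75 boxes, each box has its own in-flight bids, so the bound grows accordingly.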
> You mentioned in another post that "We need to ensure that were all 75 boxes
> to mysteriously crash, we could bring them back up and not worry about data
> integrity." Possibly, but what about 1 node only? What about 5? What about
> 30 or 35? What if they crash and you miss winning bids because you went
> down after bidding but before getting your notifications back (if that is
> possible by the bidding rules of whatever exchange you're dealing with)?
>
> The most solid synchronous database setup might not give you the guarantees
> you expect in the first place.
>
>
> Use case 2: we periodically need to reauthenticate to the auction system. We
> MUST NOT have all 75 boxes trying to reauthenticate at the same time because
> we will be locked out of the system if we do this. Having a central box
> handling reauthentication is a single point of failure that we would like to
> avoid, but we don't know what design pattern Erlang would use to ensure that
> only one of the 75 Erlang instances would attempt to reauthenticate at any
> one time (all 75 boxes can share the same authentication token).
>
> That depends on: 1. how many times you can try to re-authenticate before
> being blocked, 2. how close together they have to be.
>
> Central points of failure are definitely something to avoid. Leader
> election across 75 boxes might not be the funnest thing in the world either.
> I could see a scheme where you use some distributed cached value that can
> say "I am currently being logged in" that can time out at some point, visible
> to all readers. When you read that timeout value from each box (possibly
> from an OTP Application that only handles auth), each reading of that value
> adds or subtracts a random number to the timeout. This is to try and avoid a
> cluster-wide synchronization on the timeout value, and instead have them
> happen at different times. You could add an "I'm updating" flag related to
> that value and that could give you good probabilities that only a fraction
> of all the nodes attempt an authentication at any point in time close to the
> timeout value.
>
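A rough Python sketch of the jittered-timeout idea follows. It assumes the "I'm updating" flag set by the first node becomes visible to the other nodes only after a fixed `propagation` delay, so any node waking inside that window also tries to authenticate. The function names, the uniform jitter, and the propagation model are illustrative assumptions, not an actual Erlang or Mnesia API.

```python
import random


def jittered_wake_times(expiry, jitter, nodes, rng):
    """Each node adds or subtracts a random offset to the shared timeout."""
    return [expiry + rng.uniform(-jitter, jitter) for _ in range(nodes)]


def count_auth_attempts(wake_times, propagation):
    """Count the nodes that start a re-authentication.

    The earliest node sets the "I'm updating" flag. Nodes that wake
    before the flag has propagated cannot see it and attempt as well.
    """
    if not wake_times:
        return 0
    first = min(wake_times)
    return sum(1 for t in wake_times
               if t == first or t - first < propagation)


rng = random.Random(42)
no_jitter = jittered_wake_times(3600.0, 0.0, 75, rng)
with_jitter = jittered_wake_times(3600.0, 60.0, 75, rng)
print(count_auth_attempts(no_jitter, 0.5))    # all 75 wake at once
print(count_auth_attempts(with_jitter, 0.5))  # only nodes close to the earliest
```

Without jitter every node hits the timeout at the same instant and all of them attempt. With jitter, only the nodes that happen to wake within the propagation window of the earliest one do. How small that fraction needs to be depends on how many attempts the auction system tolerates.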
> Again, this would depend on how often your authentication needs to be done,
> and at what frequency you're allowed to do it.
>
> If it's too tight, you might need a central server or node that takes care
> of it, with one or two fail-overs to add some reliability.
>
> Note you will still have to care about netsplits ruining your day with this
> whole scheme.
>
> -- I had nothing to add on the rest of the mail, so I cut it off.
>
> Hope this helps,
> Fred.
>