[erlang-questions] My frustration with Erlang
Bengt Kleberg
bengt.kleberg@REDACTED
Mon Sep 15 08:17:20 CEST 2008
Greetings,
Sometimes it is possible to ask for new cards.
bengt
On Fri, 2008-09-12 at 14:11 -0400, john s wolter wrote:
> Joel,
>
>
> Like any technical project you are dealt a hand and you have to play it.
>
> On Fri, Sep 12, 2008 at 9:52 AM, Joel Reymont <joelr1@REDACTED> wrote:
> I sell a poker server written in Erlang. It's supposed to be
> super-robust and super-scalable. I'm about to move to the next level
> by adding the missing features, e.g. tournaments and a Flash client.
>
> I appreciate everything that the Erlang/OTP team is doing but I
> thought I would vent a few of my recent frustrations with Erlang. I'm
> in a good mood after spending a day with OCaml and I have calmed down.
> Still, prepare yourself for a long rant ahead!
>
> My development workstation is a Mac Pro 2x2.8GHz Quad Xeon, 12GB of
> memory, one 250GB and two more drives 500GB each, all 7200RPM SATA. I
> use R12B3, SMP and kernel poll, i.e.
>
> Erlang (BEAM) emulator version 5.6.3 [source] [64-bit] [smp:8] [async-threads:0] [kernel-poll:true]
>
> My overwhelming frustration is the opacity of a running Erlang system.
> There are no decent tools for peering inside. No usable ones
> whatsoever!
>
> With any other language you can profile, make changes, evaluate
> performance and make a judgement but not with Erlang.
>
> I first wrote OpenPoker using OTP everywhere. My players, games, pots,
> limits, hands, decks, etc. were all gen_server processes. I used
> Mnesia transactions everywhere and I used them often.
>
> Then I discovered that I cannot scale past 2-3k concurrent players
> under heavy use.
>
> I have a test harness that launches bots which connect to the server
> and play by the script. The bots don't wait before replying to bet
> requests and so launching a few thousand bots heavily loads the server.
>
> I don't want just a few thousand concurrent bots, though! I want at
> least 10k on a single VM and hundreds of thousands on a cluster, so I
> set out to optimize my poker server.
>
> The Erlang Efficiency Guide recommends fprof as the tool. I ran fprof
> on my test harness and discovered that the result set cannot be
> processed in my 12GB of memory. I made this discovery after leaving
> fprof running for a couple of days and realized this because the fprof
> data files were approaching 100GB and my machine became unusable due
> to heavy swapping.
>
> fprof uses ets tables to analyze the trace results and ets tables
> must fit in memory.
>
> I shortened my test run and was able to see the output of the fprof
> trace analysis. To say that it's dense would be an understatement! I
> realize that dumping out tuples is easy but aren't computers supposed
> to help us humans?
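
One way to keep an fprof run small enough to analyse is to trace only a few processes for a fixed time window. The following is a minimal sketch, not the setup used in the quoted message; the module name `fprof_window`, the file names and the `run/2` function are made up for illustration. It relies only on the standard `fprof:trace/1`, `fprof:profile/1` and `fprof:analyse/1` calls.

```erlang
-module(fprof_window).
-export([run/2]).

%% Trace only the processes in Pids for Millis milliseconds, then
%% profile the trace file and write a report sorted by own time.
%% File names are arbitrary choices for this sketch.
run(Pids, Millis) when is_list(Pids), is_integer(Millis), Millis >= 0 ->
    TraceFile = "fprof_window.trace",
    ok = fprof:trace([start, {file, TraceFile}, {procs, Pids}]),
    timer:sleep(Millis),
    ok = fprof:trace(stop),
    %% The profile step builds its tables in memory, so a short
    %% window over few processes keeps this step manageable.
    ok = fprof:profile({file, TraceFile}),
    ok = fprof:analyse([{dest, "fprof_window.analysis"}, {sort, own}]).
```

For example, `fprof_window:run([GamePid], 5000)` would trace a single (hypothetical) game process for five seconds and leave the report in `fprof_window.analysis`.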
>
> The final output from fprof is still too raw for me to analyze.
> There's absolutely, positively, definitely no way to get a picture of
> a running system by reading through it. I understand that I can infer
> from the analysis that certain functions take a lot of time but what
> if there are none?
>
> The bulk of the time in my system was taken by various OTP functions
> and processes, Mnesia and unknown functions. All I could infer from it
> is that perhaps I have too many processes.
>
> Another thing that I inferred is that the normal method of writing
> gen_server code doesn't work for profiling.
>
> I had to rewrite the gen_server clauses to immediately dispatch to
> functions, e.g.
>
>     handle_cast('LOGOUT', Data) ->
>         handle_cast_logout(Data);
>
>     handle_cast('DISCONNECT', Data) ->
>         handle_cast_disconnect(Data);
>
> otherwise all the clauses of a gen_server are squashed together,
> regardless of the message pattern. I don't know if there's a better
> way to tackle this.
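
The dispatch pattern quoted above can be sketched as a complete module. This is an illustration only: the module name `player_sketch`, the `events/1` call and the list-of-atoms state are assumptions made so the example can run on its own, and they are not taken from OpenPoker.

```erlang
-module(player_sketch).
-behaviour(gen_server).

-export([start_link/0, logout/1, disconnect/1, events/1]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() -> gen_server:start_link(?MODULE, [], []).

logout(Pid) -> gen_server:cast(Pid, 'LOGOUT').
disconnect(Pid) -> gen_server:cast(Pid, 'DISCONNECT').

%% Returns the handled messages, oldest first (illustration only).
events(Pid) -> gen_server:call(Pid, events).

init([]) -> {ok, []}.

handle_call(events, _From, Events) ->
    {reply, lists:reverse(Events), Events}.

%% Each clause does nothing but dispatch, so a call-based profiler
%% reports one separately named function per message type.
handle_cast('LOGOUT', Data) ->
    handle_cast_logout(Data);
handle_cast('DISCONNECT', Data) ->
    handle_cast_disconnect(Data).

handle_cast_logout(Data) ->
    {noreply, [logout | Data]}.

handle_cast_disconnect(Data) ->
    {noreply, [disconnect | Data]}.
```

With this layout the profile output lists `handle_cast_logout/1` and `handle_cast_disconnect/1` as separate entries instead of a single `handle_cast/2`.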
>
> Next, I rewrote most of my gen_servers as data structures, e.g. pot,
> limit, deck, etc. A deck of cards can take a message to draw a card
> but the message can just as well be a function call. The deck
> structure will need to be modified regardless and the tuple will be
> duplicated anyway. There didn't seem to be any advantage in using a
> process here, much less a gen_server.
>
> Next I carefully went through my Mnesia schema and split some tables
> into smaller tables. I made sure that only the absolutely necessary
> tables were disk-based. I wish I could run without updating Mnesia
> tables during a game but this is impossible since player balances and
> status need to be updated when players join or leave a game, as well
> as when a game finishes.
>
> All my hard work paid off and I was able to get close to 10K players,
> with kernel poll enabled, of course. Then I ran out of ETS tables.
>
> I don't create ETS tables on the fly but, apparently, Mnesia does. For
> every transaction!!!
>
> This prompted me to go through the server again and use dirty_read,
> dirty_write wherever possible. I also placed balances in two separate
> "counter" tables, integers to be divided by 10000 to get 4 decimal
> places of precision. This is so that I could use dirty_update_counter
> instead of a regular read, bump, write pattern.
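
The scaled-integer counter idea can be sketched as follows. This is a simplified illustration under stated assumptions: it uses a single RAM-only table named `balance` rather than the two tables mentioned in the message, and the module name `balance_sketch` and its function names are hypothetical. `mnesia:dirty_update_counter/3` is the standard Mnesia call.

```erlang
-module(balance_sketch).
-export([setup/0, adjust/2, to_units/1]).

%% Balances are stored as integers scaled by 10000,
%% i.e. 4 decimal places of precision.
-define(SCALE, 10000).

%% Start Mnesia and create a RAM-only counter table.
%% Call once; the table name and attributes are assumptions.
setup() ->
    ok = mnesia:start(),
    {atomic, ok} = mnesia:create_table(balance,
        [{ram_copies, [node()]}, {attributes, [player, amount]}]),
    ok.

%% Add a scaled integer amount (may be negative) to a player's
%% balance in one dirty operation and return the new scaled value.
%% No transaction and no separate read/write steps are involved.
adjust(Player, ScaledAmount) when is_integer(ScaledAmount) ->
    mnesia:dirty_update_counter(balance, Player, ScaledAmount).

%% Convert a scaled integer back to currency units for display.
to_units(Scaled) when is_integer(Scaled) ->
    Scaled / ?SCALE.
```

For example, after `setup()`, `adjust(joel, 125000)` credits 12.5 units and `adjust(joel, -5000)` then debits 0.5, leaving a stored value of 120000.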
>
> My frustration kept increasing but I gained more concurrent players. I
> can now safely run up to 8K bots before timeouts start to appear.
>
> These are gen_server call timeouts when requests for game information
> take longer than the default 5 seconds. I have an average of 5 players
> per game so this is not because a large number of processes are trying
> to access the game.
>
> I suppose this is a reflection of the load on the system, although CPU
> usage never goes past 300% which tells me that no more than 3 cores
> are used by Erlang.
>
> The straw that broke my back was when stopping a bot's matching player
> gen_server by returning {stop, ... } started causing my observer
> process to receive tcp_close and exit. I could repeat this like
> clockwork. Only spawning a separate process to send the player a stop
> message would fix this.
>
> Then I changed the way I represent cards and started seeing this
> behavior again, in just one of my tests. What do cards have to do with
> tcp_close? I don't know and dbg tracer is my best friend! What I know
> is what git tells me and git says cards were the only difference.
>
> Anyway, I don't think I have fully recovered yet. I may need a weekend
> just to regain my sanity. I will try to spread the load among several
> VMs but my hunch is that my distributed 100k players target is far,
> far away. I may have to keep flying blind, with only traces and
> printouts to my rescue.
>
> Thanks for listening, Joel
>
> --
> wagerlabs.com
>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://www.erlang.org/mailman/listinfo/erlang-questions
>
>
>
> --
> John S. Wolter President
> Wolter Works
> Mailto:johnswolter@REDACTED
> Desk 1-734-665-1263
> Cell: 1-734-904-8433
>