[erlang-questions] My frustration with Erlang

Mon Sep 15 08:17:20 CEST 2008

Greetings,

Sometimes it is possible to ask for new cards.

bengt

On Fri, 2008-09-12 at 14:11 -0400, john s wolter wrote:
> Joel,
> 
> 
> Like any technical project, you are dealt a hand and you have to play it.
> 
> On Fri, Sep 12, 2008 at 9:52 AM, Joel Reymont <joelr1@REDACTED> wrote:
>         I sell a poker server written in Erlang. It's supposed to be
>         super-robust and super-scalable. I'm about to move to the next
>         level by adding the missing features, e.g. tournaments and a
>         Flash client.
>         
>         I appreciate everything that the Erlang/OTP team is doing, but I
>         thought I would vent a few of my recent frustrations with
>         Erlang. I'm in a good mood after spending a day with OCaml and I
>         have calmed down. Still, prepare yourself for a long rant ahead!
>         
>         My development workstation is a Mac Pro 2x2.8GHz Quad Xeon with
>         12GB of memory, one 250GB drive and two 500GB drives, all
>         7200RPM SATA. I use R12B3, SMP and kernel poll, i.e.
>         
>         Erlang (BEAM) emulator version 5.6.3 [source] [64-bit] [smp:8]
>         [async-threads:0] [kernel-poll:true]
>         
>         My overwhelming frustration is the opacity of a running Erlang
>         system. There are no decent tools for peering inside. No usable
>         ones whatsoever!
>         
>         With any other language you can profile, make changes, evaluate
>         performance and make a judgement, but not with Erlang.
>         
>         I first wrote OpenPoker using OTP everywhere. My players, games,
>         pots, limits, hands, decks, etc. were all gen_server processes.
>         I used Mnesia transactions everywhere and I used them often.
>         
>         Then I discovered that I cannot scale past 2-3k concurrent
>         players under heavy use.
>         
>         I have a test harness that launches bots which connect to the
>         server and play by the script. The bots don't wait before
>         replying to bet requests, so launching a few thousand bots
>         heavily loads the server.
>         
>         I don't want just a few thousand concurrent bots, though! I want
>         at least 10k on a single VM and hundreds of thousands on a
>         cluster, so I set out to optimize my poker server.
>         
>         The Erlang Efficiency Guide recommends fprof as the tool. I ran
>         fprof on my test harness and discovered that the result set
>         cannot be processed in my 12GB of memory. I made this discovery
>         after leaving fprof running for a couple of days, when the fprof
>         data files were approaching 100GB and my machine became unusable
>         due to heavy swapping.
>         
>         fprof uses ets tables to analyze the trace results and ets
>         tables must fit in memory.
>         
>         I shortened my test run and was able to see the output of the
>         fprof trace analysis. To say that it's dense would be an
>         understatement! I realize that dumping out tuples is easy but
>         aren't computers supposed to help us humans?
>         
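
On keeping the fprof run short enough to fit in memory: a minimal
sketch, assuming it is acceptable to trace one process for a bounded
window instead of the whole harness. The module name, function name and
file prefix are made up for illustration; the fprof calls themselves
(trace, profile, analyse) are the standard ones from the tools
application.

```erlang
-module(fprof_sketch).
-export([profile_pid/3]).

%% Trace a single process for Millis milliseconds, then write an
%% analysis sorted by each function's own time.
%% Pid, Millis and Prefix are hypothetical parameters of this sketch.
profile_pid(Pid, Millis, Prefix) ->
    TraceFile = Prefix ++ ".trace",
    _ = fprof:start(),                    % fine if already started
    ok = fprof:trace([start, {file, TraceFile}, {procs, [Pid]}]),
    timer:sleep(Millis),                  % keep the window short
    ok = fprof:trace(stop),
    ok = fprof:profile([{file, TraceFile}]),
    ok = fprof:analyse([{dest, Prefix ++ ".analysis"},
                        {sort, own},
                        {totals, true}]).
```

Something like fprof_sketch:profile_pid(GamePid, 1000, "game") would
then leave game.trace and game.analysis in the current directory. This
only bounds the size of the trace; the analysis file is still plain
Erlang terms.
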
>         The final output from fprof is still too raw for me to analyze.
>         There's absolutely, positively, definitely no way to get a
>         picture of a running system by reading through it. I understand
>         that I can infer from the analysis that certain functions take a
>         lot of time, but what if there are none?
>         
>         The bulk of the time in my system was taken by various OTP
>         functions and processes, Mnesia and unknown functions. All I
>         could infer from it is that perhaps I have too many processes.
>         
>         Another thing that I inferred is that the normal method of
>         writing gen_server code doesn't work for profiling.
>         
>         I had to rewrite the gen_server clauses to immediately dispatch
>         to functions, e.g.
>         
>         handle_cast('LOGOUT', Data) ->
>             handle_cast_logout(Data);
>         
>         handle_cast('DISCONNECT', Data) ->
>             handle_cast_disconnect(Data);
>         
>         otherwise all the clauses of a gen_server are squashed together
>         in the profile output, regardless of the message pattern. I
>         don't know if there's a better way to tackle this.
>         
>         Next, I rewrote most of my gen_servers as data structures, e.g.
>         pot, limit, deck, etc. A deck of cards can take a message to
>         draw a card but the message can just as well be a function call.
>         The deck structure will need to be modified regardless and the
>         tuple will be duplicated anyway. There didn't seem to be any
>         advantage in using a process here, much less a gen_server.
>         
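
For readers following along, the process-to-data-structure change
described above might look roughly like this. It is only a sketch: the
module name and the {Rank, Suit} card representation are assumptions,
not OpenPoker's actual code, and shuffling is left out.

```erlang
-module(deck_sketch).
-export([new/0, draw/1]).

%% A deck is a plain list of {Rank, Suit} tuples; no process involved.
%% Ranks 11..14 stand for jack, queen, king and ace (an assumption).
new() ->
    [{Rank, Suit} || Suit <- [clubs, diamonds, hearts, spades],
                     Rank <- lists:seq(2, 14)].

%% Drawing is a function call that returns the card together with the
%% remaining deck, instead of a message to a gen_server.
draw([Card | Rest]) -> {ok, Card, Rest};
draw([])            -> {error, empty}.
```

The caller threads the returned deck through its own state, so the game
process owns the deck and no extra message round trip is needed.
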
>         Next I carefully went through my Mnesia schema and split some
>         tables into smaller tables. I made sure that only the absolutely
>         necessary tables were disk-based. I wish I could run without
>         updating Mnesia tables during a game but this is impossible
>         since player balances and status need to be updated when players
>         join or leave a game, as well as when a game finishes.
>         
>         All my hard work paid off and I was able to get close to 10K
>         players, with kernel poll enabled, of course. Then I ran out of
>         ETS tables.
>         
>         I don't create ETS tables on the fly but, apparently, Mnesia
>         does. For every transaction!!!
>         
>         This prompted me to go through the server again and use
>         dirty_read and dirty_write wherever possible. I also placed
>         balances in two separate "counter" tables, stored as integers to
>         be divided by 10000 to get 4 decimal places of precision. This
>         is so that I could use dirty_update_counter instead of a regular
>         read, bump, write pattern.
>         
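
The counter-table idea can be sketched as follows. Everything named
here (module, table, record fields, the RAM-only storage) is an
assumption for illustration; only the scale factor of 10000 and the use
of mnesia:dirty_update_counter/3 come from the post.

```erlang
-module(balance_sketch).
-export([setup/0, to_units/1, to_amount/1, credit/2, balance/1]).

-define(SCALE, 10000).            % 4 decimal places, as in the post
-record(balance, {player, units = 0}).

%% RAM-only table, purely for illustration.
setup() ->
    ok = mnesia:start(),
    {atomic, ok} = mnesia:create_table(balance,
        [{ram_copies, [node()]},
         {attributes, record_info(fields, balance)}]),
    ok.

%% Convert between currency amounts and scaled integer units.
to_units(Amount) -> round(Amount * ?SCALE).
to_amount(Units) -> Units / ?SCALE.

%% One atomic bump instead of read, modify, write.
%% Returns the new balance in scaled units.
credit(Player, Amount) ->
    mnesia:dirty_update_counter(balance, Player, to_units(Amount)).

%% Current balance in scaled units, 0 if the player has no record.
balance(Player) ->
    case mnesia:dirty_read(balance, Player) of
        [#balance{units = Units}] -> Units;
        []                        -> 0
    end.
```

One caveat worth checking in the Mnesia documentation before relying on
this for debits: how dirty_update_counter treats negative increments
that would take a counter below zero.
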
>         My frustration kept increasing but I gained more concurrent
>         players. I can now safely run up to 8K bots before timeouts
>         start to appear.
>         
>         These are gen_server call timeouts when requests for game
>         information take longer than the default 5 seconds. I have an
>         average of 5 players per game so this is not because a large
>         number of processes are trying to access the game.
>         
>         I suppose this is a reflection of the load on the system,
>         although CPU usage never goes past 300%, which tells me that no
>         more than 3 cores are used by Erlang.
>         
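
When call timeouts show up while CPU usage stays around 300% on an
8-scheduler VM, one cheap thing to look at is whether a few processes
are sitting on long message queues. A rough sketch, using only
erlang:processes/0, erlang:process_info/2 and erlang:system_info/1; the
module and function names are made up:

```erlang
-module(queue_sketch).
-export([top_queues/1, schedulers/0]).

%% Return up to N {QueueLength, Pid} pairs, longest queue first.
%% Processes that exit while we iterate make process_info/2 return
%% undefined; the pattern in the generator silently skips them.
top_queues(N) ->
    Lens = [{Len, Pid}
            || Pid <- erlang:processes(),
               {message_queue_len, Len} <-
                   [erlang:process_info(Pid, message_queue_len)]],
    lists:sublist(lists:reverse(lists:sort(Lens)), N).

%% Number of scheduler threads the VM was started with.
schedulers() ->
    erlang:system_info(schedulers).
```

Calling queue_sketch:top_queues(10) from the shell while the bots are
running would show whether the timeouts coincide with a single
backed-up process. That would point at serialization rather than raw
load, though it does not prove it.
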
>         The straw that broke my back was when stopping a bot's matching
>         player gen_server by returning {stop, ... } started causing my
>         observer process to receive tcp_close and exit. I could repeat
>         this like clockwork. Only spawning a separate process to send
>         the player a stop message would fix this.
>         
>         Then I changed the way I represent cards and started seeing this
>         behavior again, in just one of my tests. What do cards have to
>         do with tcp_close? I don't know and dbg tracer is my best
>         friend! What I know is what git tells me and git says cards were
>         the only difference.
>         
>         Anyway, I don't think I have fully recovered yet. I may need a
>         weekend just to regain my sanity. I will try to spread the load
>         among several VMs but my hunch is that my distributed 100k
>         players target is far, far away. I may have to keep flying
>         blind, with only traces and printouts to my rescue.
>         
>                Thanks for listening, Joel
>         
>         --
>         wagerlabs.com
>         
>         _______________________________________________
>         erlang-questions mailing list
>         erlang-questions@REDACTED
>         http://www.erlang.org/mailman/listinfo/erlang-questions
> 
> 
> 
> -- 
> John S. Wolter President
> Wolter Works
> Mailto:johnswolter@REDACTED
> Desk 1-734-665-1263
> Cell: 1-734-904-8433
> 