[erlang-questions] My frustration with Erlang
Joel Reymont
joelr1@REDACTED
Fri Sep 12 15:52:07 CEST 2008
I sell a poker server written in Erlang. It's supposed to be super-
robust and super-scalable. I'm about to move to the next level by
adding the missing features, e.g. tournaments and a Flash client.
I appreciate everything that the Erlang/OTP team is doing but I thought I
would vent a few of my recent frustrations with Erlang. I'm in a good
mood after spending a day with OCaml and I have calmed down. Still,
prepare yourself for a long rant ahead!
My development workstation is a Mac Pro 2x2.8GHz Quad Xeon, 12GB of
memory, one 250GB and two more drives 500GB each, all 7200RPM SATA. I
use R12B3, SMP and kernel poll, i.e.
Erlang (BEAM) emulator version 5.6.3 [source] [64-bit] [smp:8] [async-threads:0] [kernel-poll:true]
My overwhelming frustration is the opacity of a running Erlang system.
There are no decent tools for peering inside. No usable ones whatsoever!
With any other language you can profile, make changes, evaluate
performance and make a judgement but not with Erlang.
I first wrote OpenPoker using OTP everywhere. My players, games, pots,
limits, hands, decks, etc. were all gen_server processes. I used
Mnesia transactions everywhere and I used them often.
Then I discovered that I cannot scale past 2-3k concurrent players
under heavy use.
I have a test harness that launches bots which connect to the server
and play by the script. The bots don't wait before replying to bet
requests and so launching a few thousand bots heavily loads the server.
I don't want just a few thousand concurrent bots, though! I want at
least 10k on a single VM and hundreds of thousands on a cluster, so I
set to optimize my poker server.
The Erlang Efficiency Guide recommends fprof as the tool. I ran fprof
on my test harness and discovered that the result set cannot be
processed in my 12GB of memory. I made this discovery after leaving
fprof running for a couple of days and realized this because the fprof
data files were approaching 100GB and my machine became unusable due
to heavy swapping.
fprof uses ETS tables to analyze the trace results and ETS tables
must fit in memory.
I shortened my test run and was able to see the output of the fprof
trace analysis. To say that it's dense would be an understatement! I
realize that dumping out tuples is easy but aren't computers supposed
to help us humans?
The final output from fprof is still too raw for me to analyze.
There's absolutely, positively, definitely no way to get a picture of
a running system by reading through it. I understand that I can infer
from the analysis that certain functions take a lot of time but what
if there are none?
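One way to keep an fprof run small enough to analyse in memory is to trace only a few processes for a short window. The following is a minimal sketch under that assumption, not a description of the run above. The module name, file names and window length are hypothetical; the calls are `fprof:trace/1`, `fprof:profile/1` and `fprof:analyse/1`.

```erlang
-module(fprof_window_sketch).
-export([profile_window/2]).

%% Trace only the given Pids for Millis milliseconds, then analyse.
%% File names are placeholders chosen for this sketch.
profile_window(Pids, Millis) ->
    TraceFile = "fprof_window.trace",
    %% Restrict tracing to selected processes to bound the trace size.
    ok = fprof:trace([start, {procs, Pids}, {file, TraceFile}]),
    timer:sleep(Millis),
    ok = fprof:trace(stop),
    %% Load the trace file into fprof's in-memory (ETS-backed) tables.
    ok = fprof:profile([{file, TraceFile}]),
    %% Write a report sorted by own time and skip the caller breakdown
    %% to make the output shorter.
    ok = fprof:analyse([{dest, "fprof_window.analysis"},
                        {sort, own}, totals, no_callers]),
    fprof:stop(),
    ok.
```

Calling `fprof_window_sketch:profile_window([Pid], 1000)` on one representative player or game process gives a far smaller report than tracing every process on the node, at the cost of seeing only part of the system.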
The bulk of the time in my system was taken by various OTP functions
and processes, Mnesia and unknown functions. All I could infer from it
is that perhaps I have too many processes.
Another thing that I inferred is that the normal method of writing
gen_server code doesn't work for profiling.
I had to rewrite the gen_server clauses to immediately dispatch to
functions, e.g.
    handle_cast('LOGOUT', Data) ->
        handle_cast_logout(Data);
    handle_cast('DISCONNECT', Data) ->
        handle_cast_disconnect(Data);
otherwise all the clauses of a gen_server are squashed together,
regardless of the message pattern. I don't know if there's a better
way to tackle this.
Next, I rewrote most of my gen_servers as data structures, e.g. pot,
limit, deck, etc. A deck of cards can take a message to draw a card
but the message can just as well be a function call. The deck
structure will need to be modified regardless and the tuple will be
duplicated anyway. There didn't seem to be any advantage in using a
process here, much less a gen_server.
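To illustrate the idea, a deck kept as a plain data structure might look like the sketch below. This is not the OpenPoker code: the module name, the card representation and the function names are assumptions made for the example, and shuffling is left out.

```erlang
-module(deck_sketch).
-export([new/0, draw/1, remaining/1]).

%% A deck is a plain list of {Rank, Suit} tuples; no process is involved.
%% Ranks 11..14 stand for jack, queen, king and ace in this sketch.
new() ->
    [{Rank, Suit} || Suit <- [clubs, diamonds, hearts, spades],
                     Rank <- lists:seq(2, 14)].

%% Drawing is an ordinary function call that returns the top card
%% together with the updated deck, or 'empty' when no cards are left.
draw([Card | Rest]) -> {Card, Rest};
draw([]) -> empty.

%% Number of cards still in the deck.
remaining(Deck) -> length(Deck).
```

The caller threads the returned deck through its own state, for example `{Card, Deck1} = deck_sketch:draw(Deck0)`, instead of sending a message to a deck process and waiting for the reply.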
Next I carefully went through my Mnesia schema and split some tables
into smaller tables. I made sure that only the absolutely necessary
tables were disk-based. I wish I could run without updating Mnesia
tables during a game but this is impossible since player balances and
status need to be updated when players join or leave a game, as well
as when a game finishes.
All my hard work paid off and I was able to get close to 10K players,
with kernel poll enabled, of course. Then I ran out of ETS tables.
I don't create ETS tables on the fly but, apparently, Mnesia does. For
every transaction!!!
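A rough way to watch ETS table usage from inside the VM is sketched below. The module name is hypothetical, and `erlang:system_info(ets_limit)` is assumed to be available, which is true of recent OTP releases but may not hold on R12B.

```erlang
-module(ets_usage_sketch).
-export([usage/0]).

%% Return {TablesInUse, ConfiguredLimit} for the local node.
usage() ->
    %% ets:all/0 lists every table that currently exists on the node.
    InUse = length(ets:all()),
    %% Assumes a release that supports the ets_limit item.
    Limit = erlang:system_info(ets_limit),
    {InUse, Limit}.
```

Logging this pair periodically during a load test shows how close the node gets to the limit, which can be raised with the `ERL_MAX_ETS_TABLES` environment variable.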
This prompted me to go through the server again and use dirty_read,
dirty_write wherever possible. I also placed balances in two separate
"counter" tables, integers to be divided by 10000 to get 4 decimal
points of precision. This is so that I could use dirty_update_counter
instead of a regular read, bump, write pattern.
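The counter approach can be sketched roughly as follows. The table name, record layout and module name are assumptions made for this example, and it uses a single RAM-only table rather than the two tables mentioned above. `mnesia:dirty_update_counter/3` applies the increment atomically without a transaction.

```erlang
-module(balance_sketch).
-export([setup/0, credit/2, balance/1]).

%% Amounts are stored as integers scaled by 10000 (4 decimal places).
-define(SCALE, 10000).

%% Start Mnesia and create a hypothetical RAM-only counter table.
setup() ->
    ok = mnesia:start(),
    {atomic, ok} = mnesia:create_table(balance,
                                       [{ram_copies, [node()]},
                                        {attributes, [player_id, amount]}]),
    ok.

%% Add Amount to the player's balance and return the new scaled integer.
%% One atomic call replaces the read, bump, write sequence.
credit(PlayerId, Amount) ->
    mnesia:dirty_update_counter(balance, PlayerId, round(Amount * ?SCALE)).

%% Read the balance back and convert it to a float for display.
balance(PlayerId) ->
    case mnesia:dirty_read(balance, PlayerId) of
        [{balance, PlayerId, Scaled}] -> Scaled / ?SCALE;
        [] -> 0.0
    end.
```

After `balance_sketch:setup()`, crediting a player with 1.5 stores 15000, and `balance_sketch:balance/1` divides by 10000 to recover 1.5.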
My frustration kept increasing but I gained more concurrent players. I
can now safely run up to 8K bots before timeouts start to appear.
These are gen_server call timeouts when requests for game information
take longer than the default 5 seconds. I have an average of 5 players
per game so this is not because a large number of processes are trying
to access the game.
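For reference, `gen_server:call/2` uses a 5000 ms timeout, and `gen_server:call/3` takes an explicit one. The sketch below shows a caller turning the timeout exit into a return value. The module, the `info` request and the artificial delay are hypothetical stand-ins for a busy game process.

```erlang
-module(call_timeout_sketch).
-behaviour(gen_server).
-export([start_link/1, info/2]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Delay (ms) simulates a server that is slow to answer.
start_link(Delay) ->
    gen_server:start_link(?MODULE, Delay, []).

%% Request game information with an explicit timeout and report a
%% timeout as a value instead of letting the caller exit.
info(Pid, Timeout) ->
    try gen_server:call(Pid, info, Timeout)
    catch exit:{timeout, _} -> {error, timeout}
    end.

init(Delay) -> {ok, Delay}.

handle_call(info, _From, Delay) ->
    timer:sleep(Delay),
    {reply, {ok, game_info}, Delay}.

handle_cast(_Msg, State) -> {noreply, State}.
```

A longer timeout only hides the delay. On releases before OTP 24, a reply that arrives after the timeout can still land in the caller's mailbox, so the caller has to be prepared to discard it.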
I suppose this is a reflection of the load on the system, although CPU
usage never goes past 300% which tells me that no more than 3 cores
are used by Erlang.
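A quick check of how many schedulers the VM is actually running, and how much work is queued for them, can be sketched as follows. The module name is hypothetical; the `erlang:system_info/1` and `erlang:statistics/1` items are standard.

```erlang
-module(sched_sketch).
-export([snapshot/0]).

%% Collect a few scheduler-related figures from the running VM.
snapshot() ->
    [%% Scheduler threads the emulator was started with.
     {schedulers, erlang:system_info(schedulers)},
     %% Scheduler threads currently allowed to run Erlang code.
     {schedulers_online, erlang:system_info(schedulers_online)},
     %% Number of processes that are ready to run but waiting.
     {run_queue, erlang:statistics(run_queue)},
     %% Total number of processes on the node.
     {process_count, erlang:system_info(process_count)}].
```

If `schedulers_online` is 8 while the run queue stays near zero, the processes may be spending their time waiting on each other rather than competing for CPU, but this is only a hint and not a diagnosis.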
The straw that broke my back was when stopping a bot's matching player
gen_server by returning {stop, ... } started causing my observer
process to receive tcp_close and exit. I could repeat this like
clockwork. Only spawning a separate process to send the player a stop
message would fix this.
Then I changed the way I represent cards and started seeing this behavior
again, in just one of my tests. What do cards have to do with
tcp_close? I don't know and dbg tracer is my best friend! What I know
is what git tells me and git says cards were the only difference.
Anyway, I don't think I have fully recovered yet. I may need a weekend
just to regain my sanity. I will try to spread the load among several
VMs but my hunch is that my distributed 100k players target is far far
away. I may have to keep flying blind, with only traces and
printouts to my rescue.
Thanks for listening, Joel
--
wagerlabs.com