[erlang-questions] Story of one performance bug (active, once vs active, N)

Fri May 13 10:05:58 CEST 2016

Hi.  Will try to share our experience with hunting one strange performance
bug.

One our (Flussonic) customer asked us to help because his web UI panel
started working very slowly after update.  Short look in network inspector
of Chrome showed that blocker was a HTTP API call to method that parses
config.

Our config parser (generated by sub-optimal neotoma grammar) isn't blazing
fast, but small config was parsed during 3 minutes (!)   3 minutes of
running erlang with some ets calls and lot of code.  Usually such amount of
text is parsed during 1 second.

We've logged on server and in console have seen that such API call makes
something with erlang and during several minutes there are 400-700 active
processes:

> erlang:statistics(total_active_tasks).
740

When such schedule storm comes  CPU used by flussonic is lowered from 1000%
to 400%

I will tell in other post how we have succesfully used lcnt, but here is a
story about pure in-erlang performance degradation.

After several hours of bisecting flussonic versions and minor commits we
have found commit that broke everything.

Flussonic on that server was working with moderate amount of cameras (350
IP cameras via protocol RTSP) and 500 mbit/s of input traffic.  This data
flow is recorded on disks and restreamed to several clients (maybe 50
mbit/s of output).

Some versions ago there was code that activated sockets with  {active, N}
 but this code was removed (don't remember why have I done it, but git
blame ruthlessly tells that I did it) and replaced by {active, once}

We have immediately replaced this code by another tricky version:

handle_info({tcp, _Socket, Bin}, #rtsp{} = State) ->
   State1 = handle_data(Bin, State),
   erlang:send_after(100, self(), activate_tcp),
   {noreply, State1};

handle_info(activate_tcp, #rtsp{socket = Socket} = State) ->
   inet:setopts(Socket, [{active,once}]),
   {noreply, State}

This simple lines of code immediately made everything works ok:  ~900% of
CPU and immediate response to API.

Here is our interpretation of this situation.

When I changed activation to once, rtsp socket was sending one message per
1460 bytes (just as IP camera sends). The trick here is that all IP cameras
are sending into TCP socket data in the same was as they send in UDP
socket. All video frames are fragmented into pieces so that they can fit
into MTU.

So when IP camera makes   write(fd, data, len),  len is always not more
than 1460. IP camera tries to send not large bursts, but rather regular
packet flow.

So, flussonic started receiving 500 mbit / 1460 bytes = 40 K of messages
per second.
About 120 messages per second per each process on 24 core Xeon server.

After adding this simple trick with later activation we give gen_tcp socket
some time to fill buffer and average data size raised from 1460 to 50000
bytes.  So amount of messages reduced to 2000 per second.

Perhaps when I changed to active,once  we've got situation that each rtsp
handling process was rescheduled immediately after calling
 inet:setopts(Socket, [{active,once}]) because data has already arrived
during handling previous packet.  This rescheduling storm blocks time for
all other processes. When we increased average size of packet situation
became better because each rtsp handling process started working more time
in actual reductions and less amount of time in scheduling.

So looks that batch handling of data is cheaper than often scheduling.

What is unclear for me:  why was  active,N  better than active,once?   I
haven't checked but I think that still average size of tcp message and
amount of messages was the same as in  active,once case.

Next post will be about locks and ets tables. It is really interesting!
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20160513/0c3d7d92/attachment.htm>