[erlang-questions] ETS and CPU

Hynek Vychodil <>
Wed Mar 16 22:22:09 CET 2016


It's hard to tell because we still don't know much about your use case.
Where does the data in the map come from? How often does it change? How big
a portion of the keys changes? How big a portion of the keys is read in each
process?

For example, if the data in the map changes less than, say, once per quarter
of an hour and is read heavily, I would consider compiling a module with the
map as a literal constant. If the data in the map changes often, but only a
small portion of the keys changes and only a small portion of the keys is
read, I would definitely store the data in ets with one key per record,
which is what Jesper Louis Andersen suggested, and so on.
It depends heavily on your exact use case.
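To illustrate the first option, here is a minimal sketch of compiling a module with the map as a literal constant. The module name config_const and the accessor map/0 are arbitrary choices for illustration, and error handling is omitted:

```erlang
-module(const_compile).
-export([store/1]).

%% Compile and load a module whose map/0 returns Map as a literal
%% term, so readers get a reference into the constant pool instead
%% of a per-call copy.
store(Map) ->
    Src = lists:flatten(
            io_lib:format(
              "-module(config_const). -export([map/0]). map() -> ~p.",
              [Map])),
    {ok, Tokens, _} = erl_scan:string(Src),
    Forms = [begin
                 {ok, F} = erl_parse:parse_form(Ts),
                 F
             end || Ts <- split_forms(Tokens)],
    {ok, Mod, Bin} = compile:forms(Forms),
    {module, Mod} = code:load_binary(Mod, "config_const.erl", Bin),
    ok.

%% Split a flat token list into one token list per form, cutting
%% after each dot token.
split_forms(Tokens) ->
    split_forms(Tokens, [], []).

split_forms([{dot, _} = Dot | Rest], Cur, Acc) ->
    split_forms(Rest, [], [lists:reverse([Dot | Cur]) | Acc]);
split_forms([Tok | Rest], Cur, Acc) ->
    split_forms(Rest, [Tok | Cur], Acc);
split_forms([], [], Acc) ->
    lists:reverse(Acc).
```

Readers then call config_const:map() and get the term without copying. Keep in mind that loading new code purges the old version, so this only pays off when updates are rare, as described above.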

One thing we can tell you for sure: each time you call ets:lookup/2, the
whole record from ets is copied onto the process heap. Each time you send a
message, the whole message is copied. Each time you spawn a process, the
whole initial data is copied.
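As a sketch of the one-key-per-record layout mentioned above (the table name map_tab and the helper names are made up for illustration), each map entry becomes its own ets record, so a reader copies only the entries it actually touches rather than the whole map:

```erlang
-module(map_ets).
-export([init/1, fetch/1]).

%% Store each map entry as its own {Key, Value} record.
init(Map) ->
    map_tab = ets:new(map_tab, [named_table, public,
                                {read_concurrency, true}]),
    ets:insert(map_tab, maps:to_list(Map)),
    ok.

%% Look up a single key; only this one small tuple is copied to the
%% caller's heap, not the whole map.
fetch(Key) ->
    case ets:lookup(map_tab, Key) of
        [{Key, Value}] -> {ok, Value};
        []             -> error
    end.
```

With roughly 9 triggers per second and 25 readers each needing only a few keys, the copied volume drops from ~25 MB per trigger to a few small tuples.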

I made some more measurements. When I use 2 schedulers, sending messages or
using spawn is 96% slower than ets:lookup/2. When I use 4 schedulers the
difference is only 70%, so there is a possibility that with many more
schedulers it could be even faster. I don't have enough data. But this is
just out of my curiosity, because for your real use case there could be a
way better solution.

On Wed, Mar 16, 2016 at 9:44 PM, Alex Howle <> wrote:

> Thank you very much for taking the time to experiment with this!
>
> Does that mean that splitting the 1MB map into smaller maps would somehow
> be better, do you think? All parts of the map are required for the
> processing to be successful.
>
> On Wed, Mar 16, 2016 at 5:58 PM, Hynek Vychodil <>
> wrote:
>
>> I have tried parallel version of msg and arg
>>
>> msg_p(N, Msg) ->
>>     do_p(fun msg/2, N, Msg).
>>
>> arg_p(N, Msg) ->
>>     do_p(fun arg/2, N, Msg).
>>
>> do_p(F, N, Msg) ->
>>     Schedulers = erlang:system_info(schedulers),
>>     Parent = self(),
>>     N2 = N div Schedulers,
>>     Pids = [spawn_link(fun() -> F(N2, Msg), Parent ! {ok, self()} end)
>>             || _ <- lists:seq(1, Schedulers) ],
>>     [ receive {ok, Pid} -> ok end || Pid <- Pids].
>>
>> and it performs better, but still worse than ets. I don't know how it
>> would behave on HW with 40 CPUs/schedulers.
>>
>> [[{ets_h,787.688},
>>   {ets,2215.42},
>>   {msg_p,2525.365},
>>   {msg,4964.156},
>>   {arg_p,2780.5},
>>   {arg,4248.214}],
>>  [{ets_h,901.369},
>>   {ets,2343.145},
>>   {msg_p,2368.203},
>>   {msg,5062.984},
>>   {arg_p,2073.172},
>>   {arg,4260.998}],
>>  [{ets_h,906.705},
>>   {ets,2423.889},
>>   {msg_p,3135.662},
>>   {msg,5069.39},
>>   {arg_p,2186.49},
>>   {arg,4268.753}]]
>>
>> Setting the initial heap size in msg helps a little bit:
>>
>> msg(N, Msg) ->
>>     Size = 2*erts_debug:flat_size(Msg),
>>     Pids = [ spawn_opt(fun loop/0, [link, {min_heap_size,Size}]) || _ <-
>> lists:seq(1, N) ],
>>     [ Pid ! {msg, self(), Msg} || Pid <- Pids],
>>     [ receive {ok, Pid} -> ok end || Pid <- Pids ].
>>
>> [[{ets_h,823.901},
>>   {ets,2200.168},
>>   {msg_p,1974.292},
>>   {msg,4678.855},
>>   {arg_p,2082.779},
>>   {arg,4666.294}],
>>  [{ets_h,906.677},
>>   {ets,2033.719},
>>   {msg_p,2092.892},
>>   {msg,4665.692},
>>   {arg_p,2005.953},
>>   {arg,4707.86}],
>>  [{ets_h,902.813},
>>   {ets,2290.883},
>>   {msg_p,2041.713},
>>   {msg,4655.373},
>>   {arg_p,2011.422},
>>   {arg,4659.18}]]
>>
>> So I think sending a message could be reasonably faster than the ets version
>> on HW with 40 CPUs. Anyway, storing or sending a map this big doesn't seem
>> like good design.
>>
>>
>> On Wed, Mar 16, 2016 at 6:33 PM, Hynek Vychodil <
>> > wrote:
>>
>>> I was curious enough to try it:
>>>
>>> -module(ets_vs_msg).
>>>
>>> -export([start/1]).
>>>
>>> -export([ets/2, ets_h/2, msg/2, arg/2]).
>>>
>>> -define(Tab, ?MODULE).
>>>
>>> -define(MapSize, 100000). %% 100000 is 2.87 MB
>>>
>>> start(N) ->
>>>     Map = gen_map(),
>>>     ets_init(Map),
>>>     [[{X, element(1, timer:tc(fun ?MODULE:X/2, [N, Map]))/N}
>>>       || X <- [ets_h, ets, msg, arg]]
>>>      || _ <- lists:seq(1, 3)].
>>>
>>> gen_map() ->
>>>     gen_map(?MapSize).
>>>
>>> gen_map(N) ->
>>>     maps:from_list([{X, []} || X <- lists:seq(1, N)]).
>>>
>>> ets_init(Map) ->
>>>     (catch ets:new(?Tab, [named_table])),
>>>     ets:insert(?Tab, {foo, Map}).
>>>
>>> ets(N, _Msg) ->
>>>     Pids = [ spawn_link(fun loop/0) || _ <- lists:seq(1, N) ],
>>>     [ Pid ! {ets, self()} || Pid <- Pids],
>>>     [ receive {ok, Pid} -> ok end || Pid <- Pids ].
>>>
>>> ets_h(N, Msg) ->
>>>     Size = 2*erts_debug:flat_size(Msg),
>>>     Pids = [ spawn_opt(fun loop/0, [link, {min_heap_size,Size}]) || _ <-
>>> lists:seq(1, N) ],
>>>     [ Pid ! {ets, self()} || Pid <- Pids],
>>>     [ receive {ok, Pid} -> ok end || Pid <- Pids ].
>>>
>>> msg(N, Msg) ->
>>>     Pids = [ spawn_link(fun loop/0) || _ <- lists:seq(1, N) ],
>>>     [ Pid ! {msg, self(), Msg} || Pid <- Pids],
>>>     [ receive {ok, Pid} -> ok end || Pid <- Pids ].
>>>
>>> arg(N, Msg) ->
>>>     Pids = [ spawn_link(fun() -> init(Msg) end) || _ <- lists:seq(1, N)
>>> ],
>>>     [ Pid ! {do, self()} || Pid <- Pids],
>>>     [ receive {ok, Pid} -> ok end || Pid <- Pids ].
>>>
>>> init(_) ->
>>>     loop().
>>>
>>> loop() ->
>>>     receive
>>>         {ets, From} ->
>>>             ets:lookup(?Tab, foo),
>>>             From;
>>>         {msg, From, _Msg} ->
>>>             From;
>>>         {do, From} ->
>>>             From
>>>     end ! {ok, self()}.
>>>
>>> Reading from ets with a prepared heap is the clear winner:
>>>
>>> 40> ets_vs_msg:start(1000).
>>> [[{ets_h,805.83},{ets,2383.31},{msg,4492.15},{arg,3957.693}],
>>>  [{ets_h,918.221},
>>>   {ets,2379.459},
>>>   {msg,4651.258},
>>>   {arg,4028.799}],
>>>  [{ets_h,927.538},
>>>   {ets,2370.421},
>>>   {msg,4519.885},
>>>   {arg,4057.264}]]
>>>
>>> But there is a catch. If I look at CPU utilisation, only ets_h and ets
>>> use all cores/schedulers (i7 with 4 HT cores in my case), which indicates
>>> that both the msg and arg versions copy the map from a single process. In my
>>> case, sending the message from more processes would lead to at most a 4x
>>> speed-up for the msg and arg versions.
>>>
>>> On Wed, Mar 16, 2016 at 5:20 PM, Sverker Eriksson <
>>> > wrote:
>>>
>>>> Well, I would expect copy_shallow (from ETS) to be less CPU intensive
>>>> than copy_struct (from process).
>>>>
>>>> However, as indicated by others, ets:lookup on such a big map will
>>>> probably
>>>> trigger a garbage collection on the process, which will lead to
>>>> yet another copy of the big map.
>>>>
>>>> The spawn(fun() -> do_something(BigMap) end) on the other hand will
>>>> allocate a big enough heap for the process from the start and only do
>>>> one copy of the big map.
>>>>
>>>> /Sverker, Erlang/OTP
>>>>
>>>>
>>>>
>>>> On 03/16/2016 10:43 AM, Alex Howle wrote:
>>>>
Assuming that when you say "win" you mean that ets:lookup should be
more efficient (and less CPU intensive), then I'm seeing the opposite.
>>>> On 15 Mar 2016 11:32, "Sverker Eriksson" <>
>>>> wrote:
>>>>
>>>>> Each successful ets:lookup call is a copy operation of the entire term
>>>>> from ETS to the process heap.
>>>>>
>>>>> If you are comparing ets:lookup of big map
>>>>> to sending big map in message then I would expect
>>>>> ets:lookup to win, as copy_shallow (used by ets:lookup)
>>>>> is optimized to be faster than copy_struct (used by send).
>>>>>
>>>>>
>>>>> /Sverker, Erlang/OTP
>>>>>
>>>>>
>>>>> On 03/15/2016 09:52 AM, Alex Howle wrote:
>>>>>
>>>>> I've been experiencing an issue and was wondering if anyone else has
>>>>> any experience in this area. I've stripped back the problem to its bare
>>>>> bones for the purposes of this mail.
>>>>>
>>>>>
>>>>>
>>>>> I have an Erlang 18.1 application that uses ETS to store an Erlang map
>>>>> structure. Using erts_debug:flat_size/1 I can approximate the map's size to
>>>>> be 1MB. Upon the necessary activity trigger the application spawns about 25
>>>>> short-lived processes to perform the main work of the application. This
>>>>> activity trigger is fired roughly 9 times a second under normal operating
>>>>> conditions. Each of these 25 processes performs 1 x ets:lookup/2 calls to
>>>>> read from the map.
>>>>>
>>>>>
>>>>>
>>>>> What I've found is that the above implementation has a CPU profile
>>>>> that is quite "expensive" - each of the CPU cores (40 total comprised of 2
>>>>> Processors with 10 hyperthreaded cores) frequently runs at 100%. The
>>>>> machine in question also has 32GB RAM of which about 9GB is used at peak.
>>>>> There is no swap usage whatsoever. Examination shows that copy_shallow is
>>>>> performing the most work.
>>>>>
>>>>>
>>>>>
>>>>> After changing the implementation so that the 25 spawned processes no
>>>>> longer read from the ETS table to retrieve the map structure and, instead
>>>>> the map is passed to the processes on spawn, the CPU usage on the server is
>>>>> considerably lower.
>>>>>
>>>>>
>>>>>
>>>>> Can anyone offer advice as to why I'm seeing the differing CPU
>>>>> profiles?
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> erlang-questions mailing list
>>>>> http://erlang.org/mailman/listinfo/erlang-questions
>>>>>
>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> erlang-questions mailing list
>>>> 
>>>> http://erlang.org/mailman/listinfo/erlang-questions
>>>>
>>>>
>>>
>>
>