[erlang-questions] Wanted additions to the maps module?

Tue May 10 15:39:57 CEST 2016

Thanks for the comments Björn-Egil. I did consider the concerns that you 
mentioned, but I wanted to take a pragmatic approach. I was more 
interested in practical implications of using a particular data 
structure rather than its idealistic capabilities:

1. All test data used in all tests is generated only _once_. This is to 
eliminate variations where specific structures work better with specific 
inputs.

2. Tests are run in turns the specified amount of times and the output 
is not averaged. This is to verify the reproducibility of the test for 
the same input data.

3. Key and Value are both small integers. This is to minimize the 
copying in ETS and to expose the overhead of calculating a hash in 
hash-based data structures.

4. The data structure is pre-populated with some data before starting to 
measure the time. This is to simulate longer-running processes that most 
of the time aren't empty.

5. Readings of existing and non-existing keys is interleaved with 
storing new keys. This is to reflect practical usage in typical 
applications.

6. Each test is run in isolation, in a new process. This is to measure 
the consumed memory most accurately - including all generated and not 
collected garbage.

> Performance wise Maps usually beats ETS. Apples and Oranges however. 
> ETS has other features so it's not a fair comparison.

Yeah, that's what my test concluded. Initially for small key/values maps 
are slightly faster. The difference would only grow in favour of maps 
for more complex payload because of the copying between ETS and the 
process. But maps consume much more memory than ETS. I believe the 
memory released when reallocating structures in ETS doesn't need to be 
garbage collected since it's implemented in C outside of any process memory?

>>     https://gist.github.com/amiramix/d43c9a73a6fe6d651d7f
>>
>>     Maps are quite performant but process dictionary is still quicker
>>     and maps are the worst when it comes to consumed memory, taking
>>     twice as much as dict or process dictionary and over 5 times as
>>     much memory as ets.
>>
>> Well, you are comparing apples and oranges. Process dictionary and 
>> ETS are something completely different from gb_trees, dict, maps or 
>> orddict.

What's the point of comparing gb_tress to gb_trees or maps to maps?

>>
>> I think you should look over how you measure memory. It seems very 
>> arbitrary. What are you measuring? The size of the process after it 
>> all? The gc can shrink or grow the heap during execution, trying to 
>> find a suitable size. Also look at the number of gc:s during the 
>> execution, this will indicate how garbage you are generating. 
>> Normally benchmarks ends with an explicit garbage_collect before 
>> looking at the size. This gives a more fair indication of memory 
>> consumption.
>> It all depends on what you want to measure though, so look it over.
>>
>> Another approach would be to use erts_debug:flat_size/1. With the 
>> functional data structures in this test you'll get:
>>
>> - dict: 67568
>> - gb_trees: 78668
>> - map_match: 58543
>> - list: 78665
>> - orddict: 78665
>>
>>

I am not measuring the "size of the process after it all". I am 
measuring how much the size of an Erlang process grows after storing and 
reading 30000 key/value pairs. This is to give me a better understanding 
of how much memory I would potentially need to store (for example) a 
billion of key-value data elements in one million of Erlang processes 
(one thousand of such elements in each Erlang process).

Measuring the flat size maybe useful for developers implementing the 
data structure. But does it tell me how much resources (CPU & memory) 
the VM can consume, and thus how much hardware I would need to buy to 
run such VM reliably?

I appreciate your comments so please let me know if there is anything 
you would like to add.

Coming back to the iterator question, some data structures may be better 
suitable for implementing such iterator than others. If for now the LUA 
table could be implemented either in maps or dict, then maybe if say, 
process dictionary supported iterators efficiently, they could well be 
implemented using PD instead?

/Grzegorz

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20160510/fad57d26/attachment.htm>