[erlang-questions] Erlang Cost Model

Thu Sep 17 17:56:47 CEST 2015

On 09/17, Eric des Courtis wrote:
>Regardless if concurrency and parallelism plays a role *I still need to
>know if an operation in a sequential process I am working on is O(1) or
>O(n)* or am I missing something?

These values are way more worthwhile as a property of libraries than 
language, since they're far more impacted by the choice of algorithm 
taken there.

Let's look at a few modules:

1. Queue

"all operations has an amortized O(1) running time, except len/1, 
join/2, split/2, filter/2 and member/2 that have O(n). To minimize the 
size of a queue minimizing the amount of garbage built by queue 
operations, the queues do not contain explicit length information, and 
that is why len/1 is O(n)."

2. Maps

undefined in the module doc, but from the EEP 
(http://www.erlang.org/eeps/eep-0043.html):

"has at most O(log N) time complexity in insert and lookup operations, 
where N is the number of key-value associations."

3. Dict

undefined in docs. I know that it is a very flat tree (bunch of buckets) 
acting as a hash map, so O(log N)

4. Array

undefined in docs. Also a very flat tree. O(Log N), where N is the size 
of the array, and if I recall, the tree has a branching factor of 10 to 
compromise between the speed of lookups and garbage generated when 
modifying it. It's described in the source only, sadly.

5. Gb_trees

Doc compares them to AVL trees, without storage overhead and with better 
performance. O(log N) it is (those are the worst cases for AVL trees)

6. Sets

They're using the same mechanism as dicts, so O(Log N)

7. Gb_sets

Same mechanism as gb_trees, so O(log N)

8. ETS

Doc says: "These provide the ability to store very large quantities of 
data in an Erlang runtime system, and to have constant access time to 
the data. (In the case of ordered_set, see below, access time is 
proportional to the logarithm of the number of objects stored)."

9. Orddict

Doc says " An orddict is a representation of a dictionary, where a list 
of pairs is used to store the keys and values. The list is ordered after 
the keys."

lists are O(N) traversal and insertion, so let's consider this the best 
case.

10. ordsets

implemented over orddict, O(N) too.

11. sofs

despite having the most cryptic documentation, it does mention: "The 
execution time of the functions of this module is dominated by the time 
it takes to sort lists. When no sorting is needed, the execution time is 
in the worst case proportional to the sum of the sizes of the input 
arguments and the returned value. A few functions execute in constant 
time: from_external, is_empty_set, is_set, is_sofs_set, to_external, 
type."

O(N log N) for most (optimal sort time, where N = sum(Input)), with a 
list of O(1) functions

12. Digraph

Unspecified, no idea

13. lists:key* functions

using lists, so O(N), but they're fast given they're implemented in C

14. Proplists

Using lists, so O(N).

Now with these modules somewhat covered (there's a lot of subtleties in 
the APIs making some better than others depending on the access 
patterns), it's interesting to look at basic operations we may do a lot, 
for example:

1. Message receiving (non-selective)

O(1), though accessing the mailbox when it's very large can trigger GCs 
and mess up expectations a bit.

2. Message receiving (selective)

O(N) selection where N is the mailbox size. An optimization existsw for 
common access patterns where a reference (with make_ref()) is created in 
the scope of the receive and used in all clauses, which truncates the 
existing mailbox, therefore reducing N.

3. Sorting (lists:sort)

O(N log N), performing a fancypants merge sort

So yeah, for a lot of data structures, the information is already there.  
For many, it's also still missing.

Regards,
Fred.