11 Profiling

11.1 Do not guess about performance - profile

Even experienced software developers often guess wrong about where the performance bottlenecks are in their programs.

Therefore, profile your program to see where the performance bottlenecks are and concentrate on optimizing them.

Erlang/OTP contains several tools to help find bottlenecks.

fprof and eprof provide the most detailed information about where the time is spent, but they significantly slow down the programs they profile.

If the program is too big to be profiled by fprof or eprof, cover and cprof can be used to locate the parts of the code that should be profiled more thoroughly using fprof or eprof.

cover provides execution counts per line per process, with less overhead than fprof/eprof. Execution counts can with some caution be used to locate potential performance bottlenecks. The most lightweight tool is cprof, but it only provides execution counts on a function basis (for all processes, not per process).

11.2 Big systems

If you have a big system, it can be useful to start by profiling a simulated and limited scenario. However, bottlenecks tend to appear or cause problems only when many things are going on at the same time and when many nodes are involved. Therefore, it is desirable to also run profiling in a system test plant on a real target system.

When your system is big, you do not want to run the profiling tools on the whole system. Instead, concentrate on the processes and modules that you know are central and account for a large part of the execution.
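As a rough sketch of restricting profiling to one central process, fprof can be told which processes to trace. The module name, the function name, the time window, and the output file name "fprof.analysis" are all assumptions made for this illustration; the trace itself goes to fprof's default trace file.

```erlang
-module(fprof_proc_sketch).
-export([profile_process/2]).

%% Trace only the process Pid for Millis milliseconds, then analyse.
%% Hypothetical helper: the name and arguments are illustrative only.
profile_process(Pid, Millis) when is_pid(Pid), is_integer(Millis) ->
    ok = fprof:trace([start, {procs, [Pid]}]),  % trace this process only
    timer:sleep(Millis),                        % let the system run for a while
    ok = fprof:trace(stop),
    ok = fprof:profile(),                       % process the trace file
    ok = fprof:analyse([{dest, "fprof.analysis"}]),
    fprof:stop().
```

Keeping the time window short limits both the slowdown and the size of the trace file.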

11.3 What to look for

When analyzing the result file from the profiling activity, look for functions that are called many times and have a long "own" execution time (time excluding calls to other functions). Functions that are simply called very many times can also be interesting, as even small things can add up to quite a bit if they are repeated often. Then ask yourself what you can do to reduce this time. Appropriate questions to ask yourself are:

Can I reduce the number of times the function is called?
Are there tests that can be run less often if I change the order of tests?
Are there redundant tests that can be removed?
Is there some expression calculated giving the same result each time?
Are there other ways of doing this that are equivalent and more efficient?
Can I use another internal data representation to make things more efficient?
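One of the questions above asks whether an expression is calculated with the same result each time. The following made-up example illustrates that case: the module and function names are hypothetical, and whether the change matters in practice should be confirmed by measurement.

```erlang
-module(hoist_sketch).
-export([scale_slow/2, scale_fast/2]).

%% math:sqrt(Base) is evaluated once per list element,
%% although its result never changes.
scale_slow(List, Base) ->
    [X * math:sqrt(Base) || X <- List].

%% The invariant expression is evaluated once, before the loop.
scale_fast(List, Base) ->
    Factor = math:sqrt(Base),
    [X * Factor || X <- List].
```

Both functions return the same list; only the number of calls to math:sqrt/1 differs.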

These questions are not always trivial to answer. You might need to run some benchmarks to back up your theory, to avoid making things slower if your theory is wrong. See section 11.5, Benchmarking.

11.4 Tools

fprof

fprof measures the execution time for each function, both own time, that is, how much time a function has used for its own execution, and accumulated time, that is, including called functions. The values are displayed per process. You also get to know how many times each function has been called. fprof is based on trace to file to minimize runtime performance impact. Using fprof is just a matter of calling a few library functions; see the fprof manual page under the application tools.
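The "few library functions" could look roughly as follows. This is a minimal sketch, not the only way to drive fprof: the module name, the wrapper function, and the output file name "fprof.analysis" are assumptions for illustration.

```erlang
-module(fprof_sketch).
-export([profile/2]).

%% Run apply(Fun, Args) under fprof and write the analysis to a file.
%% Hypothetical wrapper: the name and output file are illustrative only.
profile(Fun, Args) when is_function(Fun), is_list(Args) ->
    Result = fprof:apply(Fun, Args),                 % trace the call to the default trace file
    ok = fprof:profile(),                            % process the trace file
    ok = fprof:analyse([{dest, "fprof.analysis"}]),  % write own and accumulated times
    fprof:stop(),
    Result.
```

A call such as fprof_sketch:profile(fun lists:sort/1, [SomeList]) returns the result of the profiled call and leaves the per-process report in the analysis file.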

fprof was introduced in version R8 of Erlang/OTP. Its predecessor eprof, which is based on the Erlang trace BIFs, is still available; see the eprof manual page under the application tools. eprof shows how much time has been used by each process, and in which function calls this time has been spent. Time is shown as a percentage of total time, not as absolute time.
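For comparison, a similar sketch for eprof is shown below. The module and function names are again hypothetical, and eprof:profile/1 taking a fun is assumed to be available in the Erlang/OTP release in use.

```erlang
-module(eprof_sketch).
-export([profile/1]).

%% Run Fun under eprof and print the analysis to standard output.
%% Hypothetical wrapper: the name is illustrative only.
profile(Fun) when is_function(Fun, 0) ->
    _ = eprof:start(),                  % may already be running
    {ok, Result} = eprof:profile(Fun),  % profile the evaluation of Fun()
    _ = eprof:analyze(),                % print time per function as a percentage
    _ = eprof:stop(),
    Result.
```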

cover

cover's primary use is coverage analysis to verify test cases, making sure all relevant code is covered. cover counts how many times each executable line of code is executed when a program is run. This is done on a per module basis. This information can also be used to determine which code is run very frequently and could therefore be a candidate for optimization. Using cover is just a matter of calling a few library functions; see the cover manual page under the application tools.
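A minimal sketch of collecting per-line execution counts with cover might look like this. The module name, the function name, and the idea of passing in a source file and a workload fun are assumptions for illustration; cover needs the source file (or a beam file with debug information) to instrument the module.

```erlang
-module(cover_sketch).
-export([line_counts/2]).

%% Cover-compile SourceFile, run Fun, and return the execution count
%% for each executable line. Hypothetical helper for illustration.
line_counts(SourceFile, Fun) when is_function(Fun, 0) ->
    _ = cover:start(),                         % may already be running
    {ok, Mod} = cover:compile(SourceFile),     % instrument and load the module
    _ = Fun(),                                 % the workload to measure
    {ok, Counts} = cover:analyse(Mod, calls, line),
    cover:stop(),
    {Mod, Counts}.                             % Counts = [{{Mod, Line}, Calls}]
```

Sorting Counts by the number of calls gives a quick view of the most frequently executed lines.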

cprof

In terms of features, cprof falls between fprof and cover. It counts how many times each function is called when the program is run, on a per module basis. cprof causes little performance degradation (compared with fprof and eprof) and does not need to recompile any modules to profile them (unlike cover).
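As an illustration, call counting for a single module could be wrapped as below. The module and function names are hypothetical; Mod and Fun stand for your own module and workload.

```erlang
-module(cprof_sketch).
-export([count_calls/2]).

%% Count the calls made to functions in Mod while Fun runs.
%% Hypothetical helper: the name and arguments are illustrative only.
count_calls(Mod, Fun) when is_atom(Mod), is_function(Fun, 0) ->
    _ = code:ensure_loaded(Mod),
    _ = cprof:stop(),              % clear counters left from an earlier run
    _ = cprof:start(Mod),          % count calls to functions in Mod only
    Result = Fun(),
    _ = cprof:pause(),             % freeze the counters before reading them
    {Mod, Total, PerFunction} = cprof:analyse(Mod),
    _ = cprof:stop(),
    {Result, Total, PerFunction}.  % PerFunction = [{{Mod, Func, Arity}, Count}]
```

The counts cover all processes that call into Mod while the counters are active, not only the calling process.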

Tool summary

Tool	Results	Size of result	Effects on program execution time	Records number of calls	Records execution time	Records called by	Records garbage collection
fprof	per process to screen/file	large	significant slowdown	yes	total and own	yes	yes
eprof	per process/function to screen/file	medium	significant slowdown	yes	only total	no	no
cover	per module to screen/file	small	moderate slowdown	yes, per line	no	no	no
cprof	per module to caller	small	small slowdown	yes	no	no	no

Table 11.1:

11.5 Benchmarking

The main purpose of benchmarking is to find out which implementation of a given algorithm or function is the fastest. Benchmarking is far from an exact science. Today's operating systems generally run background tasks that are difficult to turn off. Caches and multiple CPU cores do not make it any easier. It would be best to run Unix computers in single-user mode when benchmarking, but that is inconvenient, to say the least, for casual testing.

Benchmarks can measure wall-clock time or CPU time.

timer:tc/3 measures wall-clock time. The advantage of wall-clock time is that I/O, swapping, and other activities in the operating-system kernel are included in the measurements. The disadvantage is that the measurements vary widely from run to run. Usually it is best to run the benchmark several times and note the shortest time, since that time is the closest to what can be achieved under the best of circumstances.

statistics/1 with the argument runtime measures CPU time spent in the Erlang virtual machine. The advantage is that the results are more consistent from run to run. The disadvantage is that the time spent in the operating system kernel (such as swapping and I/O) is not included. Therefore, measuring CPU time is misleading if any I/O (file or sockets) is involved.

It is probably a good idea to do both wall-clock measurements and CPU time measurements.
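A rough sketch of taking both measurements in one go is shown below. The module name, the function name, and the shape of the returned map are assumptions for illustration. It uses timer:tc/1, the variant of timer:tc/3 that takes a fun, and note that statistics(runtime) reports CPU time for the whole virtual machine, not only for the calling process.

```erlang
-module(bench_sketch).
-export([measure/2]).

%% Run Fun Repeats times. Report the shortest wall-clock time in
%% microseconds and the CPU time for all runs in milliseconds.
%% Hypothetical helper: the name and return format are illustrative.
measure(Fun, Repeats)
  when is_function(Fun, 0), is_integer(Repeats), Repeats > 0 ->
    {_, _} = statistics(runtime),        % reset the "since last call" counter
    WallTimes = [element(1, timer:tc(Fun)) || _ <- lists:seq(1, Repeats)],
    {_, CpuMs} = statistics(runtime),    % CPU ms used by the VM since the reset
    #{min_wall_us => lists:min(WallTimes), cpu_ms => CpuMs}.
```

Comparing the two values for the same workload gives a hint about how much of the elapsed time is spent outside the virtual machine.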

Some additional advice:

The granularity of both types of measurement can be quite coarse, so make sure that each individual measurement lasts for at least several seconds.
To make the test fair, each new test run should run in its own, newly created Erlang process. Otherwise, if all tests run in the same process, the later tests would start out with larger heap sizes and would therefore probably do fewer garbage collections. You could also consider restarting the Erlang emulator between each test.
Do not assume that the fastest implementation of a given algorithm on computer architecture X is also the fastest on computer architecture Y.
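The advice about running each test in a newly created process could be sketched as follows. The module and function names are hypothetical, and timer:tc/1 is used for the wall-clock measurement.

```erlang
-module(fresh_bench).
-export([run/1]).

%% Time Fun in a newly spawned process, so that every run starts
%% with a fresh heap. Hypothetical helper for illustration.
run(Fun) when is_function(Fun, 0) ->
    {Pid, Ref} = spawn_monitor(fun() ->
        {Micros, _Result} = timer:tc(Fun),
        exit({done, Micros})             % report the time via the exit reason
    end),
    receive
        {'DOWN', Ref, process, Pid, {done, Micros}} -> {ok, Micros};
        {'DOWN', Ref, process, Pid, Reason}         -> {error, Reason}
    end.
```

Calling fresh_bench:run/1 once per candidate implementation keeps heap growth from one run from influencing the next.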