5 How to Interpret the Erlang Crash Dumps

This section describes the erl_crash.dump file generated upon abnormal exit of the Erlang runtime system.

Note

The Erlang crash dump had a major facelift in Erlang/OTP R9C. The information in this section is therefore not directly applicable for older dumps. However, if you use crashdump_viewer(3) on older dumps, the crash dumps are translated into a format similar to this.

The system writes the crash dump in the current directory of the emulator, or in the file pointed out by the environment variable ERL_CRASH_DUMP (interpreted however environment variables work on the current operating system). For a crash dump to be written, a writable file system must be mounted.
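
For example (a minimal sketch; the path is made up), the variable can be inspected from a running node:

1> os:getenv("ERL_CRASH_DUMP").
"/var/log/erlang/erl_crash.dump"

If the variable is unset, os:getenv/1 returns false and the dump is written to erl_crash.dump in the current directory of the emulator.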

Crash dumps are written mainly for one of two reasons: either the built-in function erlang:halt/1 is called explicitly with a string argument from running Erlang code, or the runtime system has detected an error that cannot be handled. The most usual reason that the system cannot handle an error is that the cause is an external limitation, such as running out of memory. A crash dump caused by an internal error can result from the system reaching limits in the emulator itself (like the number of atoms in the system, or too many simultaneous ETS tables). Usually the emulator or the operating system can be reconfigured to avoid the crash, which is why interpreting the crash dump correctly is important.

On systems that support OS signals, it is also possible to stop the runtime system and generate a crash dump by sending the SIGUSR1 signal.
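
On such systems the signal can even be provoked from the Erlang shell itself; a hedged sketch (Unix only, and it stops the node):

1> os:cmd("kill -USR1 " ++ os:getpid()).

os:getpid/0 returns the operating system process identifier of the emulator as a string, so this makes the runtime system send SIGUSR1 to itself and produce a crash dump.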

The Erlang crash dump is a readable text file, but it can be difficult to read. Using the Crashdump Viewer tool in the Observer application simplifies the task. This is a wx-widget-based tool for browsing Erlang crash dumps.
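
Assuming the Observer application is available, the viewer is started from an Erlang shell on any node (it does not have to be the node that crashed):

1> crashdump_viewer:start().

The tool then asks for the crash dump file to load; crashdump_viewer:start/1 can alternatively be given the file name directly.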

5.1 General Information

The first part of the crash dump shows the following:

  • The creation time for the dump
  • A slogan indicating the reason for the dump
  • The system version of the node from which the dump originates
  • The number of atoms in the atom table
  • The runtime system thread that caused the crash dump

The reason for the dump is shown at the beginning of the file as:

Slogan: <reason>

If the system is halted by the BIF erlang:halt/1, the slogan is the string parameter passed to the BIF; otherwise it is a description generated by the emulator or the (Erlang) kernel. Normally the message is enough to understand the problem, but the most common messages are described below. Notice that the suggested reasons for the crash are only suggestions. The exact reasons for the errors can vary depending on the local applications and the underlying operating system.
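
As a minimal sketch of the first case (the slogan string here is made up):

1> erlang:halt("out of frobnitz").

The resulting erl_crash.dump then begins with:

Slogan: out of frobnitz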

<A>: Cannot allocate <N> bytes of memory (of type "<T>")
The system has run out of memory. <A> is the allocator that failed to allocate memory, <N> is the number of bytes that <A> tried to allocate, and <T> is the memory block type that the memory was needed for. The most common case is that a process stores huge amounts of data. In this case <T> is most often heap, old_heap, heap_frag, or binary. For more information on allocators, see erts_alloc(3).

<A>: Cannot reallocate <N> bytes of memory (of type "<T>")
Same as above, except that memory was reallocated instead of allocated when the system ran out of memory.

Unexpected op code <N>
Error in compiled code, a damaged beam file, or an error in the compiler.

Module <Name> undefined | Function <Name> undefined | No function <Name>:<Name>/1 | No function <Name>:start/2
The Kernel/STDLIB applications are damaged or the start script is damaged.

Driver_select called with too large file descriptor <N>
The number of file descriptors for sockets exceeds 1024 (Unix only). The limit on file descriptors in some Unix flavors can be set above 1024, but only 1024 sockets/pipes can be used simultaneously by Erlang (because of limitations in the Unix select call). The number of open regular files is not affected by this.

Received SIGUSR1
Sending the SIGUSR1 signal to an Erlang machine (Unix only) forces a crash dump. This slogan means that the Erlang machine crash-dumped because it received that signal.

Kernel pid terminated (Who) (Exit reason)
The kernel supervisor has detected a failure, usually that the application_controller has shut down (Who = application_controller, Why = shutdown). The application controller can have shut down for many reasons, the most usual being that the node name of the distributed Erlang node is already in use. A complete supervisor tree "crash" (that is, the top supervisors have exited) gives about the same result. This message comes from the Erlang code and not from the virtual machine itself. It is always because of some failure in an application, either within OTP or a "user-written" one. Looking at the error log for your application is probably the first step to take.

Init terminating in do_boot ()
The primitive Erlang boot sequence was terminated, most probably because the boot script has errors or cannot be read. This is usually a configuration error; the system can have been started with a faulty -boot parameter or with a boot script from the wrong OTP version.

Could not start kernel pid (Who) ()
One of the kernel processes could not start. This is probably because of faulty arguments (like errors in a -config argument) or faulty configuration files. Check that all files are in their correct location and that the configuration files (if any) are not damaged. Usually messages are also written to the controlling terminal and/or the error log explaining what is wrong.

Other errors than these can occur, as the erlang:halt/1 BIF can generate any message. If the message is not generated by the BIF and does not occur in the list above, it can be because of an error in the emulator. There can, however, be unusual messages, not mentioned here, that are still connected to an application failure. There is much more information available, so a thorough reading of the crash dump can reveal the crash reason. The size of processes, the number of ETS tables, and the Erlang data on each process stack can be useful to find the problem.

The number of atoms in the system at the time of the crash is shown as Atoms: <number>. A few tens of thousands of atoms is perfectly normal, but more can indicate that the BIF erlang:list_to_atom/1 is used to generate many different atoms dynamically, which is never a good idea.
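
As an illustration of the anti-pattern (a deliberately bad sketch; the atom names are made up), each unique string below adds a permanent entry to the atom table, which is never garbage collected:

1> [list_to_atom("atom_" ++ integer_to_list(N)) || N <- lists:seq(1, 100000)].

When the strings come from untrusted input, list_to_existing_atom/1 is the safer choice, as it refuses to create new atoms.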

5.2 Scheduler Information

Under the tag =scheduler is shown information about the current state and statistics of the schedulers in the runtime system. On operating systems that allow suspension of other threads, the data within this section reflects what the runtime system looked like at the moment of the crash.

The following fields can exist for each scheduler:

=scheduler:<Id>
Heading. States the scheduler identifier.

Scheduler Sleep Info Flags
If empty, the scheduler was doing some work. If not empty, the scheduler is either in some state of sleep or suspended.

Scheduler Sleep Info Aux Work
If not empty, scheduler-internal auxiliary work is scheduled to be done.

Current Port
The port identifier of the port that is currently executed by the scheduler.

Current Process
The process identifier of the process that is currently executed by the scheduler. If there is such a process, this entry is followed by the State, Internal State, Program Counter, and CP of that same process. The entries are described in section Process Information.

Notice that this is a snapshot of the entries exactly when the crash dump generation started. Therefore they are most likely different (and more telling) than the entries for the same processes found in the =proc section. If there is no currently running process, only the Current Process entry is shown.

Current Process Limited Stack Trace
This entry is shown only if there is a current process. It is similar to =proc_stack, except that only the function frames are shown (that is, the stack variables are omitted). Also, only the top and bottom parts of the stack are shown. If the stack is small (< 512 slots), the entire stack is shown. Otherwise the entry skipping ## slots is shown, where ## is replaced by the number of slots that have been skipped.

Run Queue
Shows statistics about how many processes and ports of different priorities are scheduled on this scheduler.

** crashed **
This entry is normally not shown. It signifies that getting the rest of the information about this scheduler failed for some reason.

5.3 Memory Information

Under the tag =memory is shown information similar to what can be obtained on a living node with erlang:memory().
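
For comparison, a living node reports the same kind of information (the numbers below are illustrative only):

1> erlang:memory().
[{total,15268232},
 {processes,4976620},
 {processes_used,4976276},
 {system,10291612},
 {atom,194289},
 {atom_used,170735},
 {binary,24528},
 {code,4208900},
 {ets,301896}]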

5.4 Internal Table Information

Under the tags =hash_table:<table_name> and =index_table:<table_name> are shown internal tables. These are mostly of interest for runtime system developers.

5.5 Allocated Areas

Under the tag =allocated_areas is shown information similar to what can be obtained on a living node with erlang:system_info(allocated_areas).

5.6 Allocator

Under the tag =allocator:<A> is shown various information about allocator <A>. The information is similar to what can be obtained on a living node with erlang:system_info({allocator, <A>}). For more information, see also erts_alloc(3).
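
For example, on a living node (ets_alloc is one of the standard erts_alloc allocators; the output is abbreviated):

1> erlang:system_info({allocator, ets_alloc}).
[{instance,0,
  [{versions,...},
   {options,...},
   {mbcs,...},
   {sbcs,...},
   ...]}|...]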

5.7 Process Information

The Erlang crash dump contains a listing of each living Erlang process in the system. The following fields can exist for a process:

=proc:<pid>
Heading. States the process identifier.

State
The state of the process. This can be one of the following:

  • Scheduled: The process was scheduled to run but is currently not running ("in the run queue").
  • Waiting: The process was waiting for something (in receive).
  • Running: The process was currently running. If the BIF erlang:halt/1 was called, this was the process calling it.
  • Exiting: The process was on its way to exit.
  • Garbing: This is bad luck; the process was garbage collecting when the crash dump was written. The rest of the information for this process is limited.
  • Suspended: The process is suspended, either by the BIF erlang:suspend_process/1 or because it tries to write to a busy port.

Registered name
The registered name of the process, if any.

Spawned as
The entry point of the process, that is, what function was referenced in the spawn or spawn_link call that started the process.

Last scheduled in for | Current call
The current function of the process. These fields do not always exist.

Spawned by
The parent of the process, that is, the process that executed spawn or spawn_link.

Started
The date and time when the process was started.

Message queue length
The number of messages in the process' message queue.

Number of heap fragments
The number of allocated heap fragments.

Heap fragment data
Size of fragmented heap data, in words. This is data either created by messages sent to the process or by Erlang BIFs. This amount depends on so many things that this field is usually uninteresting.

Link list
Process IDs of processes linked to this one. Can also contain ports. If process monitoring is used, this field also tells in which direction the monitoring is in effect. That is, a link "to" a process tells you that the "current" process was monitoring the other, and a link "from" a process tells you that the other process was monitoring the current one.

Reductions
The number of reductions consumed by the process.

Stack+heap
The size of the stack and heap, in words (they share a memory segment).

OldHeap
The size of the "old heap", in words. The Erlang virtual machine uses generational garbage collection with two generations. There is one heap for new data and one for the data that has survived two garbage collections. The assumption (which is almost always correct) is that data surviving two garbage collections can be "tenured" to a heap that is garbage collected more seldom, as it will live for a long period. This is a common technique in virtual machines. The sum of the heaps and stack together constitutes most of the allocated memory of the process.

Heap unused, OldHeap unused
The amount of unused memory on each heap, in words. This information is usually useless.

Memory
The total memory used by this process, in bytes. This includes call stack, heap, and internal structures. Same as erlang:process_info(Pid, memory) (see the example after this list).

Program counter
The current instruction pointer. This is only of interest for runtime system developers. The function into which the program counter points is the current function of the process.

CP
The continuation pointer, that is, the return address for the current call. Usually useless for other than runtime system developers. This can be followed by the function into which the CP points, which is the function calling the current function.

Arity
The number of live argument registers. The argument registers, if any are live, will follow. These can contain the arguments of the function if they are not yet moved to the stack.

Internal State
A more detailed internal representation of the state of this process.
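
The Memory field can be cross-checked on a living node; a minimal sketch (code_server is simply a convenient registered process, and the numbers are illustrative):

1> Pid = whereis(code_server).
<0.19.0>
2> erlang:process_info(Pid, memory).
{memory,36312}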

See also section Process Data.

5.8 Port Information

This section lists the open ports, their owners, any linked processes, and the name of their driver or external process.

5.9 ETS Tables

This section contains information about all the ETS tables in the system. The following fields are of interest for each table:

=ets:<owner>
Heading. States the table owner (a process identifier).

Table
The identifier for the table. If the table is a named_table, this is the name.

Name
The table name, regardless of whether it is a named_table or not.

Hash table, Buckets
Shown if the table is a hash table, that is, if it is not an ordered_set.

Hash table, Chain Length
Shown if the table is a hash table. Contains statistics about the table, such as the maximum, minimum, and average chain length. Having a maximum much larger than the average, and a standard deviation much larger than the expected standard deviation, is a sign that the hashing of the terms behaves badly for some reason.

Ordered set (AVL tree), Entries
Shown if the table is an ordered_set. (The number of elements is the same as the number of objects in the table.)

Fixed
If the table is fixed using ets:safe_fixtable/2 or some internal mechanism (see the example after this list).

Objects
The number of objects in the table.

Words
The number of words allocated to data in the table.

Type
The table type, that is, set, bag, duplicate_bag, or ordered_set.

Compressed
If the table was compressed.

Protection
The protection of the table.

Write Concurrency
If write_concurrency was enabled for the table.

Read Concurrency
If read_concurrency was enabled for the table.
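
The Fixed field typically results from a safe traversal pattern like the following hedged sketch (the table name and contents are made up); a node that crashes in the middle of such a traversal shows the table as fixed:

1> T = ets:new(my_tab, [set]).
2> ets:insert(T, {a, 1}).
3> ets:safe_fixtable(T, true).   % the table is now fixed
4> First = ets:first(T).         % traverse safely while fixed
5> ets:safe_fixtable(T, false).  % unfix when the traversal is done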

5.10 Timers

This section contains information about all the timers started with the BIFs erlang:start_timer/3 and erlang:send_after/3 (a usage sketch follows the list). The following fields exist for each timer:

=timer:<owner>
Heading. States the timer owner (a process identifier), that is, the process to receive the message when the timer expires.

Message
The message to be sent.

Time left
Number of milliseconds left until the message would have been sent.
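
The two BIFs differ only in how the message is delivered; a minimal sketch (the delay and message term are made up):

1> erlang:send_after(60000, self(), tick).    % delivers tick
2> erlang:start_timer(60000, self(), tick).   % delivers {timeout, Ref, tick}

A timer still pending at the time of the crash appears in this section with its owner, message, and remaining time.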

5.11 Distribution Information

If the Erlang node was alive, that is, set up for communicating with other nodes, this section lists the connections that were active. The following fields can exist:

=node:<node_name>
The node name.

no_distribution
If the node was not distributed.

=visible_node:<channel>
Heading for a visible node, that is, an alive node with a connection to the node that crashed. States the channel number for the node.

=hidden_node:<channel>
Heading for a hidden node. A hidden node is the same as a visible node, except that it is started with the "-hidden" flag. States the channel number for the node.

=not_connected:<channel>
Heading for a node that was connected to the crashed node earlier. References (that is, process or port identifiers) to the not connected node existed at the time of the crash. States the channel number for the node.

Name
The name of the remote node.

Controller
The port controlling communication with the remote node.

Creation
An integer (1-3) that together with the node name identifies a specific instance of the node.

Remote monitoring: <local_proc> <remote_proc>
The local process was monitoring the remote process at the time of the crash.

Remotely monitored by: <local_proc> <remote_proc>
The remote process was monitoring the local process at the time of the crash.

Remote link: <local_proc> <remote_proc>
A link existed between the local process and the remote process at the time of the crash.

5.12 Loaded Module Information

This section contains information about all loaded modules.

First, the memory use by the loaded code is summarized:

Current code
Code that is the current, latest version of the modules.

Old code
Code where a newer version exists in the system, but the old version is not yet purged.

Then, all loaded modules are listed. The following fields exist:

=mod:<module_name>
Heading. States the module name.

Current size
Memory use for the loaded code, in bytes.

Old size
Memory use for the old code, in bytes.

Current attributes
Module attributes for the current code. This field is decoded when looked at by the Crashdump Viewer tool.

Old attributes
Module attributes for the old code, if any. This field is decoded when looked at by the Crashdump Viewer tool.

Current compilation info
Compilation information (options) for the current code. This field is decoded when looked at by the Crashdump Viewer tool.

Old compilation info
Compilation information (options) for the old code, if any. This field is decoded when looked at by the Crashdump Viewer tool.

5.13 Fun Information

This section lists all funs. The following fields exist for each fun:

=fun
Heading.

Module
The name of the module where the fun was defined.

Uniq, Index
Identifiers.

Address
The address of the fun's code.

Refc
The number of references to the fun.

5.14 Process Data

For each process there is at least one =proc_stack and one =proc_heap tag, followed by the raw memory information for the stack and heap of the process.

For each process there is also a =proc_messages tag if the process message queue is non-empty, and a =proc_dictionary tag if the process dictionary (the put/2 and get/1 thing) is non-empty.
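
That is, data stored as in the following sketch (key and value are made up) ends up under =proc_dictionary:

1> put(my_key, my_value).
undefined
2> get(my_key).
my_value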

The raw memory information can be decoded by the Crashdump Viewer tool. You can then see the stack dump, the message queue (if any), and the dictionary (if any).

The stack dump is a dump of the Erlang process stack. Most of the live data (that is, variables currently in use) is placed on the stack; thus this can be interesting. One has to "guess" what is what, but as the information is symbolic, thorough reading of this information can be useful. As an example, we can find the state variable of the Erlang primitive loader on lines (5) and (6) in the following example:

(1)  3cac44   Return addr 0x13BF58 (<terminate process normally>)
(2)  y(0)     ["/view/siri_r10_dev/clearcase/otp/erts/lib/kernel/ebin",
(3)            "/view/siri_r10_dev/clearcase/otp/erts/lib/stdlib/ebin"]
(4)  y(1)     <0.1.0>
(5)  y(2)     {state,[],none,#Fun<erl_prim_loader.6.7085890>,undefined,#Fun<erl_prim_loader.7.9000327>,
(6)            #Fun<erl_prim_loader.8.116480692>,#Port<0.2>,infinity,#Fun<erl_prim_loader.9.10708760>}
(7)  y(3)     infinity    

When interpreting the data for a process, it is helpful to know that anonymous function objects (funs) are given the following:

  • A name constructed from the name of the function in which they are created
  • A number (starting with 0) indicating the number of that fun within that function
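
For example, a fun created in the Erlang shell is printed with exactly such a constructed name (the numbers vary between releases and invocations):

1> fun() -> ok end.
#Fun<erl_eval.20.128620087>

Here erl_eval is the module in which shell funs are created, the first number identifies the fun within that module, and the last number is a uniqueness value.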

5.15 Atoms

This section presents all the atoms in the system. This is only of interest if one suspects that dynamic generation of atoms can be a problem; otherwise this section can be ignored.

Notice that the last created atom is shown first.

5.16 Disclaimer

The format of the crash dump evolves between OTP releases. Some information described here may not apply to your version. A description like this will never be complete; it is meant as an explanation of the crash dump in general and as a help when trying to find application errors, not as a complete specification.