2 How to interpret the Erlang crash dumps
This document describes the erl_crash.dump file generated upon abnormal exit of the Erlang runtime system.
Important: For OTP release R9C the Erlang crash dump has had a major facelift. This means that the information in this document will not be directly applicable for older dumps. However, if you use the Crashdump Viewer tool on older dumps, the crash dumps are translated into a format similar to this.
The system writes the crash dump to the current directory of the emulator, or to the file named by the environment variable ERL_CRASH_DUMP if it is set (how environment variables are set depends on the operating system). For a crash dump to be written, a writable file system must be mounted.
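The lookup rule above can be sketched in a few lines of Python. This is only an illustration: the function name expected_dump_path is hypothetical, and the sketch assumes that ERL_CRASH_DUMP, when set, holds the file name of the dump. The directory passed as cwd must be the emulator's current directory, not that of the script.

```python
import os

def expected_dump_path(env=None, cwd=None):
    """Return the path where the crash dump is expected to be written."""
    env = os.environ if env is None else env
    cwd = os.getcwd() if cwd is None else cwd
    # Assumption: ERL_CRASH_DUMP, if set and non-empty, names the dump file.
    override = env.get("ERL_CRASH_DUMP")
    if override:
        return override
    # Otherwise the dump is erl_crash.dump in the emulator's current directory.
    return os.path.join(cwd, "erl_crash.dump")
```

Passing env and cwd explicitly makes it possible to reason about a node that runs under another user or working directory.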
Crash dumps are written mainly for one of two reasons: either the built-in function erlang:halt/1 is called explicitly with a string argument from running Erlang code, or the runtime system has detected an error that cannot be handled. Most often the system cannot handle the error because of an external limitation, such as running out of memory. A crash dump due to an internal error may be caused by the system reaching limits in the emulator itself (like the number of atoms in the system, or too many simultaneous ets tables). Usually the emulator or the operating system can be reconfigured to avoid the crash, which is why interpreting the crash dump correctly is important.
The Erlang crash dump is a readable text file, but it might not be very easy to read. Using the Crashdump Viewer tool in the observer application will simplify the task. This is an HTML-based tool for browsing Erlang crash dumps.
2.1 General information
The first part of the dump shows the creation time for the dump, a slogan indicating the reason for the dump, the system version of the node from which the dump originates, the compile time of the emulator running the originating node, and the number of atoms in the atom table.
Reasons for crash dumps (slogan)
The reason for the dump is noted in the beginning of the file as Slogan: <reason> (the word "slogan" has historical roots). If the system is halted by the BIF erlang:halt/1, the slogan is the string parameter passed to the BIF, otherwise it is a description generated by the emulator or the (Erlang) kernel. Normally the message should be enough to understand the problem, but nevertheless some messages are described here. Note however that the suggested reasons for the crash are only suggestions. The exact reasons for the errors may vary depending on the local applications and the underlying operating system.
- "<A>: Cannot allocate <N> bytes of memory (of type "<T>")." - The system has run out of memory. <A> is the allocator that failed to allocate memory, <N> is the number of bytes that <A> tried to allocate, and <T> is the memory block type that the memory was needed for. The most common case is that a process stores huge amounts of data. In this case <T> is most often heap, old_heap, heap_frag, or binary. For more information on allocators see erts_alloc(3).
- "<A>: Cannot reallocate <N> bytes of memory (of type "<T>")." - Same as above with the exception that memory was being reallocated instead of being allocated when the system ran out of memory.
- "Unexpected op code N" - Error in compiled code, beam file damaged or error in the compiler.
- "Module Name undefined" | "Function Name undefined" | "No function Name:Name/1" | "No function Name:start/2" - The kernel/stdlib applications are damaged or the start script is damaged.
- "Driver_select called with too large file descriptor N" - The number of file descriptors for sockets exceeds 1024 (Unix only). The limit on file descriptors in some Unix flavors can be set to over 1024, but only 1024 sockets/pipes can be used simultaneously by Erlang (due to limitations in the Unix select call). The number of open regular files is not affected by this.
- "Received SIGUSR1" - The SIGUSR1 signal was sent to the Erlang machine (Unix only).
- "Kernel pid terminated (Who) (Exit-reason)" - The kernel supervisor has detected a failure, usually that the application_controller has shut down (Who = application_controller, Exit-reason = shutdown). The application controller may have shut down for a number of reasons, the most common being that the node name of the distributed Erlang node is already in use. A complete supervisor tree "crash" (i.e., the top supervisors have exited) will give about the same result. This message comes from the Erlang code and not from the virtual machine itself. It is always due to some kind of failure in an application, either within OTP or a "user-written" one. Looking at the error log for your application is probably the first step to take.
- "Init terminating in do_boot ()" - The primitive Erlang boot sequence was terminated, most probably because the boot script has errors or cannot be read. This is usually a configuration error - the system may have been started with a faulty -boot parameter or with a boot script from the wrong version of OTP.
- "Could not start kernel pid (Who) ()" - One of the kernel processes could not start. This is probably due to faulty arguments (like errors in a -config argument) or faulty configuration files. Check that all files are in their correct location and that the configuration files (if any) are not damaged. Usually there are also messages written to the controlling terminal and/or the error log explaining what's wrong.
Errors other than the ones listed above may occur, since the erlang:halt/1 BIF may generate any message. If the message is not generated by the BIF and does not occur in the list above, it may be due to an error in the emulator. There may, however, be unusual messages not mentioned here that are still connected to an application failure. The crash dump contains much more information, so a more thorough reading may reveal the reason for the crash. The size of processes, the number of ets tables, and the Erlang data on each process stack can be useful for tracking down the problem.
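As a first triage step, the slogan can be pulled out of the dump and matched against the messages listed above. The following Python sketch shows one way to do that. The function names and the short labels are hypothetical, the patterns cover only the slogans described in this section, and a match is a hint rather than a diagnosis.

```python
import re

# Patterns derived from the slogan list above; the labels are illustrative only.
SLOGAN_PATTERNS = [
    (re.compile(r"Cannot (?:re)?allocate \d+ bytes of memory"), "out of memory"),
    (re.compile(r"Unexpected op code"), "damaged beam file or compiler error"),
    (re.compile(r"Driver_select called with too large file descriptor", re.I),
     "too many sockets/pipes"),
    (re.compile(r"Received SIGUSR1"), "SIGUSR1 received"),
    (re.compile(r"Kernel pid terminated"), "application failure"),
    (re.compile(r"Init terminating in do_boot"), "boot script problem"),
    (re.compile(r"Could not start kernel pid"), "kernel process failed to start"),
]

def read_slogan(dump_text):
    """Return the text after 'Slogan:' on the first such line, or None."""
    for line in dump_text.splitlines():
        if line.startswith("Slogan:"):
            return line[len("Slogan:"):].strip()
    return None

def classify_slogan(slogan):
    """Map a slogan to a short label, or 'unclassified' if nothing matches."""
    for pattern, label in SLOGAN_PATTERNS:
        if pattern.search(slogan):
            return label
    return "unclassified"
```

A slogan that comes back as "unclassified" is either a string passed to erlang:halt/1 or one of the less common messages, and calls for reading the rest of the dump.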
Number of atoms
The number of atoms in the system at the time of the crash is shown as Atoms: <number>. Some tens of thousands of atoms is perfectly normal, but more could indicate that the BIF erlang:list_to_atom/1 is used to dynamically generate a lot of different atoms, which is never a good idea.
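A quick check of the atom count might look like the Python sketch below. The function names are hypothetical, and the default threshold of 100000 is an arbitrary assumption chosen for illustration, not a limit taken from the runtime system.

```python
def atom_count(dump_text):
    """Return the number on the 'Atoms:' line, or None if there is no such line."""
    for line in dump_text.splitlines():
        if line.startswith("Atoms:"):
            return int(line.split(":", 1)[1])
    return None

def looks_atom_heavy(count, threshold=100_000):
    """Flag counts above a (hypothetical) threshold for closer inspection."""
    return count is not None and count > threshold
```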
2.2 Memory information
Under the tag =memory you will find information similar to what you can obtain on a living node with erlang:memory().
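To compare the figures with those of a living node, the =memory section can be read into a dictionary. The Python sketch below assumes that each line under the tag has the form name: integer and that the section ends at the next line starting with "="; the function name memory_section is hypothetical.

```python
def memory_section(dump_text):
    """Collect 'name: value' lines between the =memory tag and the next tag."""
    values = {}
    inside = False
    for line in dump_text.splitlines():
        if line.startswith("="):
            # A tag line starts a new section; only =memory is of interest here.
            inside = (line.strip() == "=memory")
            continue
        if inside and ":" in line:
            name, raw = line.split(":", 1)
            values[name.strip()] = int(raw)  # assumes integer values
    return values
```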
2.3 Internal table information
The tags =hash_table:<table_name> and =index_table:<table_name> present internal tables. These are mostly of interest for runtime system developers.
2.4 Allocated areas
Under the tag =allocated_areas you will find information similar to what you can obtain on a living node with erlang:system_info(allocated_areas).
2.5 Allocator
Under the tag =allocator:<A> you will find various information about allocator <A>. The information is similar to what you can obtain on a living node with erlang:system_info({allocator, <A>}). For more information see the documentation of erlang:system_info({allocator, <A>}), and the erts_alloc(3) documentation.
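The tags mentioned so far share the shape =<tag> or =<tag>:<identifier>. A small Python sketch can split such lines and count how many sections of each kind a dump contains. It assumes that only tag lines start with "="; the function names are hypothetical.

```python
from collections import Counter

def split_tag(line):
    """Split '=tag:identifier' into (tag, identifier); identifier is None if absent."""
    body = line.strip()[1:]  # drop the leading '='
    tag, sep, ident = body.partition(":")
    return tag, (ident if sep else None)

def count_sections(dump_text):
    """Count the sections of each kind, keyed by tag name."""
    counts = Counter()
    for line in dump_text.splitlines():
        if line.startswith("="):
            counts[split_tag(line)[0]] += 1
    return counts
```

Such an overview gives a feel for the size of a dump (number of processes, tables, and so on) before reading any section in detail.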
2.6 Process information
The Erlang crash dump contains a listing of each living Erlang process in the system. The process information for one process may look like this (line numbers have been added):
The following fields can exist for a process:
- =proc:<pid>
- Heading, states the process identifier
- State
- The state of the process. This can be one of the following:
- Scheduled - The process was scheduled to run but was not currently running ("in the run queue").
- Waiting - The process was waiting for something (in receive).
- Running - The process was currently running. If the BIF erlang:halt/1 was called, this was the process calling it.
- Exiting - The process was on its way to exit.
- Garbing - The process was garbage collecting when the crash dump was written, so the rest of the information for this process is limited.
- Suspended - The process is suspended, either by the BIF erlang:suspend_process/1 or because it is trying to write to a busy port.
- Registered name
- The registered name of the process, if any.
- Spawned as
- The entry point of the process, i.e., what function was referenced in the spawn or spawn_link call that started the process.
- Last scheduled in for | Current call
- The current function of the process. These fields will not always exist.
- Spawned by
- The parent of the process, i.e. the process which executed spawn or spawn_link.
- Started
- The date and time when the process was started.
- Message queue length
- The number of messages in the process' message queue.
- Number of heap fragments
- The number of allocated heap fragments.
- Heap fragment data
- Size of fragmented heap data. This is data either created by messages being sent to the process or by the Erlang BIFs. This amount depends on so many things that this field is utterly uninteresting.
- Link list
- Process IDs of processes linked to this one. May also contain ports. If process monitoring is used, this field also tells in which direction the monitoring is in effect, i.e., a link being "to" a process tells you that the "current" process was monitoring the other and a link "from" a process tells you that the other process was monitoring the current one.
- Reductions
- The number of reductions consumed by the process.
- Stack+heap
- The size of the stack and heap (they share memory segment)
- OldHeap
- The size of the "old heap". The Erlang virtual machine uses generational garbage collection with two generations. There is one heap for new data items and one for data that has survived two garbage collections. The assumption (which is almost always correct) is that data that survives two garbage collections will live for a long period, so it can be "tenured" to a heap that is garbage collected less often. This is a common technique in virtual machines. The heaps and the stack together constitute most of the process's allocated memory.
- Heap unused, OldHeap unused
- The amount of unused memory on each heap. This information is usually useless.
- Stack
- If the system uses shared heap, the fields Stack+heap, OldHeap, Heap unused and OldHeap unused do not exist. Instead this field presents the size of the process' stack.
- Program counter
- The current instruction pointer. This is only interesting for runtime system developers. The function into which the program counter points is the current function of the process.
- CP
- The continuation pointer, i.e. the return address for the current call. Usually of interest only to runtime system developers. This may be followed by the function into which the CP points, which is the function calling the current function.
- Arity
- The number of live argument registers. The argument registers, if any are live, will follow. These may contain the arguments of the function if they are not yet moved to the stack.
See also the section about process data.
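The per-process fields lend themselves to simple screening, for example finding the processes with the longest message queues. The Python sketch below assumes that each field is written as Field name: value on its own line under the =proc:<pid> heading; the function names are hypothetical.

```python
def parse_processes(dump_text):
    """Return {pid: {field: value}} for every =proc:<pid> section."""
    procs = {}
    current = None
    for line in dump_text.splitlines():
        if line.startswith("=proc:"):
            current = procs.setdefault(line[len("=proc:"):].strip(), {})
        elif line.startswith("="):
            current = None  # some other section, e.g. the raw process data
        elif current is not None and ": " in line:
            field, value = line.split(": ", 1)
            current[field] = value.strip()
    return procs

def longest_queues(procs, top=3):
    """Return the (pid, fields) pairs with the largest message queues."""
    def queue_len(item):
        return int(item[1].get("Message queue length", 0))
    return sorted(procs.items(), key=queue_len, reverse=True)[:top]
```

The same dictionary can be sorted on Reductions or Stack+heap to look for busy or large processes.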
2.7 Port information
This section lists the open ports, their owners, any linked processes, and the name of their driver or external process.
2.8 ETS tables
This section contains information about all the ETS tables in the system. The following fields are interesting for each table:
- =ets:<owner>
- Heading, states the owner of the table (a process identifier)
- Table
- The identifier for the table. If the table is a named_table, this is the name.
- Name
- The name of the table, regardless of whether it is a named_table or not.
- Buckets
- This occurs if the table is a hash table, i.e. if it is not an ordered_set.
- Ordered set (AVL tree), Elements
- This occurs only if the table is an ordered_set. (The number of elements is the same as the number of objects in the table.)
- Objects
- The number of objects in the table
- Words
- The number of words allocated to data in the table. A word is usually 4 bytes on a 32-bit emulator and 8 bytes on a 64-bit emulator.
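Since table sizes are given in words, a conversion is needed to compare them with the byte counts elsewhere in the dump. The Python sketch below sums the Words field per table owner and converts the result to bytes. It assumes the field is written as Words: <number> under each =ets:<owner> heading; the function name is hypothetical, and word_size defaults to 4 bytes (pass 8 for a 64-bit emulator).

```python
from collections import defaultdict

def ets_bytes_by_owner(dump_text, word_size=4):
    """Sum the 'Words' field of all ETS tables per owner, converted to bytes."""
    totals = defaultdict(int)
    owner = None
    for line in dump_text.splitlines():
        if line.startswith("=ets:"):
            owner = line[len("=ets:"):].strip()
        elif line.startswith("="):
            owner = None
        elif owner is not None and line.startswith("Words:"):
            totals[owner] += int(line.split(":", 1)[1]) * word_size
    return dict(totals)
```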
2.9 Timers
This section contains information about all the timers started with the BIFs erlang:start_timer/3 and erlang:send_after/3. The following fields exist for each timer:
- =timer:<owner>
- Heading, states the owner of the timer (a process identifier) i.e. the process to receive the message when the timer expires.
- Message
- The message to be sent.
- Time left
- Number of milliseconds left until the message would have been sent.
2.10 Distribution information
If the Erlang node was alive, i.e., set up for communicating with other nodes, this section lists the connections that were active. The following fields can exist:
- =node:<node_name>
- The name of the node
- no_distribution
- This will only occur if the node was not distributed.
- =visible_node:<channel>
- Heading for a visible node, i.e. an alive node with a connection to the node that crashed. States the channel number for the node.
- =hidden_node:<channel>
- Heading for a hidden node. A hidden node is the same as a visible node, except that it is started with the "-hidden" flag. States the channel number for the node.
- =not_connected:<channel>
- Heading for a node that was connected to the crashed node earlier. References (i.e. process or port identifiers) to the not connected node existed at the time of the crash. States the channel number for the node.
- Name
- The name of the remote node.
- Controller
- The port which controls the communication with the remote node.
- Creation
- An integer (1-3) which together with the node name identifies a specific instance of the node.
- Remote monitoring: <local_proc> <remote_proc>
- The local process was monitoring the remote process at the time of the crash.
- Remotely monitored by: <local_proc> <remote_proc>
- The remote process was monitoring the local process at the time of the crash.
- Remote link: <local_proc> <remote_proc>
- A link existed between the local process and the remote process at the time of the crash.
2.11 Loaded module information
This section contains information about all loaded modules. First, the memory usage of loaded code is summarized. There is one field for "Current code", which is the latest version of each module. There is also a field for "Old code", which is code for which a newer version exists in the system but the old version has not yet been purged. The memory usage is in bytes.
All loaded modules are then listed. The following fields exist:
- =mod:<module_name>
- Heading, and the name of the module.
- Current size
- Memory usage for the loaded code in bytes
- Old size
- Memory usage for the old code, if any.
- Current attributes
- Module attributes for the current code. This field is decoded when looked at by the Crashdump Viewer tool.
- Old attributes
- Module attributes for the old code, if any. This field is decoded when looked at by the Crashdump Viewer tool.
- Current compilation info
- Compilation information (options) for the current code. This field is decoded when looked at by the Crashdump Viewer tool.
- Old compilation info
- Compilation information (options) for the old code, if any. This field is decoded when looked at by the Crashdump Viewer tool.
2.12 Fun information
In this section, all funs are listed. The following fields exist for each fun:
- =fun
- Heading
- Module
- The name of the module where the fun was defined.
- Uniq, Index
- Identifiers
- Address
- The address of the fun's code.
- Native_address
- The address of the fun's code when HiPE is enabled.
- Refc
- The number of references to the fun.
2.13 Process Data
For each process there will be at least one =proc_stack and one =proc_heap tag followed by the raw memory information for the stack and heap of the process.
For each process there will also be a =proc_messages tag if the process' message queue is non-empty and a =proc_dictionary tag if the process' dictionary (the data stored with put/2 and read with get/1) is non-empty.
The raw memory information can be decoded by the Crashdump Viewer tool. You will then be able to see the stack dump, the message queue (if any) and the dictionary (if any).
The stack dump is a dump of the Erlang process stack. Most of the live data (i.e., variables currently in use) are placed on the stack; thus this can be quite interesting. One has to "guess" what's what, but as the information is symbolic, thorough reading of this information can be very useful. As an example we can find the state variable of the Erlang primitive loader on line (5) in the example below:
(1)  3cac44   Return addr 0x13BF58 (<terminate process normally>)
(2)  y(0)     ["/view/siri_r10_dev/clearcase/otp/erts/lib/kernel/ebin","/view/siri_r10_dev/
(3)            clearcase/otp/erts/lib/stdlib/ebin"]
(4)  y(1)     <0.1.0>
(5)  y(2)     {state,[],none,#Fun<erl_prim_loader.6.7085890>,undefined,#Fun<erl_prim_loader.7.9000327>,#Fun<erl_prim_loader.8.116480692>,#Port<0.2>,infinity,#Fun<erl_prim_loader.9.10708760>}
(6)  y(3)     infinity
When interpreting the data for a process, it is helpful to know that anonymous function objects (funs) are given a name constructed from the name of the function in which they are created, and a number (starting with 0) indicating the number of that fun within that function.
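When such generated names show up in a dump, it can help to split them back into their parts. The Python sketch below assumes the layout -<function>/<arity>-fun-<index>-, which is how these names commonly appear. That exact layout is an assumption not stated above and may differ between OTP versions, and the function name parse_fun_name is hypothetical.

```python
import re

# Assumed layout: -<function>/<arity>-fun-<index>-
FUN_NAME = re.compile(r"^-(?P<function>.+)/(?P<arity>\d+)-fun-(?P<index>\d+)-$")

def parse_fun_name(name):
    """Return (enclosing function, arity, fun index), or None if the name does not match."""
    match = FUN_NAME.match(name)
    if match is None:
        return None
    return match.group("function"), int(match.group("arity")), int(match.group("index"))
```

For a name such as -loop/3-fun-0- this yields the enclosing function loop/3 and index 0, i.e. the first fun created in that function.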
2.14 Atoms
This section lists all the atoms in the system. It is only interesting if one suspects that dynamic generation of atoms could be a problem; otherwise it can be ignored.
Note that the last created atom is printed first.
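Because the newest atom comes first, reversing the list gives the atoms in creation order, which makes a run of dynamically generated names easier to spot. The Python sketch below assumes the section is introduced by a tag line reading =atoms, has one atom per line, and contains no atom line that itself starts with "="; none of this is specified above, and the function name is hypothetical.

```python
def atoms_in_creation_order(dump_text):
    """Return the atoms of the (assumed) =atoms section, oldest first."""
    atoms = []
    inside = False
    for line in dump_text.splitlines():
        if line.startswith("="):
            inside = (line.strip() == "=atoms")
            continue
        if inside and line:
            atoms.append(line)
    atoms.reverse()  # the dump prints the last created atom first
    return atoms
```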
2.15 Disclaimer
The format of the crash dump evolves between releases of OTP. Some information here may not apply to your version. A description such as this will never be complete; it is meant as an explanation of the crash dump in general and as a help when trying to find application errors, not as a complete specification.