8 External Term Format

8.1 Introduction

The external term format is mainly used in the distribution mechanism of Erlang.

Since Erlang has a fixed number of types, there is no need for a programmer to define a specification for the external format used within some application. All Erlang terms have an external representation, and the interpretation of the different terms is application specific.

In Erlang the BIF term_to_binary/1,2 is used to convert a term into the external format. To convert binary data encoding a term back into a term, the BIF binary_to_term/1 is used.

The distribution does this implicitly when sending messages across node boundaries.

The overall layout of the external term format is:

1	1	N
131	Tag	Data

Table 8.1:

Note

When messages are passed between connected nodes and a distribution header is used, the first byte containing the version number (131) is omitted from the terms that follow the distribution header. This is because the version number is implied by the version number in the distribution header.
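The layout above can be sketched in a few lines of Python. This is only an illustration: the function name split_term and the has_version parameter are hypothetical and not part of any Erlang API.

```python
VERSION_MAGIC = 131  # version byte that starts a stand-alone encoded term


def split_term(blob: bytes, has_version: bool = True):
    """Return (tag, data) for one encoded term.

    has_version=False models a term that follows a distribution header,
    where the leading 131 byte is omitted.
    """
    if has_version:
        if not blob or blob[0] != VERSION_MAGIC:
            raise ValueError("expected version byte 131")
        blob = blob[1:]
    if not blob:
        raise ValueError("missing tag byte")
    return blob[0], blob[1:]
```

For example, split_term(bytes([131, 97, 42])) yields tag 97 and a single data byte.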

A compressed term looks like this:

1	1	4	N
131	80	UncompressedSize	Zlib-compressedData

Table 8.2:

Uncompressed Size (unsigned 32 bit integer in big-endian byte order) is the size of the data before it was compressed. The compressed data has the following format when it has been expanded:

1	Uncompressed Size
Tag	Data

Table 8.3:
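As a rough sketch of how a decoder might expand a compressed term, assuming the compressed data is a standard zlib stream as the table suggests (the function name expand_compressed is hypothetical):

```python
import struct
import zlib


def expand_compressed(blob: bytes) -> bytes:
    """Turn 131, 80, UncompressedSize, Zlib-compressedData into 131, Tag, Data."""
    if len(blob) < 6 or blob[0] != 131 or blob[1] != 80:
        raise ValueError("not a compressed term")
    # UncompressedSize is an unsigned 32 bit big-endian integer.
    (size,) = struct.unpack(">I", blob[2:6])
    body = zlib.decompress(blob[6:])
    if len(body) != size:
        raise ValueError("UncompressedSize does not match the expanded data")
    # The expanded data starts with the Tag byte, so only 131 is prepended.
    return bytes([131]) + body
```

The size check is a design choice of this sketch; it catches truncated or corrupted input early.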

8.2 Distribution header

As of erts version 5.7.2 the old atom cache protocol was dropped and a new one was introduced. This atom cache protocol introduced the distribution header. Nodes with erts versions earlier than 5.7.2 can still communicate with new nodes, but no distribution header and no atom cache will be used.

The distribution header currently only contains an atom cache reference section, but could in the future contain more information. The distribution header precedes one or more Erlang terms in the external format. For more information see the documentation of the protocol between connected nodes in the distribution protocol documentation.

ATOM_CACHE_REF entries with a corresponding AtomCacheReferenceIndex, in terms encoded in the external format following a distribution header, refer to the atom cache references made in the distribution header. The range is 0 <= AtomCacheReferenceIndex < 255, i.e., at most 255 different atom cache references from the following terms can be made.

The distribution header format is:

1	1	1	NumberOfAtomCacheRefs/2+1 | 0	N | 0
131	68	NumberOfAtomCacheRefs	Flags	AtomCacheRefs

Table 8.4:

Flags consists of NumberOfAtomCacheRefs/2+1 bytes, unless NumberOfAtomCacheRefs is 0. If NumberOfAtomCacheRefs is 0, Flags and AtomCacheRefs are omitted. Each atom cache reference has a half byte flag field. Flags corresponding to a specific AtomCacheReferenceIndex are located in flag byte number AtomCacheReferenceIndex/2. Flag byte 0 is the first byte after the NumberOfAtomCacheRefs byte. Flags for an even AtomCacheReferenceIndex are located in the least significant half byte and flags for an odd AtomCacheReferenceIndex are located in the most significant half byte.

The flag field of an atom cache reference has the following format:

1 bit	3 bits
NewCacheEntryFlag	SegmentIndex

Table 8.5:

The most significant bit is the NewCacheEntryFlag. If set, the corresponding cache reference is new. The three least significant bits are the SegmentIndex of the corresponding atom cache entry. An atom cache consists of 8 segments each of size 256, i.e., an atom cache can contain 2048 entries.

After the flag fields for the atom cache references comes another half byte flag field, which has the following format:

3 bits	1 bit
CurrentlyUnused	LongAtoms

Table 8.6:

The least significant bit in that half byte is the LongAtoms flag. If it is set, 2 bytes are used for atom lengths instead of 1 byte in the distribution header. However, the current emulator cannot handle long atoms, so it will currently always be 0.
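The half byte layout described above can be illustrated with a short Python sketch. The names flag_nibble and decode_flags are hypothetical, and the division in NumberOfAtomCacheRefs/2+1 is assumed to be integer division.

```python
def flag_nibble(flags: bytes, index: int) -> int:
    """Return the half byte flag field at position index."""
    byte = flags[index // 2]
    # Even indices use the least significant half byte, odd the most significant.
    return (byte >> 4) if index % 2 else (byte & 0x0F)


def decode_flags(flags: bytes, num_refs: int):
    """Return ([(new_cache_entry, segment_index), ...], long_atoms)."""
    if len(flags) != num_refs // 2 + 1:
        raise ValueError("Flags must be NumberOfAtomCacheRefs/2+1 bytes")
    refs = []
    for index in range(num_refs):
        nibble = flag_nibble(flags, index)
        # Most significant bit: NewCacheEntryFlag. Low three bits: SegmentIndex.
        refs.append((bool(nibble & 0x8), nibble & 0x7))
    # The half byte after the last reference carries the LongAtoms flag.
    long_atoms = bool(flag_nibble(flags, num_refs) & 0x1)
    return refs, long_atoms
```

With two references and the flag bytes 0x3A, 0x01, this gives a new entry in segment 2, a cached entry in segment 3, and LongAtoms set.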

After the Flags field follow the AtomCacheRefs. The first AtomCacheRef is the one corresponding to AtomCacheReferenceIndex 0. Higher indices follow in sequence up to index NumberOfAtomCacheRefs - 1.

If the NewCacheEntryFlag for the next AtomCacheRef has been set, a NewAtomCacheRef in the following format follows:

1	1 | 2	Length
InternalSegmentIndex	Length	AtomText

Table 8.7:

InternalSegmentIndex together with the SegmentIndex completely identify the location of an atom cache entry in the atom cache. Length is the number of one byte characters that the atom text consists of. Length is a two byte big-endian integer if the LongAtoms flag has been set, otherwise a one byte integer. Subsequent CachedAtomRefs with the same SegmentIndex and InternalSegmentIndex as this NewAtomCacheRef will refer to this atom until a new NewAtomCacheRef with the same SegmentIndex and InternalSegmentIndex appears.

If the NewCacheEntryFlag for the next AtomCacheRef has not been set, a CachedAtomRef in the following format follows:

InternalSegmentIndex

Table 8.8:

InternalSegmentIndex together with the SegmentIndex identify the location of the atom cache entry in the atom cache. The atom corresponding to this CachedAtomRef is the latest NewAtomCacheRef preceding this CachedAtomRef in another previously passed distribution header.
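A minimal sketch of reading the AtomCacheRefs section, assuming the flag fields have already been decoded into (NewCacheEntryFlag, SegmentIndex) pairs. The function name and the returned tuple shape are hypothetical.

```python
def parse_atom_cache_refs(data: bytes, flags, long_atoms: bool = False):
    """Parse AtomCacheRefs.

    flags is a list of (new_cache_entry, segment_index) pairs, one per
    reference. Returns (entries, rest), where each entry is
    (segment_index, internal_segment_index, atom_text_or_None).
    """
    pos = 0
    entries = []
    length_size = 2 if long_atoms else 1
    for is_new, segment in flags:
        internal = data[pos]  # InternalSegmentIndex, always one byte
        pos += 1
        text = None
        if is_new:
            # NewAtomCacheRef: Length followed by AtomText.
            length = int.from_bytes(data[pos:pos + length_size], "big")
            pos += length_size
            text = data[pos:pos + length]
            if len(text) != length:
                raise ValueError("truncated AtomText")
            pos += length
        entries.append((segment, internal, text))
    return entries, data[pos:]
```

An entry whose text is None is a CachedAtomRef; resolving it requires the cache state built from earlier distribution headers, which this sketch does not track.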

8.3 ATOM_CACHE_REF

1	1
82	AtomCacheReferenceIndex

Table 8.9:

Refers to the atom with AtomCacheReferenceIndex in the distribution header.

8.4 SMALL_INTEGER_EXT

1	1
97	Int

Table 8.10:

Unsigned 8 bit integer.

8.5 INTEGER_EXT

1	4
98	Int

Table 8.11:

Signed 32 bit integer in big-endian format (i.e. MSB first).
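The two integer encodings can be decoded as follows. This is a sketch; decode_integer is a hypothetical name and the input is assumed to start at the tag byte.

```python
import struct


def decode_integer(term: bytes) -> int:
    """Decode SMALL_INTEGER_EXT (97) or INTEGER_EXT (98)."""
    tag = term[0]
    if tag == 97:
        return term[1]  # unsigned 8 bit integer
    if tag == 98:
        # signed 32 bit integer, big-endian
        return struct.unpack(">i", term[1:5])[0]
    raise ValueError("not an integer tag: %d" % tag)
```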

8.6 FLOAT_EXT

1	31
99	Float String

Table 8.12:

A float is stored in string format. The format used in sprintf to format the float is "%.20e" (there are more bytes allocated than necessary). To unpack the float, use sscanf with format "%lf".

This term is used in minor version 0 of the external format; it has been superseded by NEW_FLOAT_EXT.
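A Python sketch of the string float format follows. It assumes the unused bytes of the 31 byte field are zero bytes, which the text above does not state explicitly; the function names are hypothetical.

```python
def encode_float_ext(value: float) -> bytes:
    """Encode a float as FLOAT_EXT: tag 99 plus a 31 byte string field."""
    text = ("%.20e" % value).encode("ascii")
    # Assumption: the field is padded with zero bytes up to 31 bytes.
    return bytes([99]) + text.ljust(31, b"\x00")


def decode_float_ext(term: bytes) -> float:
    """Decode FLOAT_EXT by parsing the string field."""
    if len(term) < 32 or term[0] != 99:
        raise ValueError("not a FLOAT_EXT term")
    return float(term[1:32].rstrip(b"\x00").decode("ascii"))
```

Because "%.20e" prints more digits than a double needs, a value survives an encode and decode round trip unchanged.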

8.7 ATOM_EXT

1	2	Len
100	Len	AtomName

Table 8.13:

An atom is stored with a 2 byte unsigned length in big-endian order, followed by Len 8 bit characters that form the AtomName. Note: The maximum allowed value for Len is 255.

8.8 REFERENCE_EXT

1	N	4	1
101	Node	ID	Creation

Table 8.14:

Encode a reference object (an object generated with make_ref/0). The Node term is an encoded atom, i.e. ATOM_EXT, SMALL_ATOM_EXT or ATOM_CACHE_REF. The ID field contains a big-endian unsigned integer, but should be regarded as uninterpreted data since this field is node specific. Creation is a byte containing a node serial number that makes it possible to separate old (crashed) nodes from a new one.

In ID, only 18 bits are significant; the rest should be 0. In Creation, only 2 bits are significant; the rest should be 0. See NEW_REFERENCE_EXT.

8.9 PORT_EXT

1	N	4	1
102	Node	ID	Creation

Table 8.15:

Encode a port object (obtained from open_port/2). The ID is a node specific identifier for a local port. Port operations are not allowed across node boundaries. The Creation works just like in REFERENCE_EXT.

8.10 PID_EXT

1	N	4	4	1
103	Node	ID	Serial	Creation

Table 8.16:

Encode a process identifier object (obtained from spawn/3 or friends). The ID and Creation fields work just like in REFERENCE_EXT, while the Serial field is used to improve safety. In ID, only 15 bits are significant; the rest should be 0.
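The fixed-size fields that follow the Node atom in PID_EXT can be read as sketched below. The function name is hypothetical, and rejecting input with non-zero reserved bits is a strictness choice of this sketch, not a requirement stated above.

```python
import struct


def decode_pid_fields(fields: bytes):
    """Decode ID, Serial and Creation from the 9 bytes after the Node atom."""
    if len(fields) < 9:
        raise ValueError("PID_EXT needs 9 bytes after the Node atom")
    pid_id, serial, creation = struct.unpack(">IIB", fields[:9])
    # Only 15 bits of ID and 2 bits of Creation are significant.
    if pid_id >> 15 or creation >> 2:
        raise ValueError("reserved bits are not 0")
    return pid_id, serial, creation
```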

8.11 SMALL_TUPLE_EXT

1	1	N
104	Arity	Elements

Table 8.17:

SMALL_TUPLE_EXT encodes a tuple. The Arity field is an unsigned byte that determines how many elements follow in the Elements section.

8.12 LARGE_TUPLE_EXT

1	4	N
105	Arity	Elements

Table 8.18:

Same as SMALL_TUPLE_EXT with the exception that Arity is an unsigned 4 byte integer in big endian format.
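Since the Elements section is simply Arity encoded terms placed back to back, a decoder is naturally recursive. The sketch below handles only tags 97, 104, 105 and 106 to keep it short; the function name decode and the mapping to Python tuples and lists are assumptions of this example.

```python
def decode(buf: bytes, pos: int = 0):
    """Return (value, next_pos) for the term starting at buf[pos]."""
    tag = buf[pos]
    if tag == 97:  # SMALL_INTEGER_EXT
        return buf[pos + 1], pos + 2
    if tag == 106:  # NIL_EXT, shown here as an empty Python list
        return [], pos + 1
    if tag in (104, 105):  # SMALL_TUPLE_EXT or LARGE_TUPLE_EXT
        if tag == 104:
            arity = buf[pos + 1]
            pos += 2
        else:
            arity = int.from_bytes(buf[pos + 1:pos + 5], "big")
            pos += 5
        items = []
        for _ in range(arity):
            # Each element is itself a complete encoded term.
            item, pos = decode(buf, pos)
            items.append(item)
        return tuple(items), pos
    raise ValueError("tag %d is not handled in this sketch" % tag)
```

The bytes 104, 2, 97, 1, 104, 2, 97, 2, 106 decode to the nested tuple {1, {2, []}}.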

8.13 NIL_EXT

106

Table 8.19:

The representation for an empty list, i.e. the Erlang syntax [].

8.14 STRING_EXT

1	2	Len
107	Length	Characters

Table 8.20:

String does NOT have a corresponding Erlang representation, but is an optimization for sending lists of bytes (integers in the range 0-255) more efficiently over the distribution. Since the Length field is an unsigned 2 byte integer (big-endian), implementations must make sure that lists longer than 65535 elements are encoded as LIST_EXT.

8.15 LIST_EXT

1	4
108	Length	Elements	Tail

Table 8.21:

Length is the number of elements that follow in the Elements section. Tail is the final tail of the list; it is NIL_EXT for a proper list, but may be any type if the list is improper (for instance [a|b]).
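The choice between NIL_EXT, STRING_EXT and LIST_EXT can be sketched for proper lists of integers. The function names are hypothetical; integers outside the signed 32 bit range are not supported here and make struct.pack raise an error.

```python
import struct


def encode_int(n: int) -> bytes:
    """Encode an integer as SMALL_INTEGER_EXT or INTEGER_EXT."""
    if 0 <= n <= 255:
        return bytes([97, n])
    return bytes([98]) + struct.pack(">i", n)


def encode_int_list(items) -> bytes:
    """Encode a proper list of integers."""
    if not items:
        return bytes([106])  # NIL_EXT
    if len(items) <= 65535 and all(0 <= x <= 255 for x in items):
        # STRING_EXT: 2 byte length followed by the raw bytes.
        return bytes([107]) + struct.pack(">H", len(items)) + bytes(items)
    # LIST_EXT: 4 byte length, encoded elements, then NIL_EXT as Tail.
    body = b"".join(encode_int(x) for x in items)
    return bytes([108]) + struct.pack(">I", len(items)) + body + bytes([106])
```

A list of 65536 byte values falls through to LIST_EXT, as the STRING_EXT length limit requires.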

8.16 BINARY_EXT

1	4	Len
109	Len	Data

Table 8.22:

Binaries are generated with a bit syntax expression or with list_to_binary/1, term_to_binary/1, or as input from binary ports. The Len length field is an unsigned 4 byte integer (big-endian).

8.17 SMALL_BIG_EXT

1	1	1	n
110	n	Sign	d(0) ... d(n-1)

Table 8.23:

Bignums are stored in sign-magnitude form, with a Sign byte that is 0 if the bignum is positive and 1 if it is negative. The digits are stored with the least significant byte first. To calculate the integer, the following formula can be used:
B = 256
(d(0)*B^0 + d(1)*B^1 + d(2)*B^2 + ... + d(n-1)*B^(n-1))
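The formula amounts to reading the digits as a little-endian base 256 number. A minimal Python sketch, with the hypothetical name decode_small_big:

```python
def decode_small_big(term: bytes) -> int:
    """Decode SMALL_BIG_EXT: tag 110, n, Sign, then n digit bytes."""
    if term[0] != 110:
        raise ValueError("not a SMALL_BIG_EXT term")
    n, sign = term[1], term[2]
    digits = term[3:3 + n]
    if len(digits) != n:
        raise ValueError("truncated digits")
    # Same value as summing digit * 256**i, with digit 0 first.
    magnitude = int.from_bytes(digits, "little")
    return -magnitude if sign else magnitude
```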

8.18 LARGE_BIG_EXT

1	4	1	n
111	n	Sign	d(0) ... d(n-1)

Table 8.24:

Same as SMALL_BIG_EXT with the difference that the length field is an unsigned 4 byte integer.

8.19 NEW_REFERENCE_EXT

1	2	N	1	N'
114	Len	Node	Creation	ID ...

Table 8.25:

Node and Creation are as in REFERENCE_EXT.

ID contains a sequence of big-endian unsigned integers (4 bytes each, so N' is a multiple of 4), but should be regarded as uninterpreted data.

N' = 4 * Len.

In the first word (four bytes) of ID, only 18 bits are significant; the rest should be 0. In Creation, only 2 bits are significant; the rest should be 0.

NEW_REFERENCE_EXT was introduced with distribution version 4. In version 4, N' should be at most 12.

See REFERENCE_EXT.

8.20 SMALL_ATOM_EXT

1	1	Len
115	Len	AtomName

Table 8.26:

An atom is stored with a 1 byte unsigned length, followed by Len 8 bit characters that form the AtomName. Longer atoms can be represented by ATOM_EXT. Note that SMALL_ATOM_EXT was introduced in erts version 5.7.2 and requires a small atom distribution flag exchanged in the distribution handshake.
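Both atom encodings differ only in the width of the length field, so one decoder can cover them. In this sketch the name decode_atom is hypothetical, and the 8 bit characters are assumed to be Latin-1.

```python
def decode_atom(term: bytes):
    """Decode ATOM_EXT (100) or SMALL_ATOM_EXT (115).

    Returns (atom_name, rest) where rest is the unconsumed input.
    """
    tag = term[0]
    if tag == 100:
        length = int.from_bytes(term[1:3], "big")
        start = 3
        if length > 255:
            raise ValueError("Len must not exceed 255")
    elif tag == 115:
        length = term[1]
        start = 2
    else:
        raise ValueError("not an atom tag: %d" % tag)
    name = term[start:start + length]
    if len(name) != length:
        raise ValueError("truncated AtomName")
    # Assumption: one byte per character, interpreted as Latin-1.
    return name.decode("latin-1"), term[start + length:]
```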

8.21 FUN_EXT

1	4	N1	N2	N3	N4	N5
117	NumFree	Pid	Module	Index	Uniq	Free vars ...

Table 8.27:

Pid: is a process identifier as in PID_EXT. It represents the process in which the fun was created.
Module: is encoded as an atom, using ATOM_EXT, SMALL_ATOM_EXT or ATOM_CACHE_REF. This is the module that the fun is implemented in.
Index: is an integer encoded using SMALL_INTEGER_EXT or INTEGER_EXT. It is typically a small index into the module's fun table.
Uniq: is an integer encoded using SMALL_INTEGER_EXT or INTEGER_EXT. Uniq is the hash value of the parse tree for the fun.
Free vars: is NumFree number of terms, each one encoded according to its type.

8.22 NEW_FUN_EXT

1	4	1	16	4	4	N1	N2	N3	N4	N5
112	Size	Arity	Uniq	Index	NumFree	Module	OldIndex	OldUniq	Pid	Free Vars

Table 8.28:

This is the new encoding of internal funs: fun F/A and fun(Arg1,..) -> ... end.

Size: is the total number of bytes, including the Size field.
Arity: is the arity of the function implementing the fun.
Uniq: is the 16 byte MD5 of the significant parts of the Beam file.
Index: is an index number. Each fun within a module has a unique index. Index is stored in big-endian byte order.
NumFree: is the number of free variables.
Module: is encoded as an atom, using ATOM_EXT, SMALL_ATOM_EXT or ATOM_CACHE_REF. This is the module that the fun is implemented in.
OldIndex: is an integer encoded using SMALL_INTEGER_EXT or INTEGER_EXT. It is typically a small index into the module's fun table.
OldUniq: is an integer encoded using SMALL_INTEGER_EXT or INTEGER_EXT. OldUniq is the hash value of the parse tree for the fun.
Pid: is a process identifier as in PID_EXT. It represents the process in which the fun was created.
Free vars: is NumFree number of terms, each one encoded according to its type.

8.23 EXPORT_EXT

1	N1	N2	N3
113	Module	Function	Arity

Table 8.29:

This term is the encoding for external funs: fun M:F/A.

Module and Function are atoms (encoded using ATOM_EXT, SMALL_ATOM_EXT or ATOM_CACHE_REF).

Arity is an integer encoded using SMALL_INTEGER_EXT.

8.24 BIT_BINARY_EXT

1	4	1	Len
77	Len	Bits	Data

Table 8.30:

This term represents a bitstring whose length in bits is not a multiple of 8 (created using the bit syntax in R12B and later). The Len field is an unsigned 4 byte integer (big endian). The Bits field is the number of bits that are used in the last byte in the data field, counting from the most significant bit towards the least significant.
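The relation between Len, Bits and the length of the bitstring can be sketched as follows. The function name is hypothetical, and requiring at least one data byte with Bits between 1 and 8 is an assumption of this sketch.

```python
def decode_bit_binary(term: bytes):
    """Decode BIT_BINARY_EXT. Returns (data, bit_length)."""
    if term[0] != 77:
        raise ValueError("not a BIT_BINARY_EXT term")
    length = int.from_bytes(term[1:5], "big")
    bits = term[5]
    data = term[6:6 + length]
    if len(data) != length:
        raise ValueError("truncated data")
    if length < 1 or not 1 <= bits <= 8:
        raise ValueError("unexpected Len or Bits value")
    # All bytes but the last are full; the last contributes Bits bits,
    # counted from its most significant bit.
    return data, (length - 1) * 8 + bits
```

For example, a 3 bit bitstring holding the value 5 is carried as the single data byte 160 (binary 10100000) with Bits set to 3.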

8.25 NEW_FLOAT_EXT

1	8
70	IEEE float

Table 8.31:

A float is stored as 8 bytes in big-endian IEEE format.

This term is used in minor version 1 of the external format.
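Decoding this format is a single unpack of a big-endian IEEE 754 double. A minimal sketch, with the hypothetical names decode_new_float and encode_new_float:

```python
import struct


def decode_new_float(term: bytes) -> float:
    """Decode NEW_FLOAT_EXT: tag 70 followed by 8 bytes."""
    if len(term) < 9 or term[0] != 70:
        raise ValueError("not a NEW_FLOAT_EXT term")
    return struct.unpack(">d", term[1:9])[0]


def encode_new_float(value: float) -> bytes:
    """Encode a float as NEW_FLOAT_EXT."""
    return bytes([70]) + struct.pack(">d", value)
```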