More Optimizations in the Compiler and JIT
This post explores the enhanced type-based optimizations and the other performance improvements in Erlang/OTP 26.
What to expect of the JIT in OTP 26 #
In OTP 25, the compiler was updated to embed type information in the BEAM file and the JIT was extended to emit better code based on that type information. Those improvements were described in the blog post Type-Based Optimizations in the JIT.
As mentioned in that blog post, there were limitations in both the compiler and the JIT that prevented many optimizations. In OTP 26, the compiler will produce better type information and the JIT will take better advantage of the improved type information, typically resulting in fewer redundant type tests and smaller native code size.
A new BEAM instruction introduced in OTP 26 makes record updates faster by a small but measurable amount.
The most noticable performance improvements in OTP 26 are probably for
matching and construction of binaries using the bit syntax. Those
improvements, combined with changes to the base64
module itself,
makes encoding to Base64 about 4 times as fast and decoding from
Base64 more than 3 times as fast.
Please try this at home! #
While this blog post will show many examples of generated code, I have attempted to explain the optimizations in English as well. Feel free to skip the code examples.
On the other hand, if you want more code examples…
To examine the native code for loaded modules, start the runtime system like this:
erl +JDdump true
The native code for all modules that are loaded will be dumped to files with the
extension .asm
.
To examine the BEAM code for a module, use the -S
option when
compiling. For example:
erlc -S base64.erl
Quick overview of type-based optimizations in OTP 25 #
Let’s quickly summarize the type-based optimizations in OTP 25. For more details, see the aformentioned blog post.
First consider an addition of two values with nothing known about their types:
add1(X, Y) ->
X + Y.
The BEAM code looks like this:
{gc_bif,'+',{f,0},2,[{x,0},{x,1}],{x,0}}.
return.
Without any information about the operands, the JIT must emit code that can handle all possible types for the operands. For the x86_64 architecture, 14 native instructions are needed.
If the type of the operands are known to be integers sufficiently small making overflow impossible, the JIT needs to emit only 5 native instructions for the addition.
Here is an example where the types and ranges of the operands for the
+
operator are known:
add5(X, Y) when X =:= X band 16#3FF,
Y =:= Y band 16#3FF ->
X + Y.
The BEAM code for this function is as follows:
{gc_bif,'band',{f,24},2,[{x,0},{integer,1023}],{x,2}}.
{test,is_eq_exact,
{f,24},
[{tr,{x,0},{t_integer,any}},{tr,{x,2},{t_integer,{0,1023}}}]}.
{gc_bif,'band',{f,24},2,[{x,1},{integer,1023}],{x,2}}.
{test,is_eq_exact,
{f,24},
[{tr,{x,1},{t_integer,any}},{tr,{x,2},{t_integer,{0,1023}}}]}.
{gc_bif,'+',
{f,0},
2,
[{tr,{x,0},{t_integer,{0,1023}}},{tr,{x,1},{t_integer,{0,1023}}}],
{x,0}}.
return.
The register operands ({x,0}
and {x,1}
) have now been annotated with
type information:
{tr,Register,Type}
That is, each register operand is a three-tuple with tr
as the first
element. tr
stands for typed register. The second element is the
BEAM register ({x,0}
or {x,1}
in this case), and the third element
is the type of the register in the compiler’s internal type
representation. {t_integer,{0,1023}}
means that the value is an
integer in the inclusive range 0 through 1023.
With that type information, the JIT emits the following native code
for the +
operator:
# i_plus_ssjd
# add without overflow check
mov rax, qword ptr [rbx]
mov rsi, qword ptr [rbx+8]
and rax, -16 ; Zero the tag bits
add rax, rsi
mov qword ptr [rbx], rax
(Lines starting with #
are comments emitted by the JIT, while the
text that follows ;
is a comment added by me for clarification.)
The reduction in code size from 14 instructions down to 5 is nice, but
having to express the range check in that convoluted way using band
can hardly be called nice nor natural.
If we try to express the range checks in a more natural way:
add4(X, Y) when is_integer(X), 0 =< X, X < 16#400,
is_integer(Y), 0 =< Y, Y < 16#400 ->
X + Y.
the compiler in OTP 25 will no longer be able to figure out the ranges for the operands. Here is the BEAM code:
{test,is_integer,{f,22},[{x,0}]}.
{test,is_ge,{f,22},[{tr,{x,0},{t_integer,any}},{integer,0}]}.
{test,is_lt,{f,22},[{tr,{x,0},{t_integer,any}},{integer,1024}]}.
{test,is_integer,{f,22},[{x,1}]}.
{test,is_ge,{f,22},[{tr,{x,1},{t_integer,any}},{integer,0}]}.
{test,is_lt,{f,22},[{tr,{x,1},{t_integer,any}},{integer,1024}]}.
{gc_bif,'+',
{f,0},
2,
[{tr,{x,0},{t_integer,any}},{tr,{x,1},{t_integer,any}}],
{x,0}}.
return.
Because of that severe limitation in the compiler’s value range analysis, I wrote:
We aim to improve the type analysis and optimizations in OTP 26 and generate better code for this example.
The enhanced type-based optimizations in OTP 26 #
Compiling the same example with OTP 26, the result is:
{test,is_integer,{f,19},[{x,0}]}.
{test,is_ge,{f,19},[{tr,{x,0},{t_integer,any}},{integer,0}]}.
{test,is_ge,{f,19},[{integer,1023},{tr,{x,0},{t_integer,{0,'+inf'}}}]}.
{test,is_integer,{f,19},[{x,1}]}.
{test,is_ge,{f,19},[{tr,{x,1},{t_integer,any}},{integer,0}]}.
{test,is_ge,{f,19},[{integer,1023},{tr,{x,1},{t_integer,{0,'+inf'}}}]}.
{gc_bif,'+',
{f,0},
2,
[{tr,{x,0},{t_integer,{0,1023}}},{tr,{x,1},{t_integer,{0,1023}}}],
{x,0}}.
The BEAM instruction for the +
operator now have ranges for its operands.
Let’s look at little bit closer at the first three instructions, which
corresponds to the guard test is_integer(X), 0 =< X, X < 16#400
.
First is the guard check for an integer:
{test,is_integer,{f,19},[{x,0}]}.
It is followed by the guard test 0 =< X
(rewritten to X >= 0
by the compiler):
{test,is_ge,{f,19},[{tr,{x,0},{t_integer,any}},{integer,0}]}.
As a result of the is_integer/1
test it is known that {x,0}
is an integer.
The third instruction corresponds to X < 16#400
, which the compiler
has rewritten to 16#3FF >= X
(1023 >= X
):
{test,is_ge,{f,19},[{integer,1023},{tr,{x,0},{t_integer,{0,'+inf'}}}]}.
In the type for the {x,0}
register there is something new for
OTP 26. It says that the range is 0 through '+inf'
, that is, from 0 up
to positive infinity. Combining that range with the range from this
instruction, the Erlang compiler can infer that if this instruction
succeeds, the type for {x,0}
is t_integer,{0,1023}}
.
Combining guard tests #
In OTP 25, the JIT would emit native code for each BEAM instruction in the guard individually. When translated individually, the three guards tests for one of the variables each require 11 native instructions, or 33 instructions for all three.
By having the BEAM loader combine the three guard tests into a
single is_int_range
instruction, the JIT is capable of doing a much
better job and emit a mere 6 native instructions.
How is that possible?
As individual BEAM instructions, each guard test needs 5 instructions
to fetch the value from {x,0}
and test that the value is a small
integer. As a combined instruction, that only needs to be done once.
Other parts of the guard tests also become redundant in the combined
instruction and can be omitted. For example, the is_integer/1
type
test will also succeed if its argument is a bignum (an integer
that does not fit in a machine word). Clearly, a bignum will fall well
outside the range 0 through 1023, so if the argument is not a small
integer, the combined guard test will fail immediately.
With those and some other simplifications, we end up with the following native instructions:
# is_int_in_range_fScc
mov rax, qword ptr [rbx]
sub rax, 15
test al, 15
short jne label_19
cmp rax, 16368
short ja label_19
The first instruction fetches the value of {x,0}
to the CPU
register rax
:
mov rax, qword ptr [rbx]
The next instruction subtracts the tagged value for the lower
bound of the range. Since the lower bound of the range is 0 and the
tag for small integers is 15, the value that is subtracted
is 16 * 0 + 15
or simply 15. (For small integers, the runtime system
uses the 4 least significant bits of the word as tag bits.)
If the lower bound would have been 1, the value to be subtracted would
have been 16 * 1 + 15
or 31:
sub rax, 15
The subtraction achieves two aims at once. Firstly, it simplifies the
tag test in the next two instructions because if the value of of
{x,0}
is a small integer, the 4 least significants bits will now be
zero:
test al, 15
short jne label_19
The test al, 15
instruction does a bitwise AND operation of the
lower byte of the CPU register rax
, discarding the result but
setting CPU flags depending on the value. The next instruction tests
whether the result was nonzero (the tag was not the tag for a small
integer), in which case the test fails and a jump to the failure
label is made.
The second aim for the subtraction is to simplify the range check.
If the value being tested was below the lower bound, the value
of rax
will be negative after the subtraction.
Since integers are represented in two’s complement notation, a
signed negative integer interpreted as an unsigned integer will be a
very large integer. Therefore, both bounds can be checked at once
using the old trick of treating the value in rax
as unsigned:
cmp rax, 16368
short ja label_19
The cmp rax, 16368
instruction compares the value in rax
with the
difference of the tagged upper bound and the tagged lower bound, that
is:
(16 * 1023 + 15) - (16 * 0 + 15)
ja
stands for “Jump (if) Above”, that is, jump if the CPU flags
indicates that in previous comparison of unsigned integers the first
integer was greater than the second. Since a negative number
represented in two’s complement notation looks like a huge integer
when interpreted as an unsigned integer, short ja label_19
will
transfer control to the failure label for values both below the lower
bound and above the upper bound.
More code generation improvements #
The JIT in OTP 26 generates better code for common combinations of
relational operators. In order to reduce the number of combinations
that the JIT will need to handle, the compiler rewrites the <
operator to >=
if possible. In the previous example, it was shown
that the compiler rewrote X < 1024
to 1023 >= X
.
Let’s look at a contrived example to show (off) a few more improvements in the code generation:
add6(M) when is_map(M) ->
A = map_size(M),
if
9 < A, A < 100 ->
A + 6
end.
The main part of the BEAM code looks like this:
{test,is_map,{f,41},[{x,0}]}.
{gc_bif,map_size,{f,0},1,[{tr,{x,0},{t_map,any,any}}],{x,0}}.
{test,is_ge,
{f,43},
[{tr,{x,0},{t_integer,{0,288230376151711743}}},{integer,10}]}.
{test,is_ge,
{f,43},
[{integer,99},{tr,{x,0},{t_integer,{10,288230376151711743}}}]}.
{gc_bif,'+',{f,0},1,[{tr,{x,0},{t_integer,{10,99}}},{integer,6}],{x,0}}.
return.
In OTP 26, the JIT will inline the code for many of the most
frequently used guard BIFs. Here is the native code for the
map_size/1
call:
# bif_map_size_jsd
mov rax, qword ptr [rbx] ; Fetch map from {x,0}
# skipped type check because the argument is always a map
mov rax, qword ptr [rax+6] ; Fetch size of map
shl rax, 4
or al, 15 ; Tag as small integer
mov qword ptr [rbx], rax ; Store size in {x,0}
The two is_ge
instructions are combined by the BEAM loader into
an is_in_range
instruction:
# is_in_range_ffScc
# simplified fetching of BEAM register
mov rdi, rax
# skipped test for small operand since it always small
sub rdi, 175
cmp rdi, 1424
ja label_43
The first instruction is a new optimization in OTP 26. Normally {x,0}
is
fetched using the instruction mov rax, qword ptr [rbx]
. However, in this
case, the last instruction in the previous BEAM instruction is the instruction
mov qword ptr [rbx], rax
. Therefore, since it is known that the contents of
{x,0}
is already in CPU register rax
, the instruction can be simplified
to:
# simplified fetching of BEAM register
mov rdi, rax
The size of a map that will fit in memory on a 64-bit computer is always a small integer, so the test for a small integer is skipped:
# skipped test for small operand since it always small
sub rdi, 175 ; Subtract 16 * 10 + 15
cmp rdi, 1424 ; Compare with (16*99+15)-(16*10+15)
ja label_43
The native code for the +
operator looks like this:
# i_plus_ssjd
# add without overflow check
mov rax, qword ptr [rbx]
add rax, 96 ; 16 * 6 + 0
mov qword ptr [rbx], rax
New BEAM instructions in OTP 26 #
The previous example of combining guard tests showed that the JIT can often generate better code if multiple BEAM instructions are combined into one. While the BEAM loader is capable of combining instructions it is often more practical to let the Erlang compiler emit combined instructions.
OTP 26 introduces two new instructions, each of which replaces a sequence of any number of simpler instructions:
-
update_record
for updating any number of fields in a record. -
bs_match
for matching multiple segments of fixed size.
In OTP 25, the bs_create_bin
instruction for constructing a binary
with any number of segments was introduced, but its full potential for
generating efficient code was not leveraged in OTP 25.
Updating records in OTP 25 #
Consider the following example of a record definition and three functions that update the record:
-record(r, {a,b,c,d,e}).
update_a(R) ->
R#r{a=42}.
update_ce(R) ->
R#r{c=99,e=777}.
update_bcde(R) ->
R#r{b=2,c=3,d=4,e=5}.
In OTP 25 and earlier, the way in which a record is updated depends on both the number of fields being updated and the size of the record.
When a single field in a record is updated, as in update_a/1
, the
setelement/3
BIF is called:
{test,is_tagged_tuple,{f,34},[{x,0},6,{atom,r}]}.
{move,{x,0},{x,1}}.
{move,{integer,42},{x,2}}.
{move,{integer,2},{x,0}}.
{call_ext_only,3,{extfunc,erlang,setelement,3}}.
When updating more than one field but fewer than approximately half of
the fields, as in update_ce/1
, code similar to the following is
emitted:
{test,is_tagged_tuple,{f,37},[{x,0},6,{atom,r}]}.
{allocate,0,1}.
{move,{x,0},{x,1}}.
{move,{integer,777},{x,2}}.
{move,{integer,6},{x,0}}.
{call_ext,3,{extfunc,erlang,setelement,3}}.
{set_tuple_element,{integer,99},{x,0},3}.
{deallocate,0}.
return.
Here the e
field is updated using setelement/3
, followed by
set_tuple_element
to update the c
field destructively. Erlang does
not allow mutation of terms, but here it is done “under the hood” in a
safe way.
When a majority of the fields are updated, as in update_bcde/1
, a
new tuple is built:
{test,is_tagged_tuple,{f,40},[{x,0},6,{atom,r}]}.
{test_heap,7,1}.
{get_tuple_element,{x,0},1,{x,0}}.
{put_tuple2,{x,0},
{list,[{atom,r},
{x,0},
{integer,2},
{integer,3},
{integer,4},
{integer,5}]}}.
return.
Updating records in OTP 26 #
In OTP 26, all records are updated using the new BEAM instruction
update_record
. For example, here is the main part of the BEAM code
for update_1
:
{test,is_tagged_tuple,{f,34},[{x,0},6,{atom,r}]}.
{test_heap,7,1}.
{update_record,{atom,reuse},6,{x,0},{x,0},{list,[2,{integer,42}]}}.
return.
The last operand is a list of positions in the tuple and their corresponding new values.
The first operand, {atom,reuse}
, is a hint to the JIT that it is possible
that the source tuple is already up to date and does not need to be updated.
Another possible value for the hint operand is {atom,copy}
, meaning that
the source tuple is definitely not up to date.
The JIT emits the following native code for the update_record
instruction:
# update_record_aIsdI
mov rax, qword ptr [rbx]
mov rdi, rax
cmp qword ptr [rdi+14], 687
je L130
vmovups xmm0, [rax-2]
vmovups [r15], xmm0
mov qword ptr [r15+16], 687
vmovups ymm0, [rax+22]
vmovups [r15+24], ymm0
lea rax, qword ptr [r15+2]
add r15, 56
L130:
mov qword ptr [rbx], rax
Let’s walk through those instructions. First the value of {x,0}
is fetched:
mov rax, qword ptr [rbx]
Since the hint operand is the atom reuse
, is is possible that it is
unnecessary to copy the tuple. Therefore, the JIT emits an instruction
sequence to test whether the a
field (position 2 in the tuple)
already contains the value 42. If so, the source tuple can be reused:
mov rdi, rax
cmp qword ptr [rdi+14], 687 ; 42
je L130 ; Reuse source tuple
Next follows the copy and update sequence. First the header word for
the tuple and its first element (the r
atom) are copied using
AVX instructions:
vmovups xmm0, [rax-2]
vmovups [r15], xmm0
Next the value 42 is stored into position 2 of the copy of the tuple:
mov qword ptr [r15+16], 687 ; 42
Finally the remaining four elements of the tuple are copied:
vmovups ymm0, [rax+22]
vmovups [r15+24], ymm0
All that remains is to create a tagged pointer to the newly created tuple and increment the heap pointer:
lea rax, qword ptr [r15+2]
add r15, 56
The last instruction stores the tagged pointer to either the original
or the updated tuple to {x,0}
:
L130:
mov qword ptr [rbx], rax
The BEAM code for update_ce/1
is very similar to the code for update_a/1
:
{test,is_tagged_tuple,{f,37},[{x,0},6,{atom,r}]}.
{test_heap,7,1}.
{update_record,{atom,reuse},
6,
{x,0},
{x,0},
{list,[4,{integer,99},6,{integer,777}]}}.
return.
The native code looks like this:
# update_record_aIsdI
mov rax, qword ptr [rbx]
vmovups ymm0, [rax-2]
vmovups [r15], ymm0
mov qword ptr [r15+32], 1599 ; 99
mov rdi, [rax+38]
mov [r15+40], rdi
mov qword ptr [r15+48], 12447 ; 777
lea rax, qword ptr [r15+2]
add r15, 56
mov qword ptr [rbx], rax
Note that the copying and updating is done unconditionally, despite
the reuse
hint. The JIT is free to ignore the hints. When multiple
fields are being updated, the test for whether the update is
unnecessary would be more expensive and it is also much less likely
that all of the fields would turn out to be unchanged. Therefore,
trying to reuse the original tuple is more likely to be a
pessimization rather
than an optimization.
Matching and constructing binaries in OTP 25 #
To explore the optimizations of binaries, the following example will be used:
bin_swap(<<A:8,B:24>>) ->
<<B:24,A:8>>.
Somewhat simplified, the main part of the BEAM code as emitted by the compiler in OTP 25 looks like this:
{test,bs_start_match3,{f,1},1,[{x,0}],{x,1}}.
{bs_get_position,{x,1},{x,0},2}.
{test,bs_get_integer2,
{f,2},
2,
[{x,1},
{integer,8},
1,
{field_flags,[unsigned,big]}],
{x,2}}.
{test,bs_get_integer2,
{f,2},
3,
[{x,1},
{integer,24},
1,
{field_flags,[unsigned,big]}],
{x,3}}.
{test,bs_test_tail2,{f,2},[{x,1},0]}.
{bs_create_bin,{f,0},
0,4,1,
{x,0},
{list,[{atom,integer},
1,1,nil,
{tr,{x,3},{t_integer,{0,16777215}}},
{integer,24},
{atom,integer},
2,1,nil,
{tr,{x,2},{t_integer,{0,255}}},
{integer,8}]}}.
return.
Let’s walk through the code. The first instruction sets up a match context:
{test,bs_start_match3,{f,1},1,[{x,0}],{x,1}}.
A match context holds several pieces of information needed for matching a binary.
The next instruction saves information that will be needed if matching of the binary fails for some reason:
{bs_get_position,{x,1},{x,0},2}.
The next two instructions match out two segments as integers (comments added by me):
{test,bs_get_integer2,
{f,2}, % Failure label
2, % Number of live X registers (needed for GC)
[{x,1}, % Match context register
{integer,8}, % Size of segment in units
1, % Unit value
{field_flags,[unsigned,big]}],
{x,2}}. % Destination register
{test,bs_get_integer2,
{f,2},
3,
[{x,1},
{integer,24},
1,
{field_flags,unsigned,big]}],
{x,3}}.
The next instruction makes sure that the end of the binary has now been reached:
{test,bs_test_tail2,{f,2},[{x,1},0]}.
The next instruction creates the binary with the segments swapped:
{bs_create_bin,{f,0},
0,4,1,
{x,0},
{list,[{atom,integer},
1,1,nil,
{tr,{x,3},{t_integer,{0,16777215}}},
{integer,24},
{atom,integer},
2,1,nil,
{tr,{x,2},{t_integer,{0,255}}},
{integer,8}]}}.
Before OTP 25, creation of binaries was done using multiple
instructions, similar to how binary matching is still done in
OTP 25. The reason for creating the bs_create_bin
instruction in OTP 25
was to be able to provide improved error information when construction
of a binary fails, similar to the improved BIF error
information.
When the size of a segment of size 8, 16, 32, or 64 is matched, specialized instructions are used for x86_64. The specialized instructions do everything inline provided that the segment is byte-aligned. (The JIT in OTP 25 for AArch64/ARM64 does not have these specialized instructions.) Here is the instruction for matching a segment of size 8:
# i_bs_get_integer_8_Stfd
mov rcx, qword ptr [rbx+8]
mov rsi, qword ptr [rcx+22]
lea rdx, qword ptr [rsi+8]
cmp rdx, qword ptr [rcx+30]
ja label_25
rex test sil, 7
short je L91
mov edx, 64
call L92
short jmp L90
L91:
mov rdi, qword ptr [rcx+14]
shr rsi, 3
mov qword ptr [rcx+22], rdx
movzx rax, byte ptr [rdi+rsi]
shl rax, 4
or rax, 15
L90:
mov qword ptr [rbx+16], rax
The first two instructions pick up the pointer to the match context and from the match context the current bit offset into the binary:
mov rcx, qword ptr [rbx+8] ; Load pointer to match context
mov rsi, qword ptr [rcx+22] ; Get offset in bits into binary
The next three instructions ensure that the length of the binary is at least 8 bits:
lea rdx, qword ptr [rsi+8] ; Add 8 to the offset
cmp rdx, qword ptr [rcx+30] ; Compare offset+8 with size of binary
ja label_25 ; Fail if the binary is too short
The next five instructions test whether the current byte in the binary is aligned at a byte boundary. If not, a helper code fragment is called:
rex test sil, 7 ; Test the 3 least significant bits
short je L91 ; Jump if 0 (meaning byte-aligned)
mov edx, 64 ; Load size and flags
call L92 ; Call helper fragment
short jmp L90 ; Done
A helper code fragment is a shared block of code that can be called from the native code generated for BEAM instructions, typically to handle cases that are uncommon and/or would require more native instructions than are practial to include inline. Each such code fragment has its own calling convention, typically tailor-made to be as convenient for the caller as possible. (See Further adventures in the JIT for more information about helper code fragments.)
The remaining instructions read one byte from memory, convert it to a
tagged Erlang terms, store it in {x,2}
, and advance the bit offset
in the match context:
L91:
mov rdi, qword ptr [rcx+14] ; Load base pointer for binary
shr rsi, 3 ; Convert bit offset to byte offset
mov qword ptr [rcx+22], rdx ; Update bit offset in match context
movzx rax, byte ptr [rdi+rsi] ; Read one byte from the binary
shl rax, 4 ; Multiply by 16...
or rax, 15 ; ... and add tag for a small integer
L90:
mov qword ptr [rbx+16], rax ; Store extracted integer
When matching a segment of size other than one of the special sizes mentioned earlier, the JIT will always emit a call to a general routine that can handle matching of any integer segment with any aligment, endianness, and signedness.
In OTP 25, the full potential for optimization of the bs_create_bin
instruction is not realized. The construction of each segment is done
by calling a helper routine that builds the segment. Here is the
native for the part of the bs_create_bin
instruction that builds the
integer segments:
# construct integer segment
mov edx, 24
mov rsi, qword ptr [rbx+24]
xor ecx, ecx
lea rdi, qword ptr [rbx-80]
call 4387496416
# construct integer segment
mov edx, 8
mov rsi, qword ptr [rbx+16]
xor ecx, ecx
lea rdi, qword ptr [rbx-80]
call 4387496416
Binary pattern matching in OTP 26 #
In OTP 26, there is a new BEAM bs_match
instruction used for
matching segments with sizes known at compile time. The BEAM code for
the matching code in the function head for bin_swap/1
is as follows:
{test,bs_start_match3,{f,1},1,[{x,0}],{x,1}}.
{bs_get_position,{x,1},{x,0},2}.
{bs_match,{f,2},
{x,1},
{commands,[{ensure_exactly,32},
{integer,2,{literal,[]},8,1,{x,2}},
{integer,3,{literal,[]},24,1,{x,3}}]}}.
The first two instructions are identical to their OTP 25 counterparts.
The first operand of the bs_match
instruction, {f,2}
, is the
failure label and the second operand {x,2}
is the register holding
the match context. The third operand, {commands,[...]}
, is a list of
matching commands.
The first command in the commands
list, {ensure_exactly,32}
, tests
that the remaining number of bits in the binary being matched is
exactly 32. If not, a jump is made to the failure label.
The second command extracts an integer of 8 bits and stores it in
{x,2}
. The third command extracts an integer of 24 bits and store it
in {x,3}
.
Having matching of multiple segments contained in a single BEAM instruction makes it much easier for the JIT to generate efficient code. Here is what the native code will do:
-
Test that there are at exactly 32 bits left in the binary.
-
If the segment is byte-aligned, read a 4-byte word from the binary and store it in a CPU register.
-
If the segment is not byte-aligned, read an 8-byte word from the binary and shift to extract the 32 bits needed.
-
Shift and mask out 8 bits and tag as an integer. Store into
{x,2}
. -
Shift and mask out 24 bits and tag as an integer. Store into
{x,3}
.
The native code for the bs_match
instruction (slightly simplifed) is
as follows:
# i_bs_match_fS
# ensure_exactly 32
mov rsi, qword ptr [rbx+8]
mov rax, qword ptr [rsi+30]
mov rcx, qword ptr [rsi+22]
sub rax, rcx
cmp rax, 32
jne label_3
# read 32
mov rdi, qword ptr [rsi+14]
add qword ptr [rsi+22], 32
mov rax, rcx
shr rax, 3
add rdi, rax
and ecx, 7
jnz L38
movbe edx, dword ptr [rdi]
add ecx, 32
short jmp L40
L38:
mov rdx, qword ptr [rdi-3]
shr rdx, 24
bswap rdx
L40:
shl rdx, cl
# extract integer 8
mov rax, rdx
# store extracted integer as a small
shr rax, 52
or rax, 15
mov qword ptr [rbx+16], rax
shl rdx, 8
# extract integer 24
shr rdx, 36
or rdx, 15
mov qword ptr [rbx+24], rdx
The first part of the code ensures that there are exactly 32 bits remaining in the binary:
# ensure_exactly 32
mov rsi, qword ptr [rbx+8] ; Get pointer to match context
mov rax, qword ptr [rsi+30] ; Get size of binary in bits
mov rcx, qword ptr [rsi+22] ; Get offset in bits into binary
sub rax, rcx
cmp rax, 32
jne label_3
The next part of the code does not directly correspond to the commands
in the bs_match
BEAM instruction. Instead, the code reads 32 bits
from the binary:
# read 32
mov rdi, qword ptr [rsi+14]
add qword ptr [rsi+22], 32 ; Increment bit offset in match context
mov rax, rcx
shr rax, 3
add rdi, rax
and ecx, 7 ; Test alignment
jnz L38 ; Jump if segment not byte-aligned
; Read 32 bits (4 bytes) byte-aligned and convert to big-endian
movbe edx, dword ptr [rdi]
add ecx, 32
short jmp L40
L38:
; Read a 8-byte word and extract the 32 bits that are needed.
mov rdx, qword ptr [rdi-3]
shr rdx, 24
bswap rdx ; Convert to big-endian
L40:
; Shift the read bytes to the most significant bytes of the word
shl rdx, cl
The 4 bytes read will be converted to big-endian and placed as the
most significant bytes of CPU register rdx
with the rest of the
register zeroed.
The following instructions extracts the 8 bits for the first segment and
stores it as a tagged integer in {x,2}
:
# extract integer 8
mov rax, rdx
# store extracted integer as a small
shr rax, 52
or rax, 15
mov qword ptr [rbx+16], rax
shl rdx, 8
The following instructions extracts the 24 bits for the second segment and
stores it as a tagged integer in {x,3}
:
# extract integer 24
shr rdx, 36
or rdx, 15
mov qword ptr [rbx+24], rdx
Binary construction in OTP 26 #
For binary construction in OTP 26, the compiler emits a
bs_create_bin
BEAM instruction just as in OTP 25. However, the
native code that the JIT in OTP 26 emits for that instruction bears
little resemblance to the native code emitted by OTP 25. The native
code will do the following:
-
Allocate room on the heap for a binary and initialize it with inlined native code. A helper code fragment is called to do a garbage collection if there is not sufficient room left on the heap.
-
Read the integer from
{x,3}
and untag it. -
Read the integer from
{x,2}
and untag it. Combine the value with the previous 24-bit value to obtain a 32-bit value. -
Write the combined 32 bits into the binary.
Here follows the complete native code for the bs_create_bin
instruction (somewhat simplified):
# i_bs_create_bin_jItd
# allocate heap binary
lea rdx, qword ptr [r15+56]
cmp rdx, rsp
short jbe L43
mov ecx, 4
.db 0x90
call 4343630296
L43:
lea rax, qword ptr [r15+2]
mov qword ptr [rbx-120], rax
mov qword ptr [r15], 164
mov qword ptr [r15+8], 4
add r15, 16
mov qword ptr [rbx-64], r15
mov qword ptr [rbx-56], 0
add r15, 8
# accumulate value for integer segment
xor r8d, r8d
mov rdi, qword ptr [rbx+24]
sar rdi, 4
or r8, rdi
# accumulate value for integer segment
shl r8, 8
mov rdi, qword ptr [rbx+16]
sar rdi, 4
or r8, rdi
# construct integer segment from accumulator
bswap r8d
mov rdi, qword ptr [rbx-64]
mov qword ptr [rbx-56], 32
mov dword ptr [rdi], r8d
Let’s walk through it.
The first part of the code, starting with # allocate heap binary
and
ending before the next comment line allocates a heap binary with
inlined native code. The only call to a helper code fragment is in case
there is not sufficient space left on the heap.
Next follows the construction of the segments of the binary.
Instead of writing the value of each segment to memory one at a time, multiple segments are accumulated into a CPU register. Here follows the code for the first segment to be constructed (24 bits):
# accumulate value for integer segment
xor r8d, r8d ; Initialize accumulator
mov rdi, qword ptr [rbx+24] ; Fetch {x,3}
sar rdi, 4 ; Untag
or r8, rdi ; OR into accumulator
Here follows the code for the second segment (8 bits):
# accumulate value for integer segment
shl r8, 8 ; Make room for 8 bits
mov rdi, qword ptr [rbx+16] ; Fetch {x,2}
sar rdi, 4 ; Untag
or r8, rdi ; OR into accumulator
Since there are no segments of the binary left, the accumulated value will be written out to memory:
# construct integer segment from accumulator
bswap r8d ; Make accumulator big-endian
mov rdi, qword ptr [rbx-64] ; Get pointer into binary
mov qword ptr [rbx-56], 32 ; Update size of binary
mov dword ptr [rdi], r8d ; Write 32 bits
Appending to binaries in OTP 25 #
The ancient OTP R12B release introduced an optimization for efficiently appending to a binary. Let’s look at an example to see the optimization in action:
-module(append).
-export([expand/1, expand_bc/1]).
expand(Bin) when is_binary(Bin) ->
expand(Bin, <<>>).
expand(<<B:8,T/binary>>, Acc) ->
expand(T, <<Acc/binary,B:16>>);
expand(<<>>, Acc) ->
Acc.
expand_bc(Bin) when is_binary(Bin) ->
<< <<B:16>> || <<B:8>> <= Bin >>.
Both append:expand/1
and append:expand_bc/1
take a binary and
double its size by expanding each byte to two bytes. For example:
1> append:expand(<<1,2,3>>).
<<0,1,0,2,0,3>>
2> append:expand_bc(<<4,5,6>>).
<<0,4,0,5,0,6>>
Both functions accept only binaries:
3> append:expand(<<1,7:4>>).
** exception error: no function clause matching append:expand(<<1,7:4>>,<<>>)
4> append:expand_bc(<<1,7:4>>).
** exception error: no function clause matching append:expand_bc(<<1,7:4>>)
Before looking at the BEAM code, let’s do some benchmarking using erlperf to find out which function is faster:
erlperf --init_runner_all 'rand:bytes(10_000).' \
'r(Bin) -> append:expand(Bin).' \
'r(Bin) -> append:expand_bc(Bin).'
Code || QPS Time Rel
r(Bin) -> append:expand_bc(Bin). 1 7936 126 us 100%
r(Bin) -> append:expand(Bin). 1 4369 229 us 55%
The expression for the --init_runner_all
option uses
rand:bytes/1 to create a binary with 10,000 random
bytes, which will be passed to both expand functions.
From the benchmark results, it can be seen that the expand_bc/1
function is
almost twice as fast.
To find out why, let’s compare the BEAM code for the two functions. Here is
the instruction that appends to the binary in expand/1
:
{bs_create_bin,{f,0},
0,3,8,
{x,1},
{list,[{atom,append}, % Append operation
1,8,nil,
{tr,{x,1},{t_bitstring,1}}, % Source/destination
{atom,all},
{atom,integer},
2,1,nil,
{tr,{x,2},{t_integer,{0,255}}},
{integer,16}]}}.
The first segment is an append
operation. The operand
{tr,{x,1},{t_bitstring,1}}
denotes both source and destination of
the operation. That is, the binary referenced by {x,1}
will be
mutated. Erlang normally does not allow mutation, but this mutation
is done under the hood in a way not observable from outside. That
makes the append operation much more efficient than it would be if the
source binary had to be copied.
For the binary comprehension in expand_bc/1
, there is a similar
BEAM instruction for appending to the binary:
{bs_create_bin,{f,0},
0,3,1,
{x,1},
{list,[{atom,private_append}, % Private append operation
1,1,nil,
{x,1},
{atom,all},
{atom,integer},
2,1,nil,
{tr,{x,2},{t_integer,{0,255}}},
{integer,16}]}}.
The main difference is that the binary comprehension uses the more
efficient private_append
operation instead of append
.
The append
operation has more overhead because it must produce the
correct result for code such as:
bins(Bin) ->
bins(Bin, <<>>).
bins(<<H,T/binary>>, Acc) ->
[Acc|bins(T, <<Acc/binary,H>>)];
bins(<<>>, Acc) ->
[Acc].
Running it:
1> example:bins(<<"abcde">>).
[<<>>,<<"a">>,<<"ab">>,<<"abc">>,<<"abcd">>,<<"abcde">>]
In the expand/1
function, only the final value binary being appended
to was needed. In bins/1
, all of the intermediate values of binary
are collected in a list. For correctness, the append
operations must
ensure that the binary Acc
is copied before H
is appended to
it. To be able to know when it is necessary to copy the binary, the
append
operation does some extra bookeeping that does not come
for free.
Appending to binaries in OTP 26 #
In OTP 26, there is a new optimization in the compiler that replaces
an append
operation with a private_append
operation whenever it is
correct and safe to do so. This optimization was implemented by Frej
Drejhammar. That is, the optimization will rewrite append:expand/2
to use private_append
, but not examples:bins/2
.
The difference between append:expand/1
and append:expand_bc/1
is now
much smaller:
erlperf --init_runner_all 'rand:bytes(10_000).' \
'r(Bin) -> append:expand(Bin).' \
'r(Bin) -> append:expand_bc(Bin).'
Code || QPS Time Rel
r(Bin) -> append:expand_bc(Bin). 1 13164 75988 ns 100%
r(Bin) -> append:expand(Bin). 1 12419 80550 ns 94%
expand_bc/1
is still a bit faster because the compiler emits
somewhat more efficient binary matching code for it than for the
expand/1
function.
The benefit of is_binary/1
guards #
The expand/1
function has an is_binary/1
guard test that may seem
unnecessary:
expand(Bin) when is_binary(Bin) ->
expand(Bin, <<>>).
The guard test is not necessary for correctness, because expand/2
will raise a function_clause
exception if its argument is not a
binary. However, better code will be generated for expand/2
with
the guard test.
With the guard test, the first BEAM instruction in expand/2
is:
{bs_start_match4,{atom,no_fail},2,{x,0},{x,0}}.
Without the guard test, the first BEAM instruction is:
{test,bs_start_match3,{f,3},2,[{x,0}],{x,2}}.
The bs_start_match4
instruction is more efficient because it does
not have to test that {x,0}
contains a binary.
The benchmark results show measurable increased execution time for
expand/1
if the guard test is removed:
erlperf --init_runner_all 'rand:bytes(10_000).' \
'r(Bin) -> append:expand(Bin).' \
'r(Bin) -> append:expand_bc(Bin).'
Code || QPS Time Rel
r(Bin) -> append:expand_bc(Bin). 1 13273 75366 ns 100%
r(Bin) -> append:expand(Bin). 1 11875 84236 ns 89%
Revisiting the base64
module #
Traditionally, up to OTP 25, the clause in the base64
module that does
most of the work of encoding a binary to Base64 looked like this:
encode_binary(<<B1:8, B2:8, B3:8, Ls/bits>>, A) ->
BB = (B1 bsl 16) bor (B2 bsl 8) bor B3,
encode_binary(Ls,
<<A/bits,(b64e(BB bsr 18)):8,
(b64e((BB bsr 12) band 63)):8,
(b64e((BB bsr 6) band 63)):8,
(b64e(BB band 63)):8>>).
The reason is that matching out segments of size 8 has always been specially optimized and has been much faster than matching out a segment of size 6. That is no longer true in OTP 26. With the improvements in binary matching described in this blog post, the clause can be written in a more natural way:
encode_binary(<<B1:6, B2:6, B3:6, B4:6, Ls/bits>>, A) ->
encode_binary(Ls,
<<A/bits,
(b64e(B1)):8,
(b64e(B2)):8,
(b64e(B3)):8,
(b64e(B4)):8>>);
(This is not the exact code in OTP 26, because of additional features added later.)
The benchmark results for encoding a random binary of 1,000,000 bytes to Base64 for OTP 25 is:
erlperf --init_runner_all 'rand:bytes(1_000_000).' \
'r(Bin) -> base64:encode(Bin).'
Code || QPS Time
r(Bin) -> base64:encode(Bin). 1 61 16489 us
The benchmark results for encoding a random binary of 1,000,000 bytes to Base64 for OTP 26 is:
erlperf --init_runner_all 'rand:bytes(1_000_000).' \
'r(Bin) -> base64:encode(Bin).'
Code || QPS Time
r(Bin) -> base64:encode(Bin). 1 249 4023 us
That is, encoding is about 4 times faster.
Pull requests #
Here are the main pull requests for the optimizations mentioned in this blog post:
- compiler: Improve the type analysis
- JIT: Optimise common combinations of relational operators
- JIT: Minor optimizations, which includes the optimization that avoids fetching an operand that is already in a CPU register.
- compiler: Optimize record updates
- JIT: Optimize binary matching for fixed-width segments
- JIT: Optimize creation of binaries
- compiler:
private_append
optimization for binaries