Type-Based Optimizations in the JIT
This post explores the new type-based optimizations in Erlang/OTP 25 where the compiler embeds type information in the BEAM files to help the JIT (Just-In-Time compiler) to generate better code.
The best of both worlds #
The SSA-based compiler passes introduced in OTP 22 does a sophisticated type analysis, which allows for more optimizations and better code generation. There are, however, limits to what kind of optimizations the Erlang compiler can do because a BEAM file must be possible to load on any BEAM machine running on a 32-bit or 64-bit computer. Therefore, the compiler cannot do optimizations that depend on the size of integers that fit in a machine word or on how Erlang terms are represented.
The JIT (introduced in OTP 24) knows that it is running on a 64-bit
computer and knows how Erlang terms are represented. The JIT is still
limited in how much optimization it can do because it translates a
single BEAM instruction at the time. For example, the +
operator can
add floats or integers of any size or any combination
thereof. Previously executed BEAM instructions might have made it
clear that the operands can only be small integers, but the JIT does
not know that since it only looks at one instruction at the time, and
therefore it must emit native code that handles all possible operands.
In OTP 25, the compiler has been updated to embed type information in the BEAM file and the JIT has been extended to emit better code based on that type information.
The embedded type information is versioned so that we can continue to improve the type-based optimizations in every OTP release. The loader will ignore versions it does not recognize so that the module can still be loaded without the type-based optimizations.
What to expect of the JIT in OTP 25 #
OTP 25 is just the beginning for type-based optimizations. We hope to improve both the type information from the compiler and the optimizations in the JIT in OTP 26.
How much better the native code emitted by the JIT will be depends on the nature of the code in the module.
The most commonly applied optimization is simplified tests. For example, a test for a tuple can frequently be reduced from 5 instructions down to 3 instructions, and a test for small integer operands can frequently be reduced from 5 instructions down to 4 instructions.
Less commonly applied but more significant are the simplifications
that can be made when an integer is known to be “small” (fits in 60
bits). For example, a relational operator (such as <
) used in a
guard can be reduced from 11 instructions down to 4 if the operands
are known to be small integers. This kind of optimization is most
often applied in modules that use binary pattern matching because
integers matched out from a binary have a well-defined range.
In the Erlang/OTP code base, the first kind of optimizations (shaving off one or two instructions) are applied roughly ten times as often as the second kind.
We will see later in this blog post that the optimizations of the
second kind applied to the base64
module resulted in a significant
speed up.
Simplifications of type tests #
Let’s dive right into some examples.
Consider this module:
-module(example).
-export([tuple_matching/1]).
tuple_matching(X) ->
case increment(X) of
{ok,Result} -> Result;
error -> X
end.
increment(X) when is_integer(X) -> {ok,X+1};
increment(_) -> error.
The BEAM code for the tuple_matching/1
function emitted
by the compiler in OTP 24 is (somewhat simplified):
{allocate,1,1}.
{move,{x,0},{y,0}}.
{call,1,{f,5}}.
{test,is_tuple,{f,3},[{x,0}]}.
{get_tuple_element,{x,0},1,{x,0}}.
{deallocate,1}.
return.
{label,3}.
{move,{y,0},{x,0}}.
{deallocate,1}.
return.
The compiler has figured out that the increment/1
returns either the
atom error
or a two-tuple with ok
as the first element. Therefore,
to distinguish between those two possible return values, a single
instruction suffices:
{test,is_tuple,{f,3},[{x,0}]}.
There is no need to explicitly test for the value error
because it
must be error
if it is not a tuple. Similarly, there is no need
to test that the first element of the tuple is ok
because it must be.
In OTP 24, the JIT translates that instruction to a sequence of 5 native instructions for x86_64:
# i_is_tuple_fs
mov rsi, qword ptr [rbx]
rex test sil, 1
jne L2
test byte ptr [rsi-2], 63
jne L2
(Lines starting with #
are comments.)
The mov
instruction fetches the value of the BEAM register {x,0}
to the CPU register rsi
. The next two instructions test whether the
term is a pointer to an object on the heap. If it is, the header word
for the heap object is tested to make sure it is a tuple. The second
test is needed because the heap object could be some other Erlang term,
such as a binary, a map, or an integer that does not fit in a machine
word.
Now let’s see what the compiler and the JIT in OTP 25 do with this instruction. The BEAM code is now:
{test,is_tuple,
{f,3},
[{tr,{x,0},
{t_union,{t_atom,[error]},
none,none,
[{{2,{t_atom,[ok]}},
{t_tuple,2,true,
#{1 => {t_atom,[ok]},
2 => {t_integer,any}}}}],
none}}]}.
The operand that was {x,0}
in OTP 24 is now a tuple:
{tr,Register,Type}
That is, it is a three-tuple with tr
as the first element. tr
stands for typed register. The second element is the BEAM register
({x,0}
in this case), and the third element is the type of the
register in the compiler’s internal type representation. The type
is equivalent to the following type spec:
'error' | {'ok', integer()}
The JIT cannot take advantage of that level of detail in the types, so the compiler embeds a simplified version of that type into the BEAM file. The embedded type is equivalent to:
atom() | tuple()
By knowing that {x,0}
must be an atom or a tuple, the JIT in OTP 25
emits the following simplified native code:
# i_is_tuple_fs
mov rsi, qword ptr [rbx]
# simplified tuple test since the source is always a tuple when boxed
rex test sil, 1
jne label_3
(The JIT generally emits a comment when type information made a simplification possible.)
Only the first test is now necessary, because if the term is a pointer to a heap object, according to the type information, it must be a tuple.
Simplification of relational operators #
As another example, let’s look at how the relational operators in guards are translated. Consider this function:
my_less_than(A, B) ->
if
A < B -> smaller;
true -> larger_or_equal
end.
The BEAM code looks like this:
{test,is_lt,{f,9},[{x,0},{x,1}]}.
{move,{atom,smaller},{x,0}}.
return.
{label,9}.
{move,{atom,larger_or_equal},{x,0}}.
return.
When relational operators are used as guard tests, the compiler rewrites
them as special instructions. Thus, the <
operator is rewritten to an
is_lt
instruction.
The <
operator can compare any Erlang terms. It would be impractical
for the JIT to emit the code to handle all kinds of terms. Therefore, the
JIT emits code that directly handles the most common case and
calls a generic routine to handle everything else:
# is_lt_fss
mov rsi, qword ptr [rbx+8]
mov rdi, qword ptr [rbx]
mov eax, edi
and eax, esi
and al, 15
cmp al, 15
short jne L39
cmp rdi, rsi
short jmp L40
L39:
call 5447639136
L40:
jge label_9
Let’s walk through the code. The first two instructions:
mov rsi, qword ptr [rbx+8]
mov rdi, qword ptr [rbx]
fetches the BEAM registers {x,1}
and {x,0}
into CPU registers.
The most common comparison is between two integers. Depending on the
magnitude, integers can be represented in two different ways. On a 64-bit
computer, signed integers that fit in 60 bits will be stored directly
in a 64-bit word. The remaining 4 bits in the words are used for the
tag, which for a small integer is 15
. If the integer does
not fit, it will be represented as a bignum, which is pointer to
an object on the heap.
Here is the native code for testing that both operands are small:
mov eax, edi
and eax, esi
and al, 15
cmp al, 15
short jne L39
If one or both of the operands have another tag than 15
(are not
small integers), control is transferred to code at label L39
that
handles all other types of terms.
The next lines do the comparison of the small integers. The code is
written in a slightly convoluted way so that the conditional jump
(jge label_9
) that transfers control to the failure label can be
shared with the generic code:
cmp rdi, rsi
short jmp L40
L39:
call 5447639136
L40:
jge label_9
Thus, without type information, 11 instructions are needed to implement
is_lt
.
Now let’s see what happens when types are available:
my_less_than(A, B) when is_integer(A), is_integer(B) ->
.
.
.
When compiled by the compiler in OTP 25, the BEAM code is:
{test,is_integer,{f,7},[{x,0}]}.
{test,is_integer,{f,7},[{x,1}]}.
{test,is_lt,{f,9},[{tr,{x,0},{t_integer,any}},{tr,{x,1},{t_integer,any}}]}.
{move,{atom,smaller},{x,0}}.
return.
{label,9}.
{move,{atom,larger_or_equal},{x,0}}.
return.
The operands for the is_lt
instruction now have types. The BEAM
registers {x,0}
and {x,1}
have the type {t_integer,any}
, which
means an integer with an unknown range.
Having that knowledge of the types, the JIT can emit a slightly shorter test for a small integer:
# simplified small test since all other types are boxed
mov eax, edi
and eax, esi
test al, 1
short je L39
To do a better job, the JIT will need better type information. For example:
map_size_less_than(Map1, Map2) ->
if
map_size(Map1) < map_size(Map2) -> smaller;
true -> larger_or_equal
end.
The BEAM code looks like this:
{gc_bif,map_size,{f,12},2,[{x,0}],{x,0}}.
{gc_bif,map_size,{f,12},2,[{x,1}],{x,1}}.
{test,is_lt,
{f,12},
[{tr,{x,0},{t_integer,{0,288230376151711743}}},
{tr,{x,1},{t_integer,{0,288230376151711743}}}]}.
{move,{atom,smaller},{x,0}}.
return.
{label,12}.
{move,{atom,larger_or_equal},{x,0}}.
return.
Both operands for is_lt
now have the type
{t_integer,{0,288230376151711743}}
, meaning an integer in the range
0 through 288230376151711743 (that is, (1 bsl 58) - 1
). There is no
documented upper limit for the number of elements in a map, but for
the foreseeable future, there is no way that the number of elements in
a map will exceed or even get close to 288230376151711743.
Since both the lower and upper bounds for {x,0}
and {x,1}
fit in
60 bits, there is no need to test the type of the operands:
# is_lt_fss
mov rsi, qword ptr [rbx+8]
mov rdi, qword ptr [rbx]
# skipped test for small operands since they are always small
cmp rdi, rsi
L42:
L43:
jge label_12
Since the operands are always small, the call to the generic routine
(following label L42
) has been omitted.
Simplification of addition #
Looking at arithmetic instructions, we will see the potential for nice simplifications by the JIT, but unfortunately we will also see the limitations of the type analysis done by the Erlang compiler in OTP 25.
Let’s look at the generated code for this function:
add1(X, Y) ->
X + Y.
The BEAM code looks like this:
{gc_bif,'+',{f,0},2,[{x,0},{x,1}],{x,0}}.
return.
The JIT translates the +
instruction to the following native instructions:
# i_plus_ssjd
mov rsi, qword ptr [rbx]
mov rdx, qword ptr [rbx+8]
# are both operands small?
mov eax, esi
and eax, edx
and al, 15
cmp al, 15
short jne L15
# add with overflow check
mov rax, rsi
mov rcx, rdx
and rcx, -16
add rax, rcx
short jno L14
L15:
call 4328985696
L14:
mov qword ptr [rbx], rax
The first two instructions:
mov rsi, qword ptr [rbx]
mov rdx, qword ptr [rbx+8]
loads the operands for the +
operation BEAM registers into CPU registers.
The next 5 instructions tests for small operands:
# are both operands small?
mov eax, esi
and eax, edx
and al, 15
cmp al, 15
short jne L15
The code is almost identical to the code in the is_lt
instruction
that we examined earlier. The only difference is that other CPU
registers are used. If one or both of the operands is not a small
integer, a jump is made to label L15
, which looks like this:
L15:
call 4328985696
This code calls a generic routine that can add any combination of
small, bignums, or floats. The generic routine will also handle
non-number operands by raising a badarith
exception.
If both operands indeed are smalls, the following code adds them and checks for overflow:
# add with overflow check
mov rax, rsi
mov rcx, rdx
and rcx, -16
add rax, rcx
short jno L14
If the addition overflowed, the generic addition routine is called. Otherwise, control is transferred to the following instruction:
mov qword ptr [rbx], rax
which stores the result in {x,0}
.
To summarize, the addition itself (including dealing with the tags) requires 4 instructions. However, 10 more instructions are needed to:
- Fetch operands from BEAM registers.
- Check that the operands are small integers (at most 60 bits).
- Calling the generic addition routine.
- Storing the result to a BEAM register.
Now let’s see what happens if types are introduced.
Consider:
add2(X0, Y0) ->
X = 2 * X0,
Y = 2 * Y0,
X + Y.
The BEAM code looks like:
{gc_bif,'*',{f,0},2,[{x,0},{integer,2}],{x,0}}.
{gc_bif,'*',{f,0},2,[{x,1},{integer,2}],{x,1}}.
{gc_bif,'+',{f,0},2,[{tr,{x,0},number},{tr,{x,1},number}],{x,0}}.
return.
Types are propagated from arithmetic instructions to other arithmetic
instructions. Because the result of *
(if it succeeds) is a number
(integer or float), the operands for the +
instruction now have the
type number
.
Based on our experience of adding types to the <
operator, we might
guess that we would save only one instruction in the type test. We
would be right:
# simplified test for small operands since both are numbers
mov eax, esi
and eax, edx
test al, 1
short je L22
Returning to the simpler example with addition and no multiplication,
let’s add a guard to ensure that X
and Y
are integers:
add3(X, Y) when is_integer(X), is_integer(Y) ->
X + Y.
That results in the following BEAM code:
{test,is_integer,{f,5},[{x,0}]}.
{test,is_integer,{f,5},[{x,1}]}.
{gc_bif,'+',
{f,0},
2,
[{tr,{x,0},{t_integer,any}},{tr,{x,1},{t_integer,any}}],
{x,0}}.
return.
The types for both operands are now {t_integer,any}
. However, that
will still result in the same simplified four-instruction sequence for
testing small integers, because the integers might not fit in 60 bits.
Clearly, based on our experience with is_lt
, we will need to establish
a range for X
and Y
. A reasonable way to do that would be:
add4(X, Y) when is_integer(X), 0 =< X, X < 16#400,
is_integer(Y), 0 =< Y, Y < 16#400 ->
X + Y.
However, because of limitations in the compiler’s value range analysis,
the types for the +
operator will not improve:
{test,is_integer,{f,19},[{x,0}]}.
{test,is_ge,{f,19},[{tr,{x,0},{t_integer,any}},{integer,0}]}.
{test,is_lt,{f,19},[{tr,{x,0},{t_integer,any}},{integer,1024}]}.
{test,is_integer,{f,19},[{x,1}]}.
{test,is_ge,{f,19},[{tr,{x,1},{t_integer,any}},{integer,0}]}.
{test,is_lt,{f,19},[{tr,{x,1},{t_integer,any}},{integer,1024}]}.
{gc_bif,'+',
{f,0},
2,
[{tr,{x,0},{t_integer,any}},{tr,{x,1},{t_integer,any}}],
{x,0}}.
return.
To add insult to injury, the first 6 instructions cannot be simplified
by the JIT because there is not sufficient type information. That is,
the is_lt
and is_ge
instructions will comprise 11 instructions each.
We aim to improve the type analysis and optimizations in OTP 26 and generate better code for this example. We are also considering adding a new guard BIF in OTP 26 for testing that a term is an integer in a given range.
Meanwhile, while we wait for OTP 26, there is a way in
OTP 25 to write an equivalent guard that will result in
much more efficient code and establish known ranges for X
and
Y
:
add5(X, Y) when X =:= X band 16#3FF,
Y =:= Y band 16#3FF ->
X + Y.
We are showing this way of writing guard for illustrative purposes only; we don’t recommend rewriting your guards in this way.
The band
operator fails if not both of its operands are integers, so
no is_integer/1
test is needed. The =:=
comparison will return
false
if the corresponding variable is outside the range 0
through
16#3FF
.
That will result in the following BEAM code, where the compiler now
has been able to figure out the possible ranges for the operands of
the +
operator:
{gc_bif,'band',{f,21},2,[{x,0},{integer,1023}],{x,2}}.
{test,is_eq_exact,
{f,21},
[{tr,{x,0},{t_integer,any}},{tr,{x,2},{t_integer,{0,1023}}}]}.
{gc_bif,'band',{f,21},2,[{x,1},{integer,1023}],{x,2}}.
{test,is_eq_exact,
{f,21},
[{tr,{x,1},{t_integer,any}},{tr,{x,2},{t_integer,{0,1023}}}]}.
{gc_bif,'+',
{f,0},
2,
[{tr,{x,0},{t_integer,{0,1023}}},{tr,{x,1},{t_integer,{0,1023}}}],
{x,0}}.
return.
Also, the 4 instructions that precede the +
instructions are now
relatively efficient.
The band
instruction needs to test the operands and be prepared to handle
integers that don’t fit in 60 bits:
# i_band_ssjd
mov rsi, qword ptr [rbx]
mov eax, 16383
# is the operand small?
mov edi, esi
and edi, 15
cmp edi, 15
short jne L97
and rax, rsi
short jmp L98
L97:
call 4456532680
short je label_25
L98:
mov qword ptr [rbx+16], rax
The is_eq_exact
instruction benefits from type information derived from
executing the band
instruction. Since the right-hand side operand is known
to be a small integer that fits in a machine word, a simple comparison is
sufficient with no need for fallback code to handle other Erlang terms:
# is_eq_exact_fss
# simplified check since one argument is an immediate
mov rdi, qword ptr [rbx+16]
cmp qword ptr [rbx], rdi
short jne label_25
The JIT generates the following code for the +
operator:
# i_plus_ssjd
# add without overflow check
mov rax, qword ptr [rbx]
mov rsi, qword ptr [rbx+8]
and rax, -16
add rax, rsi
mov qword ptr [rbx], rax
Simplifications for base64
#
As far as we know, base64
is the module in OTP that has benefited
the most of the improvements in OTP 25.
Here follows benchmark results for a benchmark included in a Github issue. First the results for OTP 24 on my computer:
== Testing with 1 MB ==
fun base64:encode/1: 1000 iterations in 19805 ms: 50 it/sec
fun base64:decode/1: 1000 iterations in 20075 ms: 49 it/sec
The results for OTP 25 on the same computer:
== Testing with 1 MB ==
fun base64:encode/1: 1000 iterations in 16024 ms: 62 it/sec
fun base64:decode/1: 1000 iterations in 18306 ms: 54 it/sec
In OTP 25, the encoding is done in 80 percent of the time that OTP 24 needs. Decoding is also more than a second faster.
The base64
module has not been modified in OTP 25, so the improvements
are entirely down to improvements in the compiler and the JIT.
Here is the clause of encode_binary/2
in the base64
module that does
most of the work of encoding a binary to Base64:
encode_binary(<<B1:8, B2:8, B3:8, Ls/bits>>, A) ->
BB = (B1 bsl 16) bor (B2 bsl 8) bor B3,
encode_binary(Ls,
<<A/bits,(b64e(BB bsr 18)):8,
(b64e((BB bsr 12) band 63)):8,
(b64e((BB bsr 6) band 63)):8,
(b64e(BB band 63)):8>>).
The binary matching in the function head establishes ranges for the
the variables B1
, B2
, and B3
. (The types for all three variables
will be {t_integer,{0,255}}
.)
Because of the ranges, all of the bsl
, bsr
, band
, and bor
operations that follow do not need any type checks. Also, in the
creation of the binary, there is no need to test whether the binary
creation succeeded because all values are known to be small integers.
The 4 calls to the b64e/1
functions are inlined. The function
looks like this:
-compile({inline, [{b64e, 1}]}).
b64e(X) ->
element(X+1,
{$A, $B, $C, $D, $E, $F, $G, $H, $I, $J, $K, $L, $M, $N,
$O, $P, $Q, $R, $S, $T, $U, $V, $W, $X, $Y, $Z,
$a, $b, $c, $d, $e, $f, $g, $h, $i, $j, $k, $l, $m, $n,
$o, $p, $q, $r, $s, $t, $u, $v, $w, $x, $y, $z,
$0, $1, $2, $3, $4, $5, $6, $7, $8, $9, $+, $/}).
In OTP 25, the JIT will optimize calls to element/2
where the
position argument is an integer and the tuple argument is a literal
tuple. For the way element/2
is used in be64e/1
, all type tests
and range checks will be removed:
# bif_element_jssd
# skipped tuple test since source is always a literal tuple
L302:
long mov rsi, 9223372036854775807
mov rdi, qword ptr [rbx+24]
lea rcx, qword ptr [rsi-2]
# skipped test for small position since it is always small
mov rax, rdi
sar rax, 4
# skipped check for position =:= 0 since it is always >= 1
# skipped check for negative position and position beyond tuple
mov rax, qword ptr [rcx+rax*8]
L300:
L301:
mov qword ptr [rbx+24], rax
That is 7 instructions with no conditional branches.
Please try this at home! #
If you want to follow along and examine the native code for loaded modules, start the runtime system like this:
erl +JDdump true
The native code for all modules that are loaded will be dumped to files with the
extension .asm
.
To find code that has been simplified by the JIT, use this command:
egrep "simplified|skipped|without overflow" *.asm
To examine the BEAM code for a module, use the -S
option. For example:
erlc -S base64.erl
Pull requests #
Here are the main pull requests that implement type-based optimizations: