[erlang-bugs] Segfault in asn1rt_nif:decode_ber_tlv

Fri Sep 12 14:23:02 CEST 2014

Hi erlang-bugs,

I've found a very interesting segmentation fault bug in 
asn1rt_nif:decode_ber_tlv.

A simple case to reproduce it: just type this at the shell:

   asn1rt_nif:decode_ber_tlv(<<"!",16#80>>).

It seems like whether or not the segmentation fault manifests varies a 
lot between different OS, compiler and OTP versions. So far I've 
reproduced it on:

  * R16B03-1 on Mac OSX Mavericks (clang)
  * R16B01 on OpenBSD 5.3-stable with gcc 4.2.1
  * R16B03-1 on Linux (Fedora) with gcc 4.8.3
  * R17B01 on OpenBSD 5.6-current with gcc 4.2.1
  * R17 developer build from "maint" branch as of last week, Mac OSX 
Mavericks (clang)

I also asked some random people in #erlang on freenode to try it out and 
they also reproduced the segfault using my test case.

Sometimes it doesn't segfault the first time around, but if you run it a 
few times at the shell it will do it eventually.


Backtrace from R17 maint on Mac OSX:

* thread #1: tid = 0x2410f5, 0x00007fff958db1aa 
libsystem_platform.dylib`_platform_memmove$VARIANT$Nehalem + 458, queue 
= 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, 
address=0x107ffff0)
   * frame #0: 0x00007fff958db1aa 
libsystem_platform.dylib`_platform_memmove$VARIANT$Nehalem + 458
     frame #1: 0x000000001037aa49 asn1rt_nif.so`decode_ber_tlv_raw 
[inlined] ber_decode_begin(env=0x00007fff5fbff548, in_buf=<unavailable>, 
in_buf_len=<unavailable>, err_pos=<unavailable>) + 80 at asn1_erl_nif.c:854
     frame #2: 0x000000001037a9f9 
asn1rt_nif.so`decode_ber_tlv_raw(env=0x00007fff5fbff548, 
argc=<unavailable>, argv=<unavailable>) + 41 at asn1_erl_nif.c:1256
     frame #3: 0x00000000100e4cbd beam`process_main + 58077 at 
beam_emu.c:3524
     frame #4: 0x00000000100196dd beam`erl_start(argc=21, 
argv=<unavailable>) + 5997 at erl_init.c:1990
     frame #5: 0x0000000010000df9 beam`main(argc=<unavailable>, 
argv=<unavailable>) + 9 at erl_main.c:29

Backtrace from R16B03-1 on Linux (gdb): (no symbols on that machine, 
sorry, but can clearly see it's the same trace)

#0  __memcpy_sse2_unaligned ()
     at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:157
#1  0x00007f0c48cbdace in ber_decode_begin ()
    from /usr/lib64/erlang/lib/asn1-2.0.4/priv/lib/asn1_erl_nif.so
#2  0x00007f0c48cbdb6f in decode_ber_tlv_raw ()
    from /usr/lib64/erlang/lib/asn1-2.0.4/priv/lib/asn1_erl_nif.so
#3  0x0000000000533254 in process_main ()
#4  0x00000000004508a0 in erl_start ()
#5  0x0000000000434049 in main ()


There's a lot of inlining and optimisation going on in this part of the 
code, which makes it hard to look back and forth between the assembly and C.

Anyway, the segfault is caused by running off the end of memory after 
doing a memcpy with -2 as the length (0xfffffffe). This is because 
ib_index has gotten to a value of 4 when the in_buf_len is only 2.

Using some watchpoints I figured out that ber_decode_value's if (indef 
== 1) block is responsible for the incrementing of ib_index beyond the 
end of the binary. I hacked up the following patch:

--- a/lib/asn1/c_src/asn1_erl_nif.c
+++ b/lib/asn1/c_src/asn1_erl_nif.c
@@ -968,16 +968,16 @@ static int ber_decode_value(ErlNifEnv* env, 
ERL_NIF_TERM *value, unsigned cha
      if (indef == 1) { /* in this case it is desireably to check that 
indefinite length
       end bytes exist in inbuffer */
         curr_head = enif_make_list(env, 0);
-       while (!(in_buf[*ib_index] == 0 && in_buf[*ib_index + 1] == 0)) {
-           if (*ib_index >= in_buf_len)
-               return ASN1_INDEF_LEN_ERROR;
-
+       while ((*ib_index + 1 < in_buf_len) &&
+               !(in_buf[*ib_index] == 0 && in_buf[*ib_index + 1] == 0)) {
             if ((maybe_ret = ber_decode(env, &term, in_buf, ib_index, 
in_buf_len))
                     <= ASN1_ERROR
                     )
                 return maybe_ret;
             curr_head = enif_make_list_cell(env, term, curr_head);
         }
+       if (*ib_index + 1 >= in_buf_len)
+               return ASN1_INDEF_LEN_ERROR;
         enif_make_reverse_list(env, curr_head, value);
         (*ib_index) += 2; /* skip the indefinite length end bytes */
      } else if (form == ASN1_CONSTRUCTED)


And it seems to stop it happening with my test case on OSX, at least. It 
makes two changes -- checking for *ib_index + 1 >= in_buf_len (because 
it's using in_buf[*ib_index + 1], it should check that _that_ index is 
valid, not just *ib_index). The second change is to move the check from 
an "if" inside the loop to a loop condition.

The movement to the loop condition seems to be the main difference. 
Looking at the assembly, it seems like the compiler is reasoning (after 
inlining the entire ber_decode function into this loop) that it can 
hoist that if out somehow?... maybe because the ber_decode code will 
already check it (???)

Putting it into the loop condition seems to stop this behaviour. But if 
I've misdiagnosed the problem (quite possible), then perhaps all I've 
done is permute the code just enough to stop the optimiser doing this 
one thing for now, and there are other problems just waiting in the wings...

Hoping somebody who knows the asn1 code better than I do (or C in 
general) might be able to help!