[erlang-bugs] Segfault in asn1rt_nif:decode_ber_tlv

Wed Sep 24 16:11:29 CEST 2014

Hi Simon,

As you can see in my late answer to Alex,
we have fixed the bug. But I did not add
an test for a primitive type encoded with
an indefinite length.

https://github.com/erlang/otp/commit/7f385ebd984ed2931daa761819816b3e9da7d63c

/Bjorn

On Mon, Sep 22, 2014 at 3:50 PM, Simon Cornish <zl9d97p02@REDACTED> wrote:
> Hi Alex,
> Your analysis is correct regarding that the underlying fault is
> reading beyond the end of the buffer
> The attached patch solves your problem in the same (although
> stylistically different) way.
> It also corrects the fault that a primitive cannot be encoded using an
> indefinite length. This is both according to X.690 and common sense!
> If I get some breathing room later this week I'll put together a pull
> request for this with some test.
> Regards,
>  Simon
>
>
> On 12 September 2014 14:23, Alex Wilson alex-at-cooperi.net
> |erlang-mail/gmail| <um3p2mgtft@REDACTED> wrote:
>> Hi erlang-bugs,
>>
>> I've found a very interesting segmentation fault bug in
>> asn1rt_nif:decode_ber_tlv.
>>
>> A simple case to reproduce it: just type this at the shell:
>>
>>   asn1rt_nif:decode_ber_tlv(<<"!",16#80>>).
>>
>> It seems like whether or not the segmentation fault manifests varies a lot
>> between different OS, compiler and OTP versions. So far I've reproduced it
>> on:
>>
>>  * R16B03-1 on Mac OSX Mavericks (clang)
>>  * R16B01 on OpenBSD 5.3-stable with gcc 4.2.1
>>  * R16B03-1 on Linux (Fedora) with gcc 4.8.3
>>  * R17B01 on OpenBSD 5.6-current with gcc 4.2.1
>>  * R17 developer build from "maint" branch as of last week, Mac OSX
>> Mavericks (clang)
>>
>> I also asked some random people in #erlang on freenode to try it out and
>> they also reproduced the segfault using my test case.
>>
>> Sometimes it doesn't segfault the first time around, but if you run it a few
>> times at the shell it will do it eventually.
>>
>>
>> Backtrace from R17 maint on Mac OSX:
>>
>> * thread #1: tid = 0x2410f5, 0x00007fff958db1aa
>> libsystem_platform.dylib`_platform_memmove$VARIANT$Nehalem + 458, queue =
>> 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1,
>> address=0x107ffff0)
>>   * frame #0: 0x00007fff958db1aa
>> libsystem_platform.dylib`_platform_memmove$VARIANT$Nehalem + 458
>>     frame #1: 0x000000001037aa49 asn1rt_nif.so`decode_ber_tlv_raw [inlined]
>> ber_decode_begin(env=0x00007fff5fbff548, in_buf=<unavailable>,
>> in_buf_len=<unavailable>, err_pos=<unavailable>) + 80 at asn1_erl_nif.c:854
>>     frame #2: 0x000000001037a9f9
>> asn1rt_nif.so`decode_ber_tlv_raw(env=0x00007fff5fbff548, argc=<unavailable>,
>> argv=<unavailable>) + 41 at asn1_erl_nif.c:1256
>>     frame #3: 0x00000000100e4cbd beam`process_main + 58077 at
>> beam_emu.c:3524
>>     frame #4: 0x00000000100196dd beam`erl_start(argc=21, argv=<unavailable>)
>> + 5997 at erl_init.c:1990
>>     frame #5: 0x0000000010000df9 beam`main(argc=<unavailable>,
>> argv=<unavailable>) + 9 at erl_main.c:29
>>
>> Backtrace from R16B03-1 on Linux (gdb): (no symbols on that machine, sorry,
>> but can clearly see it's the same trace)
>>
>> #0  __memcpy_sse2_unaligned ()
>>     at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:157
>> #1  0x00007f0c48cbdace in ber_decode_begin ()
>>    from /usr/lib64/erlang/lib/asn1-2.0.4/priv/lib/asn1_erl_nif.so
>> #2  0x00007f0c48cbdb6f in decode_ber_tlv_raw ()
>>    from /usr/lib64/erlang/lib/asn1-2.0.4/priv/lib/asn1_erl_nif.so
>> #3  0x0000000000533254 in process_main ()
>> #4  0x00000000004508a0 in erl_start ()
>> #5  0x0000000000434049 in main ()
>>
>>
>> There's a lot of inlining and optimisation going on in this part of the
>> code, which makes it hard to look back and forth between the assembly and C.
>>
>> Anyway, the segfault is caused by running off the end of memory after doing
>> a memcpy with -2 as the length (0xfffffffe). This is because ib_index has
>> gotten to a value of 4 when the in_buf_len is only 2.
>>
>> Using some watchpoints I figured out that ber_decode_value's if (indef == 1)
>> block is responsible for the incrementing of ib_index beyond the end of the
>> binary. I hacked up the following patch:
>>
>>
>> --- a/lib/asn1/c_src/asn1_erl_nif.c
>> +++ b/lib/asn1/c_src/asn1_erl_nif.c
>> @@ -968,16 +968,16 @@ static int ber_decode_value(ErlNifEnv* env,
>> ERL_NIF_TERM *value, unsigned cha
>>      if (indef == 1) { /* in this case it is desireably to check that
>> indefinite length
>>       end bytes exist in inbuffer */
>>         curr_head = enif_make_list(env, 0);
>> -       while (!(in_buf[*ib_index] == 0 && in_buf[*ib_index + 1] == 0)) {
>> -           if (*ib_index >= in_buf_len)
>> -               return ASN1_INDEF_LEN_ERROR;
>> -
>> +       while ((*ib_index + 1 < in_buf_len) &&
>> +               !(in_buf[*ib_index] == 0 && in_buf[*ib_index + 1] == 0)) {
>>             if ((maybe_ret = ber_decode(env, &term, in_buf, ib_index,
>> in_buf_len))
>>                     <= ASN1_ERROR
>>                     )
>>                 return maybe_ret;
>>             curr_head = enif_make_list_cell(env, term, curr_head);
>>         }
>> +       if (*ib_index + 1 >= in_buf_len)
>> +               return ASN1_INDEF_LEN_ERROR;
>>         enif_make_reverse_list(env, curr_head, value);
>>         (*ib_index) += 2; /* skip the indefinite length end bytes */
>>      } else if (form == ASN1_CONSTRUCTED)
>>
>>
>> And it seems to stop it happening with my test case on OSX, at least. It
>> makes two changes -- checking for *ib_index + 1 >= in_buf_len (because it's
>> using in_buf[*ib_index + 1], it should check that _that_ index is valid, not
>> just *ib_index). The second change is to move the check from an "if" inside
>> the loop to a loop condition.
>>
>> The movement to the loop condition seems to be the main difference. Looking
>> at the assembly, it seems like the compiler is reasoning (after inlining the
>> entire ber_decode function into this loop) that it can hoist that if out
>> somehow?... maybe because the ber_decode code will already check it (???)
>>
>> Putting it into the loop condition seems to stop this behaviour. But if I've
>> misdiagnosed the problem (quite possible), then perhaps all I've done is
>> permute the code just enough to stop the optimiser doing this one thing for
>> now, and there are other problems just waiting in the wings...
>>
>> Hoping somebody who knows the asn1 code better than I do (or C in general)
>> might be able to help!
>> _______________________________________________
>> erlang-bugs mailing list
>> erlang-bugs@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-bugs
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED
> http://erlang.org/mailman/listinfo/erlang-bugs

-- 
Björn Gustavsson, Erlang/OTP, Ericsson AB