[erlang-bugs] Mac OS X - trunc for large float causes ERTS_FP_CHECK_INIT at [...]: detected unhandled FPE at [...]

Wed May 4 10:14:33 CEST 2011

On Tue, 3 May 2011 16:48:20 -0700, Bob Ippolito <bob@REDACTED> wrote:
> On Tue, May 3, 2011 at 3:35 PM, Mikael Pettersson <mikpe@REDACTED> wrote:
> > On Tue, 3 May 2011 07:18:34 -0700, Bob Ippolito <bob@REDACTED> wrote:
> >> On Tue, May 3, 2011 at 1:04 AM, Mikael Pettersson <mikpe@REDACTED> wrote:
> >> > Bob Ippolito writes:
> >> >  > I only see this error on Mac OS X. I have not been able to
> >> >  > reproduce in Linux.
> >> >  >
> >> >  > Here's an example, any number larger than 16#7ffffffffffffe00 will
> >> >  > cause this error.
> >> >  >
> >> >  > Erlang R14B02 (erts-5.8.3) [source] [64-bit] [smp:4:4] [rq:4]
> >> >  > [async-threads:4] [hipe] [kernel-poll:true]
> >> >  >
> >> >  > Eshell V5.8.3  (abort with ^G)
> >> >  > 1> trunc(16#7ffffffffffffdff * 1.0).
> >> >  > 9223372036854774784
> >> >  > 2> trunc(16#7ffffffffffffdff * 1.0).
> >> >  > 9223372036854774784
> >> >  > 3> trunc(16#7ffffffffffffe00 * 1.0).
> >> >  > 9223372036854775808
> >> >  > 4> trunc(16#7ffffffffffffe00 * 1.0).
> >> >  > ERTS_FP_CHECK_INIT at 0x10086210: detected unhandled FPE at
> >> >  > 0x19223372036854775808
> >> >  > 5> trunc(16#7ffffffffffffe00 * 1.0).
> >> >  > ERTS_FP_CHECK_INIT at 0x10086210: detected unhandled FPE at
> >> >  > 0x19223372036854775808
> >> >  > 6> io:format("~s~n", [os:cmd("uname -a")]).
> >> >  > Darwin ba.local 10.7.0 Darwin Kernel Version 10.7.0: Sat Jan 29
> >> >  > 15:17:16 PST 2011; root:xnu-1504.9.37~1/RELEASE_I386 i386
> >> >  >
> >> >  > Here's another example:
> >> >  >
> >> >  > Erlang R14B02 (erts-5.8.3) [source] [64-bit] [smp:4:4] [rq:4]
> >> >  > [async-threads:4] [hipe] [kernel-poll:true]
> >> >  >
> >> >  > Eshell V5.8.3  (abort with ^G)
> >> >  > 1> <<F/float>> = <<67,224,0,0,0,0,0,0>>, trunc(F).
> >> >  > 9223372036854775808
> >> >  > 2> <<F/float>> = <<67,224,0,0,0,0,0,0>>, trunc(F).
> >> >  > ERTS_FP_CHECK_INIT at 0x10083e24: detected unhandled FPE at
> >> >  > 0x19223372036854775808
> >> >  > 3> <<F/float>> = <<67,224,0,0,0,0,0,0>>, trunc(F).
> >> >  > ERTS_FP_CHECK_INIT at 0x10083e24: detected unhandled FPE at
> >> >  > 0x19223372036854775808
> >> >
> >> > It means that the code at 0x19223372036854775808 in the
> >> > Erlang VM needs to use the proper ERTS_FP_CHECK_<foo> macros.
> >> >
> >> > Please attach gdb (or whatever debugger Darwin uses) to a running
> >> > Erlang VM and disassemble the code at 0x19223372036854775808.
> >> > We need to know the name of the enclosing function, and preferably
> >> > also the actual instruction sequence that throws the FPE. If gdb
> >> > can show the exact original source code line then that's even better.
> >> >
> >> > If this is in libc rather than the Erlang VM itself, then we need
> >> > a call trace to identify which code in the VM called out to this
> >> > FP-throwing code.  For that you should probably plant a breakpoint
> >> > at 0x19223372036854775808 and then evaluate one of those Erlang
> >> > expressions above to trigger the FPE.
> >> >
> >>
> >> Well, it's actually saying 0x1, the result of the trunc is
> >> 9223372036854775808 and the message string is truncated to 64
> >> characters which is not enough to show it all. Perhaps the buffer size
> >> in erts_fp_check_init_error should be adjusted.
> >
> > Something in your terminal or email client ate a \r\n sequence after the
> > 0x1 from erts_fp_check_init_error() making it appear glued together with
> > the 9223372036854775808 that the erlang prompt printed.
> 
> Not my terminal or email client, this is a bug in
> erts_fp_check_init_error. It only allocates a 64 byte buffer for the
> error message. The pointer address and the \r\n are eaten because the
> buffer is too small to fit the whole error message. buf[64] is too
> small... the format string itself is already 57 chars (including the
> NULL).

Ah yes. I did see your comment about the short buffer but failed
to connect that with the strange message. The buffer needs to be at
least (calculating..) 89 bytes; making it 96 bytes should suffice.

This means that my comment about 0x1 and the wrong type of SIGFPE
handler was invalid. (0x1 is used as a fake PC value in that case.)
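
The buffer arithmetic above can be sketched in C. This is only an
illustration, not the code in sys/unix/sys_float.c: the format string is
reconstructed from the message shown in this thread, format_fpe_message()
and worst_case_size() are hypothetical helpers, and the addresses are
passed as integers and printed with 0x%lx instead of %p so that the output
is predictable (assuming a 64-bit unsigned long).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format string reconstructed from the message in the thread (an
 * assumption, not copied from sys_float.c). 0x%lx stands in for %p. */
#define FPE_FMT "ERTS_FP_CHECK_INIT at 0x%lx: detected unhandled FPE at 0x%lx\r\n"

/* Hypothetical helper: format the message into buf[cap]. snprintf
 * truncates silently when cap is too small and returns the length the
 * full message would have had (excluding the terminating NUL). */
static int format_fpe_message(char *buf, size_t cap,
                              unsigned long caller, unsigned long fpe_pc)
{
    assert(cap > 0);
    return snprintf(buf, cap, FPE_FMT, caller, fpe_pc);
}

/* Worst case with 64-bit pointers: the fixed text, two addresses of
 * "0x" plus 16 hex digits each, and the terminating NUL. */
static size_t worst_case_size(void)
{
    size_t fixed = strlen("ERTS_FP_CHECK_INIT at ")
                 + strlen(": detected unhandled FPE at ")
                 + strlen("\r\n");
    size_t address = 2 + 16;
    return fixed + 2 * address + 1;
}
```

With a 64-byte buffer and two 10-character addresses, the formatted text
is cut off right after "0x1", which matches the output quoted in this
thread; worst_case_size() evaluates to 89.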

> Maybe you missed it in my previous email, it's not 0x1, it is
> 0x10025433. I showed that by breaking at the function that prints the
> error.
> Breakpoint 1, erts_fp_check_init_error (fpexnp=0x110f2528) at
> sys/unix/sys_float.c:87
> 87      {
> (gdb) p (void*)*fpexnp
> $1 = (void *) 0x10025433

In your previous disassembly that pointed to a cvttsd2siq instruction.
That can probably throw a SIGFPE, but I see similar code in a build on
Linux, and there SIGFPE isn't thrown.
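
To illustrate why these particular inputs reach the conversion with a
value that does not fit, here is a small C sketch. It is not taken from
the Erlang VM and is not a proposed fix: fits_in_int64(), trunc_checked()
and to_double() are hypothetical helpers, and the sketch assumes IEEE 754
doubles with the default round-to-nearest mode. It checks the range
before the cast, because converting an out-of-range double to a signed
64-bit integer is undefined behaviour in C, and it is the case in which
cvttsd2si signals an invalid-operation exception.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* 2^63 as a double: the smallest positive value that no longer fits. */
#define TWO_POW_63 9223372036854775808.0

/* Hypothetical helper: true if truncating x toward zero gives a value
 * representable as int64_t. NaN fails both comparisons. */
static bool fits_in_int64(double x)
{
    return x >= -TWO_POW_63 && x < TWO_POW_63;
}

/* Hypothetical helper: truncate only when the value is in range and
 * report failure otherwise, so the cast is never out of range. */
static bool trunc_checked(double x, int64_t *out)
{
    assert(out != NULL);
    if (!fits_in_int64(x))
        return false;
    *out = (int64_t)x;
    return true;
}

/* Integer-to-double conversion, mirroring N * 1.0 in the shell session. */
static double to_double(int64_t n)
{
    return (double)n;
}
```

Under those assumptions, 16#7ffffffffffffdff rounds down to
9223372036854774784.0 and passes the check, while 16#7ffffffffffffe00
lies exactly halfway between two doubles, rounds up to 2^63, and is
rejected.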

If you attach gdb to a freshly started beam instance, let the process
continue, and evaluate one of those expressions at the Erlang prompt,
then gdb should wake up with a SIGFPE at that location.  At that point
dump parts of the SSE2 state with:

print $mxcsr (SSE control and status flags)
print $xmm1 (the source operand in the failing SSE instruction)

(If the first SIGFPE occurs elsewhere, disassemble that code first, then
adjust the print $xmm1 to match that instruction's source operand.)

/Mikael