[erlang-bugs] R12B-3: string:to_integer() sporadic failures

Lev Walkin vlm@REDACTED
Tue Sep 2 09:21:13 CEST 2008


Denis just gave me an additional piece of information. It seems
that this heap rehashing happens during a context switch.
For instance, when adding a debugging code like this

io:format("ProcInfo=[~p]~n",[process_info(self(),reductions)]),

it shows that the number of reductions before and after the faulty
string:to_integer/1 is, respectively, below and above 1000 reductions.

This points to an intervening context switch, since, according to
documentation, a forced context switch happens every 1000 reductions.

	In  the  Beam  emulator,  the reduction  counter is normally
	incremented by one for each function and BIF call, and a context
	switch  is  forced  when  the counter reaches 1000.
		Source: erl -man erlang

We have not been able to see whether the subsequent fault happens
at the next (N mod 1000) reductions mark though, but it is likely.


Edwin Fine wrote:
> You did is a great piece of detective work. This really piqued my 
> interest and I did a bit of playing around.
> 
> It seems that the failures always occur at the same point in the 
> sequence. What I did is add a sequence number to each conversion, then 
> print them out. I conjectured that the failed values are actually all 
> integers but internally incorrectly represented so that the inequality 
> test fails. I also looked at the C code for the BIF for 
> string:to_integer, which was ... interesting.
> 
> The bottom line is that every time I ran the test code, the conversions 
> always failed at the same points (ran on Windows XP SP3, Erlang R12B-3). 
> I tried varying the erl parameters (turning off SMP, fiddling with 
> memory allocation, etc) but to no avail.
> 
> 46> nfail2:test(1000000).
> 6 bad conversions: [{308,{134217728,",\n"}},
>                     {500,{134217728,",\n"}},
>                     {1313,{134217728,",\n"}},
>                     {3589,{134217728,",\n"}},
>                     {5631,{134217728,",\n"}},
>                     {61947,{134217728,",\n"}}]
> 
> The only thing I could think of that would vary as the code ran is the 
> memory allocation. From lookoing at the C code, I could see that bignums 
> need heap allocation (a macro called HAlloc). If there was something in 
> the to_integer C code that messed up the process heap (by growing the 
> heap by too small an amount) so that every now and then, data was 
> written a tiny bit past the end of the heap without crashing Erlang, 
> this could account for the behavior. The reason the numbers are always 
> in the same places is because there are heap reallocations as the heap 
> grows, and if the bug is consistent, it will always screw up in the same 
> places.
> 
> To test this theory, I started Erlang with the +h flag that sets the 
> heap for each process to an initial size, to see if delaying heap 
> allocation fixed the problem.
> 
> It did fix the problem, sort of, well, it postponed it, as you will see.
> 
> G:\misc>erl +h 10000
> Eshell V5.6  (abort with ^G)
> 1> nfail2:test(10000).
> 0 bad conversions: []
> 0
> 2> nfail2:test(100000).
> 2 bad conversions: [{5463,{134217728,",\n"}},{8964,{134217728,",\n"}}]
> 2
> 
> Notice that the places in which the error occurs are now different.
> 
> Hope this helps.
> 
> 
> 2008/9/1 Lev Walkin <vlm@REDACTED <mailto:vlm@REDACTED>>
> 
> 
>     Hi,
> 
> 
>     we all love string to integer conversion routines, such as
>     string:to_integer/1 or erlang:list_to_integer/1.
> 
>     The functions serve us well and indeed provide us with advertised
>     functionality almost every time.
> 
>     However, we've noticed some oddity on our production system which,
>     after two weeks of jaw-dropping musings, has boiled down to a strange
>     idempotence violation in the string:to_integer/1 implementation.
> 
>     To those of you impatient enough, please look into the code
>     attached and try running it as nfail:test(10000) and go down from
>     there.
> 
>     Here's that string:to_integer/1 function, having the following
>     signature:
> 
>        string:to_integer(String) -> {Int,Rest} | {error,Reason}
> 
>     Here's a typical invocation resulting in a number and the
>     rest of the unparsed string returned:
> 
>     36> string:to_integer("134217728,\n").
>     {134217728,",\n"}
>     37>
> 
>     What would happen if we did it once more?
> 
>     37> string:to_integer("134217728,\n").
>     {134217728,",\n"}
>     38>
> 
>     By this time we can be reasonably sure that string:to_integer/1 will
>     return the same output given the same input. We have seen that this
>     is indeed the case by testing it twice. Could it be that testing
>     it N times would result in a bad behavior? Nah, unlikely, you say.
> 
>     If you haven't looked at the attached code, it is time to do so.
>     In the code, we create a list of N results of the string:to_integer/1
>     application, like this:
> 
>            iterate(0, Acc) -> Acc;
>            iterate(N, Acc) ->
>                    iterate(N - 1,
>                            [case string:to_integer("134217728,\n") of
>                                    {Int, _} -> Int
>                            end | Acc]).
> 
>     This code utilizes tail recursion with an accumulator list
>     which gets prepended N times by string:to_integer/1 output,
>     undoubtedly an integer. (Implementation with a map over lists:seq()
>     output can do as well).
> 
>     So we spawn and run this iterate/2 function, appropriately checking
>     that the list consists only of integers with value 134217728:
> 
>     test(N) ->
>            {Self, Ref} = {self(), make_ref()},
>            spawn_opt(fun()->
>                    L = iterate(N, []),
>                    % Filter out non-conforming entries
>                    BadList = [X || X <- L, X =/= 134217728],
>                    BadLen = length(BadList),
>                    % Here, BadLen should always be 0!
>                    io:format("~b bad conversions: ~p~n", [BadLen, BadList]),
>                    Self ! {done, Ref, BadLen}
>                    end,
>                    [link,{fullsweep_after,0}]),
>            receive {done, Ref, Len} -> Len end.
> 
>     Based on the smoke tests above, this code should always result
>     in something like this:
> 
>            38> nfail:test(100).
>            0 bad conversions: []
>            0
>            39>
> 
>     But let's start testing it thorougly, as if not believing ourselves
>     that such a simple function could possibly fail at times:
> 
>            [vlm@REDACTED:~]> erl
>            Erlang (BEAM) emulator version 5.6.3 [source]
>     [async-threads:0] [kernel-poll:false]
>            
>            Eshell V5.6.3  (abort with ^G)
>            1> c(nfail).
>            {ok,nfail}
>            2> nfail:test(1).
>            0 bad conversions: []
>            0
>            3> nfail:test(2).
>            0 bad conversions: []
>            0
>            4> nfail:test(100).
>            0 bad conversions: []
>            0
>            5> nfail:test(1000).
>            2 bad conversions: [{134217728,",\n"},{134217728,",\n"}]
>            2
>            6> nfail:test(10000).
>            5 bad conversions: [{134217728,",\n"},
>                                {134217728,",\n"},
>                                {134217728,",\n"},
>                                {134217728,",\n"},
>                                {134217728,",\n"}]
>            5
>            7>
>            ...
>            32> nfail:test(810).
>            0 bad conversions: []
>            0
>            33> nfail:test(811).
>            2 bad conversions: [{134217728,",\n"},{134217728,",\n"}]
>            2
>            34> nfail:test(810).
>            0 bad conversions: []
>            0
>            35> nfail:test(811).
>            2 bad conversions: [{134217728,",\n"},{134217728,",\n"}]
>            2
>            36>
> 
> 
>     As you see, string:to_integer/1 consistently generates bad entries
>     which contain {134217728,",\n"} instead of 134217728, but only
>     when the number of invocation reaches about 1000 (811 in this
>     particular sequence of steps).
> 
>     Perhaps this is a platform glitch? The above code was being executed
>     on a ppc (G4) Mac OS X 10.5 with R12B-3 (32-bit) built from scratch.
>     Here's what Sun sparc v9 with Solaris 10 thinks about that test case:
> 
>            [vlm@REDACTED:~]> erl
>            Erlang (BEAM) emulator version 5.6.3 [source]
>     [async-threads:0] [hipe] [kernel-poll:false]
>            
>            Eshell V5.6.3  (abort with ^G)
>            1> c(nfail).
>            {ok,nfail}
>            2> nfail:test(100).
>            1 bad conversions: [{134217728,",\n"}]
>            1
>            3> nfail:test(1000).
>            3 bad conversions:
>     [{134217728,",\n"},{134217728,",\n"},{134217728,",\n"}]
>            3
>            4> nfail:test(10).
>            0 bad conversions: []
>            0
>            5> nfail:test(50).
>            0 bad conversions: []
>            0
>            6> c(nfail, [native]).
>            {ok,nfail}
>            7> nfail:test(500).
>            4 bad conversions: [{134217728,",\n"},
>                                {134217728,",\n"},
>                                {134217728,",\n"},
>                                {134217728,",\n"}]
>            4
>            8>
> 
>     And here's what 64-bit Intel box (am64, FreeBSD 6.3) thinks:
> 
>            [vlm@REDACTED ~]$ erl
>            Erlang (BEAM) emulator version 5.6.3 [source] [64-bit]
>     [async-threads:0] [hipe] [kernel-poll:false]
>            
>            Eshell V5.6.3  (abort with ^G)
>            1> c(nfail).
>            {ok,nfail}
>            2> nfail:test(100).
>            0 bad conversions: []
>            0
>            3> nfail:test(1000).
>            0 bad conversions: []
>            0
>            4> nfail:test(10000).
>            0 bad conversions: []
>            0
>            5> nfail:test(100000).
>            0 bad conversions: []
>            0
>            6> nfail:test(1000000).
>            0 bad conversions: []
>            0
>            7>
> 
>     Oops, it went very well, suprisingly. Perhaps, it is a purely
>     non-Intel chip problem? Let's try it on a non 64bit machine,
>     such as Pentium D under Microsoft Windows Vista™ in 32-bit mode:
> 
>            Erlang (BEAM) emulator version 5.6.3 [smp:2] [async-threads:0]
> 
>            Eshell V5.6.3  (abort with ^G)
>            1> c('c:/tmp/nfail.erl').
>            {ok,nfail}
>            2> nfail:test(100).
>            0 bad conversions: []
>            0
>            3> nfail:test(1000).
>            2 bad conversions: [{134217728,",\n"},{134217728,",\n"}]
>            2
>            4>
> 
>     Aha! See the pattern: 32-bit Erlang installations on many hardware
>     platforms are having very similar problems. They are unable to
>     consistently convert a string containing an integer into an integer
>     value. Incidentally enough, the integer value for which Erlang
>     starts to misbehave is 134217728, which is 2^27. Trying it with
>     value "134217727" or lower does not create this idempotence problem.
>     Trying R11B-5 under Windows Vista™ 32-bit does not exhibit such
>     problem either, so it must be a specific issue to R12.
> 
>     Please advise.
> 
> 
>     P.S. Thanks to Denis Titoruk and Vladimir Serov for investigating
>     this issue and coming up with a short test case.
> 
>     -- 
>     Lev Walkin
>     vlm@REDACTED <mailto:vlm@REDACTED>
> 
>     -module(nfail).
>     -export([test/1]).
> 
>     iterate(0, Acc) -> Acc;
>     iterate(N, Acc) ->
>            iterate(N - 1,
>                    [case string:to_integer("134217728,\n") of
>                            {Int, _} -> Int
>                    end | Acc]).
> 
>     test(N) ->
>            {Self, Ref} = {self(), make_ref()},
>            spawn_opt(fun()->
>                    L = iterate(N, []),
>                    BadList = [X || X <- L, X =/= 134217728],
>                    BadLen = length(BadList),
>                    io:format("~b bad conversions: ~p~n", [BadLen, BadList]),
>                    Self ! {done, Ref, BadLen}
>                    end,
>                    [link,{fullsweep_after,0}]),
>            receive {done, Ref, Len} -> Len end.
> 
>     _______________________________________________
>     erlang-bugs mailing list
>     erlang-bugs@REDACTED <mailto:erlang-bugs@REDACTED>
>     http://www.erlang.org/mailman/listinfo/erlang-bugs
> 
> 
> 
> 
> -- 
> For every expert there is an equal and opposite expert - Arthur C. Clarke
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED
> http://www.erlang.org/mailman/listinfo/erlang-bugs




More information about the erlang-bugs mailing list