[erlang-bugs] R12B-3: string:to_integer() sporadic failures

Dan Gudmundsson dgud@REDACTED
Tue Sep 2 09:32:02 CEST 2008


We have found the error and a fix will be included the next release.

/Dan

Lev Walkin wrote:
> Denis just gave me an additional piece of information. It seems
> that this heap rehashing happens during a context switch.
> For instance, when adding a debugging code like this
> 
> io:format("ProcInfo=[~p]~n",[process_info(self(),reductions)]),
> 
> it shows that the number of reductions before and after the faulty
> string:to_integer/1 is, respectively, below and above 1000 reductions.
> 
> This points to an intervening context switch, since, according to
> documentation, a forced context switch happens every 1000 reductions.
> 
> 	In  the  Beam  emulator,  the reduction  counter is normally
> 	incremented by one for each function and BIF call, and a context
> 	switch  is  forced  when  the counter reaches 1000.
> 		Source: erl -man erlang
> 
> We have not been able to see whether the subsequent fault happens
> at the next (N mod 1000) reductions mark though, but it is likely.
> 
> 
> Edwin Fine wrote:
>> You did is a great piece of detective work. This really piqued my 
>> interest and I did a bit of playing around.
>>
>> It seems that the failures always occur at the same point in the 
>> sequence. What I did is add a sequence number to each conversion, then 
>> print them out. I conjectured that the failed values are actually all 
>> integers but internally incorrectly represented so that the inequality 
>> test fails. I also looked at the C code for the BIF for 
>> string:to_integer, which was ... interesting.
>>
>> The bottom line is that every time I ran the test code, the conversions 
>> always failed at the same points (ran on Windows XP SP3, Erlang R12B-3). 
>> I tried varying the erl parameters (turning off SMP, fiddling with 
>> memory allocation, etc) but to no avail.
>>
>> 46> nfail2:test(1000000).
>> 6 bad conversions: [{308,{134217728,",\n"}},
>>                     {500,{134217728,",\n"}},
>>                     {1313,{134217728,",\n"}},
>>                     {3589,{134217728,",\n"}},
>>                     {5631,{134217728,",\n"}},
>>                     {61947,{134217728,",\n"}}]
>>
>> The only thing I could think of that would vary as the code ran is the 
>> memory allocation. From lookoing at the C code, I could see that bignums 
>> need heap allocation (a macro called HAlloc). If there was something in 
>> the to_integer C code that messed up the process heap (by growing the 
>> heap by too small an amount) so that every now and then, data was 
>> written a tiny bit past the end of the heap without crashing Erlang, 
>> this could account for the behavior. The reason the numbers are always 
>> in the same places is because there are heap reallocations as the heap 
>> grows, and if the bug is consistent, it will always screw up in the same 
>> places.
>>
>> To test this theory, I started Erlang with the +h flag that sets the 
>> heap for each process to an initial size, to see if delaying heap 
>> allocation fixed the problem.
>>
>> It did fix the problem, sort of, well, it postponed it, as you will see.
>>
>> G:\misc>erl +h 10000
>> Eshell V5.6  (abort with ^G)
>> 1> nfail2:test(10000).
>> 0 bad conversions: []
>> 0
>> 2> nfail2:test(100000).
>> 2 bad conversions: [{5463,{134217728,",\n"}},{8964,{134217728,",\n"}}]
>> 2
>>
>> Notice that the places in which the error occurs are now different.
>>
>> Hope this helps.
>>
>>
>> 2008/9/1 Lev Walkin <vlm@REDACTED <mailto:vlm@REDACTED>>
>>
>>
>>     Hi,
>>
>>
>>     we all love string to integer conversion routines, such as
>>     string:to_integer/1 or erlang:list_to_integer/1.
>>
>>     The functions serve us well and indeed provide us with advertised
>>     functionality almost every time.
>>
>>     However, we've noticed some oddity on our production system which,
>>     after two weeks of jaw-dropping musings, has boiled down to a strange
>>     idempotence violation in the string:to_integer/1 implementation.
>>
>>     To those of you impatient enough, please look into the code
>>     attached and try running it as nfail:test(10000) and go down from
>>     there.
>>
>>     Here's that string:to_integer/1 function, having the following
>>     signature:
>>
>>        string:to_integer(String) -> {Int,Rest} | {error,Reason}
>>
>>     Here's a typical invocation resulting in a number and the
>>     rest of the unparsed string returned:
>>
>>     36> string:to_integer("134217728,\n").
>>     {134217728,",\n"}
>>     37>
>>
>>     What would happen if we did it once more?
>>
>>     37> string:to_integer("134217728,\n").
>>     {134217728,",\n"}
>>     38>
>>
>>     By this time we can be reasonably sure that string:to_integer/1 will
>>     return the same output given the same input. We have seen that this
>>     is indeed the case by testing it twice. Could it be that testing
>>     it N times would result in a bad behavior? Nah, unlikely, you say.
>>
>>     If you haven't looked at the attached code, it is time to do so.
>>     In the code, we create a list of N results of the string:to_integer/1
>>     application, like this:
>>
>>            iterate(0, Acc) -> Acc;
>>            iterate(N, Acc) ->
>>                    iterate(N - 1,
>>                            [case string:to_integer("134217728,\n") of
>>                                    {Int, _} -> Int
>>                            end | Acc]).
>>
>>     This code utilizes tail recursion with an accumulator list
>>     which gets prepended N times by string:to_integer/1 output,
>>     undoubtedly an integer. (Implementation with a map over lists:seq()
>>     output can do as well).
>>
>>     So we spawn and run this iterate/2 function, appropriately checking
>>     that the list consists only of integers with value 134217728:
>>
>>     test(N) ->
>>            {Self, Ref} = {self(), make_ref()},
>>            spawn_opt(fun()->
>>                    L = iterate(N, []),
>>                    % Filter out non-conforming entries
>>                    BadList = [X || X <- L, X =/= 134217728],
>>                    BadLen = length(BadList),
>>                    % Here, BadLen should always be 0!
>>                    io:format("~b bad conversions: ~p~n", [BadLen, BadList]),
>>                    Self ! {done, Ref, BadLen}
>>                    end,
>>                    [link,{fullsweep_after,0}]),
>>            receive {done, Ref, Len} -> Len end.
>>
>>     Based on the smoke tests above, this code should always result
>>     in something like this:
>>
>>            38> nfail:test(100).
>>            0 bad conversions: []
>>            0
>>            39>
>>
>>     But let's start testing it thorougly, as if not believing ourselves
>>     that such a simple function could possibly fail at times:
>>
>>            [vlm@REDACTED:~]> erl
>>            Erlang (BEAM) emulator version 5.6.3 [source]
>>     [async-threads:0] [kernel-poll:false]
>>            
>>            Eshell V5.6.3  (abort with ^G)
>>            1> c(nfail).
>>            {ok,nfail}
>>            2> nfail:test(1).
>>            0 bad conversions: []
>>            0
>>            3> nfail:test(2).
>>            0 bad conversions: []
>>            0
>>            4> nfail:test(100).
>>            0 bad conversions: []
>>            0
>>            5> nfail:test(1000).
>>            2 bad conversions: [{134217728,",\n"},{134217728,",\n"}]
>>            2
>>            6> nfail:test(10000).
>>            5 bad conversions: [{134217728,",\n"},
>>                                {134217728,",\n"},
>>                                {134217728,",\n"},
>>                                {134217728,",\n"},
>>                                {134217728,",\n"}]
>>            5
>>            7>
>>            ...
>>            32> nfail:test(810).
>>            0 bad conversions: []
>>            0
>>            33> nfail:test(811).
>>            2 bad conversions: [{134217728,",\n"},{134217728,",\n"}]
>>            2
>>            34> nfail:test(810).
>>            0 bad conversions: []
>>            0
>>            35> nfail:test(811).
>>            2 bad conversions: [{134217728,",\n"},{134217728,",\n"}]
>>            2
>>            36>
>>
>>
>>     As you see, string:to_integer/1 consistently generates bad entries
>>     which contain {134217728,",\n"} instead of 134217728, but only
>>     when the number of invocation reaches about 1000 (811 in this
>>     particular sequence of steps).
>>
>>     Perhaps this is a platform glitch? The above code was being executed
>>     on a ppc (G4) Mac OS X 10.5 with R12B-3 (32-bit) built from scratch.
>>     Here's what Sun sparc v9 with Solaris 10 thinks about that test case:
>>
>>            [vlm@REDACTED:~]> erl
>>            Erlang (BEAM) emulator version 5.6.3 [source]
>>     [async-threads:0] [hipe] [kernel-poll:false]
>>            
>>            Eshell V5.6.3  (abort with ^G)
>>            1> c(nfail).
>>            {ok,nfail}
>>            2> nfail:test(100).
>>            1 bad conversions: [{134217728,",\n"}]
>>            1
>>            3> nfail:test(1000).
>>            3 bad conversions:
>>     [{134217728,",\n"},{134217728,",\n"},{134217728,",\n"}]
>>            3
>>            4> nfail:test(10).
>>            0 bad conversions: []
>>            0
>>            5> nfail:test(50).
>>            0 bad conversions: []
>>            0
>>            6> c(nfail, [native]).
>>            {ok,nfail}
>>            7> nfail:test(500).
>>            4 bad conversions: [{134217728,",\n"},
>>                                {134217728,",\n"},
>>                                {134217728,",\n"},
>>                                {134217728,",\n"}]
>>            4
>>            8>
>>
>>     And here's what 64-bit Intel box (am64, FreeBSD 6.3) thinks:
>>
>>            [vlm@REDACTED ~]$ erl
>>            Erlang (BEAM) emulator version 5.6.3 [source] [64-bit]
>>     [async-threads:0] [hipe] [kernel-poll:false]
>>            
>>            Eshell V5.6.3  (abort with ^G)
>>            1> c(nfail).
>>            {ok,nfail}
>>            2> nfail:test(100).
>>            0 bad conversions: []
>>            0
>>            3> nfail:test(1000).
>>            0 bad conversions: []
>>            0
>>            4> nfail:test(10000).
>>            0 bad conversions: []
>>            0
>>            5> nfail:test(100000).
>>            0 bad conversions: []
>>            0
>>            6> nfail:test(1000000).
>>            0 bad conversions: []
>>            0
>>            7>
>>
>>     Oops, it went very well, suprisingly. Perhaps, it is a purely
>>     non-Intel chip problem? Let's try it on a non 64bit machine,
>>     such as Pentium D under Microsoft Windows Vista™ in 32-bit mode:
>>
>>            Erlang (BEAM) emulator version 5.6.3 [smp:2] [async-threads:0]
>>
>>            Eshell V5.6.3  (abort with ^G)
>>            1> c('c:/tmp/nfail.erl').
>>            {ok,nfail}
>>            2> nfail:test(100).
>>            0 bad conversions: []
>>            0
>>            3> nfail:test(1000).
>>            2 bad conversions: [{134217728,",\n"},{134217728,",\n"}]
>>            2
>>            4>
>>
>>     Aha! See the pattern: 32-bit Erlang installations on many hardware
>>     platforms are having very similar problems. They are unable to
>>     consistently convert a string containing an integer into an integer
>>     value. Incidentally enough, the integer value for which Erlang
>>     starts to misbehave is 134217728, which is 2^27. Trying it with
>>     value "134217727" or lower does not create this idempotence problem.
>>     Trying R11B-5 under Windows Vista™ 32-bit does not exhibit such
>>     problem either, so it must be a specific issue to R12.
>>
>>     Please advise.
>>
>>
>>     P.S. Thanks to Denis Titoruk and Vladimir Serov for investigating
>>     this issue and coming up with a short test case.
>>
>>     -- 
>>     Lev Walkin
>>     vlm@REDACTED <mailto:vlm@REDACTED>
>>
>>     -module(nfail).
>>     -export([test/1]).
>>
>>     iterate(0, Acc) -> Acc;
>>     iterate(N, Acc) ->
>>            iterate(N - 1,
>>                    [case string:to_integer("134217728,\n") of
>>                            {Int, _} -> Int
>>                    end | Acc]).
>>
>>     test(N) ->
>>            {Self, Ref} = {self(), make_ref()},
>>            spawn_opt(fun()->
>>                    L = iterate(N, []),
>>                    BadList = [X || X <- L, X =/= 134217728],
>>                    BadLen = length(BadList),
>>                    io:format("~b bad conversions: ~p~n", [BadLen, BadList]),
>>                    Self ! {done, Ref, BadLen}
>>                    end,
>>                    [link,{fullsweep_after,0}]),
>>            receive {done, Ref, Len} -> Len end.
>>
>>     _______________________________________________
>>     erlang-bugs mailing list
>>     erlang-bugs@REDACTED <mailto:erlang-bugs@REDACTED>
>>     http://www.erlang.org/mailman/listinfo/erlang-bugs
>>
>>
>>
>>
>> -- 
>> For every expert there is an equal and opposite expert - Arthur C. Clarke
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> erlang-bugs mailing list
>> erlang-bugs@REDACTED
>> http://www.erlang.org/mailman/listinfo/erlang-bugs
> 
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED
> http://www.erlang.org/mailman/listinfo/erlang-bugs
> 



More information about the erlang-bugs mailing list