[erlang-bugs] R12B-3: string:to_integer() sporadic failures
Dan Gudmundsson
dgud@REDACTED
Tue Sep 2 09:32:02 CEST 2008
We have found the error and a fix will be included the next release.
/Dan
Lev Walkin wrote:
> Denis just gave me an additional piece of information. It seems
> that this heap rehashing happens during a context switch.
> For instance, when adding a debugging code like this
>
> io:format("ProcInfo=[~p]~n",[process_info(self(),reductions)]),
>
> it shows that the number of reductions before and after the faulty
> string:to_integer/1 is, respectively, below and above 1000 reductions.
>
> This points to an intervening context switch, since, according to
> documentation, a forced context switch happens every 1000 reductions.
>
> In the Beam emulator, the reduction counter is normally
> incremented by one for each function and BIF call, and a context
> switch is forced when the counter reaches 1000.
> Source: erl -man erlang
>
> We have not been able to see whether the subsequent fault happens
> at the next (N mod 1000) reductions mark though, but it is likely.
>
>
> Edwin Fine wrote:
>> You did is a great piece of detective work. This really piqued my
>> interest and I did a bit of playing around.
>>
>> It seems that the failures always occur at the same point in the
>> sequence. What I did is add a sequence number to each conversion, then
>> print them out. I conjectured that the failed values are actually all
>> integers but internally incorrectly represented so that the inequality
>> test fails. I also looked at the C code for the BIF for
>> string:to_integer, which was ... interesting.
>>
>> The bottom line is that every time I ran the test code, the conversions
>> always failed at the same points (ran on Windows XP SP3, Erlang R12B-3).
>> I tried varying the erl parameters (turning off SMP, fiddling with
>> memory allocation, etc) but to no avail.
>>
>> 46> nfail2:test(1000000).
>> 6 bad conversions: [{308,{134217728,",\n"}},
>> {500,{134217728,",\n"}},
>> {1313,{134217728,",\n"}},
>> {3589,{134217728,",\n"}},
>> {5631,{134217728,",\n"}},
>> {61947,{134217728,",\n"}}]
>>
>> The only thing I could think of that would vary as the code ran is the
>> memory allocation. From lookoing at the C code, I could see that bignums
>> need heap allocation (a macro called HAlloc). If there was something in
>> the to_integer C code that messed up the process heap (by growing the
>> heap by too small an amount) so that every now and then, data was
>> written a tiny bit past the end of the heap without crashing Erlang,
>> this could account for the behavior. The reason the numbers are always
>> in the same places is because there are heap reallocations as the heap
>> grows, and if the bug is consistent, it will always screw up in the same
>> places.
>>
>> To test this theory, I started Erlang with the +h flag that sets the
>> heap for each process to an initial size, to see if delaying heap
>> allocation fixed the problem.
>>
>> It did fix the problem, sort of, well, it postponed it, as you will see.
>>
>> G:\misc>erl +h 10000
>> Eshell V5.6 (abort with ^G)
>> 1> nfail2:test(10000).
>> 0 bad conversions: []
>> 0
>> 2> nfail2:test(100000).
>> 2 bad conversions: [{5463,{134217728,",\n"}},{8964,{134217728,",\n"}}]
>> 2
>>
>> Notice that the places in which the error occurs are now different.
>>
>> Hope this helps.
>>
>>
>> 2008/9/1 Lev Walkin <vlm@REDACTED <mailto:vlm@REDACTED>>
>>
>>
>> Hi,
>>
>>
>> we all love string to integer conversion routines, such as
>> string:to_integer/1 or erlang:list_to_integer/1.
>>
>> The functions serve us well and indeed provide us with advertised
>> functionality almost every time.
>>
>> However, we've noticed some oddity on our production system which,
>> after two weeks of jaw-dropping musings, has boiled down to a strange
>> idempotence violation in the string:to_integer/1 implementation.
>>
>> To those of you impatient enough, please look into the code
>> attached and try running it as nfail:test(10000) and go down from
>> there.
>>
>> Here's that string:to_integer/1 function, having the following
>> signature:
>>
>> string:to_integer(String) -> {Int,Rest} | {error,Reason}
>>
>> Here's a typical invocation resulting in a number and the
>> rest of the unparsed string returned:
>>
>> 36> string:to_integer("134217728,\n").
>> {134217728,",\n"}
>> 37>
>>
>> What would happen if we did it once more?
>>
>> 37> string:to_integer("134217728,\n").
>> {134217728,",\n"}
>> 38>
>>
>> By this time we can be reasonably sure that string:to_integer/1 will
>> return the same output given the same input. We have seen that this
>> is indeed the case by testing it twice. Could it be that testing
>> it N times would result in a bad behavior? Nah, unlikely, you say.
>>
>> If you haven't looked at the attached code, it is time to do so.
>> In the code, we create a list of N results of the string:to_integer/1
>> application, like this:
>>
>> iterate(0, Acc) -> Acc;
>> iterate(N, Acc) ->
>> iterate(N - 1,
>> [case string:to_integer("134217728,\n") of
>> {Int, _} -> Int
>> end | Acc]).
>>
>> This code utilizes tail recursion with an accumulator list
>> which gets prepended N times by string:to_integer/1 output,
>> undoubtedly an integer. (Implementation with a map over lists:seq()
>> output can do as well).
>>
>> So we spawn and run this iterate/2 function, appropriately checking
>> that the list consists only of integers with value 134217728:
>>
>> test(N) ->
>> {Self, Ref} = {self(), make_ref()},
>> spawn_opt(fun()->
>> L = iterate(N, []),
>> % Filter out non-conforming entries
>> BadList = [X || X <- L, X =/= 134217728],
>> BadLen = length(BadList),
>> % Here, BadLen should always be 0!
>> io:format("~b bad conversions: ~p~n", [BadLen, BadList]),
>> Self ! {done, Ref, BadLen}
>> end,
>> [link,{fullsweep_after,0}]),
>> receive {done, Ref, Len} -> Len end.
>>
>> Based on the smoke tests above, this code should always result
>> in something like this:
>>
>> 38> nfail:test(100).
>> 0 bad conversions: []
>> 0
>> 39>
>>
>> But let's start testing it thorougly, as if not believing ourselves
>> that such a simple function could possibly fail at times:
>>
>> [vlm@REDACTED:~]> erl
>> Erlang (BEAM) emulator version 5.6.3 [source]
>> [async-threads:0] [kernel-poll:false]
>>
>> Eshell V5.6.3 (abort with ^G)
>> 1> c(nfail).
>> {ok,nfail}
>> 2> nfail:test(1).
>> 0 bad conversions: []
>> 0
>> 3> nfail:test(2).
>> 0 bad conversions: []
>> 0
>> 4> nfail:test(100).
>> 0 bad conversions: []
>> 0
>> 5> nfail:test(1000).
>> 2 bad conversions: [{134217728,",\n"},{134217728,",\n"}]
>> 2
>> 6> nfail:test(10000).
>> 5 bad conversions: [{134217728,",\n"},
>> {134217728,",\n"},
>> {134217728,",\n"},
>> {134217728,",\n"},
>> {134217728,",\n"}]
>> 5
>> 7>
>> ...
>> 32> nfail:test(810).
>> 0 bad conversions: []
>> 0
>> 33> nfail:test(811).
>> 2 bad conversions: [{134217728,",\n"},{134217728,",\n"}]
>> 2
>> 34> nfail:test(810).
>> 0 bad conversions: []
>> 0
>> 35> nfail:test(811).
>> 2 bad conversions: [{134217728,",\n"},{134217728,",\n"}]
>> 2
>> 36>
>>
>>
>> As you see, string:to_integer/1 consistently generates bad entries
>> which contain {134217728,",\n"} instead of 134217728, but only
>> when the number of invocation reaches about 1000 (811 in this
>> particular sequence of steps).
>>
>> Perhaps this is a platform glitch? The above code was being executed
>> on a ppc (G4) Mac OS X 10.5 with R12B-3 (32-bit) built from scratch.
>> Here's what Sun sparc v9 with Solaris 10 thinks about that test case:
>>
>> [vlm@REDACTED:~]> erl
>> Erlang (BEAM) emulator version 5.6.3 [source]
>> [async-threads:0] [hipe] [kernel-poll:false]
>>
>> Eshell V5.6.3 (abort with ^G)
>> 1> c(nfail).
>> {ok,nfail}
>> 2> nfail:test(100).
>> 1 bad conversions: [{134217728,",\n"}]
>> 1
>> 3> nfail:test(1000).
>> 3 bad conversions:
>> [{134217728,",\n"},{134217728,",\n"},{134217728,",\n"}]
>> 3
>> 4> nfail:test(10).
>> 0 bad conversions: []
>> 0
>> 5> nfail:test(50).
>> 0 bad conversions: []
>> 0
>> 6> c(nfail, [native]).
>> {ok,nfail}
>> 7> nfail:test(500).
>> 4 bad conversions: [{134217728,",\n"},
>> {134217728,",\n"},
>> {134217728,",\n"},
>> {134217728,",\n"}]
>> 4
>> 8>
>>
>> And here's what 64-bit Intel box (am64, FreeBSD 6.3) thinks:
>>
>> [vlm@REDACTED ~]$ erl
>> Erlang (BEAM) emulator version 5.6.3 [source] [64-bit]
>> [async-threads:0] [hipe] [kernel-poll:false]
>>
>> Eshell V5.6.3 (abort with ^G)
>> 1> c(nfail).
>> {ok,nfail}
>> 2> nfail:test(100).
>> 0 bad conversions: []
>> 0
>> 3> nfail:test(1000).
>> 0 bad conversions: []
>> 0
>> 4> nfail:test(10000).
>> 0 bad conversions: []
>> 0
>> 5> nfail:test(100000).
>> 0 bad conversions: []
>> 0
>> 6> nfail:test(1000000).
>> 0 bad conversions: []
>> 0
>> 7>
>>
>> Oops, it went very well, suprisingly. Perhaps, it is a purely
>> non-Intel chip problem? Let's try it on a non 64bit machine,
>> such as Pentium D under Microsoft Windows Vista™ in 32-bit mode:
>>
>> Erlang (BEAM) emulator version 5.6.3 [smp:2] [async-threads:0]
>>
>> Eshell V5.6.3 (abort with ^G)
>> 1> c('c:/tmp/nfail.erl').
>> {ok,nfail}
>> 2> nfail:test(100).
>> 0 bad conversions: []
>> 0
>> 3> nfail:test(1000).
>> 2 bad conversions: [{134217728,",\n"},{134217728,",\n"}]
>> 2
>> 4>
>>
>> Aha! See the pattern: 32-bit Erlang installations on many hardware
>> platforms are having very similar problems. They are unable to
>> consistently convert a string containing an integer into an integer
>> value. Incidentally enough, the integer value for which Erlang
>> starts to misbehave is 134217728, which is 2^27. Trying it with
>> value "134217727" or lower does not create this idempotence problem.
>> Trying R11B-5 under Windows Vista™ 32-bit does not exhibit such
>> problem either, so it must be a specific issue to R12.
>>
>> Please advise.
>>
>>
>> P.S. Thanks to Denis Titoruk and Vladimir Serov for investigating
>> this issue and coming up with a short test case.
>>
>> --
>> Lev Walkin
>> vlm@REDACTED <mailto:vlm@REDACTED>
>>
>> -module(nfail).
>> -export([test/1]).
>>
>> iterate(0, Acc) -> Acc;
>> iterate(N, Acc) ->
>> iterate(N - 1,
>> [case string:to_integer("134217728,\n") of
>> {Int, _} -> Int
>> end | Acc]).
>>
>> test(N) ->
>> {Self, Ref} = {self(), make_ref()},
>> spawn_opt(fun()->
>> L = iterate(N, []),
>> BadList = [X || X <- L, X =/= 134217728],
>> BadLen = length(BadList),
>> io:format("~b bad conversions: ~p~n", [BadLen, BadList]),
>> Self ! {done, Ref, BadLen}
>> end,
>> [link,{fullsweep_after,0}]),
>> receive {done, Ref, Len} -> Len end.
>>
>> _______________________________________________
>> erlang-bugs mailing list
>> erlang-bugs@REDACTED <mailto:erlang-bugs@REDACTED>
>> http://www.erlang.org/mailman/listinfo/erlang-bugs
>>
>>
>>
>>
>> --
>> For every expert there is an equal and opposite expert - Arthur C. Clarke
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> erlang-bugs mailing list
>> erlang-bugs@REDACTED
>> http://www.erlang.org/mailman/listinfo/erlang-bugs
>
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED
> http://www.erlang.org/mailman/listinfo/erlang-bugs
>
More information about the erlang-bugs
mailing list