[erlang-bugs] R12B-3: string:to_integer() sporadic failures
Lev Walkin
vlm@REDACTED
Tue Sep 2 09:21:13 CEST 2008
Denis just gave me an additional piece of information. It seems
that this heap rehashing happens during a context switch.
For instance, when adding a debugging code like this
io:format("ProcInfo=[~p]~n",[process_info(self(),reductions)]),
it shows that the number of reductions before and after the faulty
string:to_integer/1 is, respectively, below and above 1000 reductions.
This points to an intervening context switch, since, according to
documentation, a forced context switch happens every 1000 reductions.
In the Beam emulator, the reduction counter is normally
incremented by one for each function and BIF call, and a context
switch is forced when the counter reaches 1000.
Source: erl -man erlang
We have not been able to see whether the subsequent fault happens
at the next (N mod 1000) reductions mark though, but it is likely.
Edwin Fine wrote:
> You did is a great piece of detective work. This really piqued my
> interest and I did a bit of playing around.
>
> It seems that the failures always occur at the same point in the
> sequence. What I did is add a sequence number to each conversion, then
> print them out. I conjectured that the failed values are actually all
> integers but internally incorrectly represented so that the inequality
> test fails. I also looked at the C code for the BIF for
> string:to_integer, which was ... interesting.
>
> The bottom line is that every time I ran the test code, the conversions
> always failed at the same points (ran on Windows XP SP3, Erlang R12B-3).
> I tried varying the erl parameters (turning off SMP, fiddling with
> memory allocation, etc) but to no avail.
>
> 46> nfail2:test(1000000).
> 6 bad conversions: [{308,{134217728,",\n"}},
> {500,{134217728,",\n"}},
> {1313,{134217728,",\n"}},
> {3589,{134217728,",\n"}},
> {5631,{134217728,",\n"}},
> {61947,{134217728,",\n"}}]
>
> The only thing I could think of that would vary as the code ran is the
> memory allocation. From lookoing at the C code, I could see that bignums
> need heap allocation (a macro called HAlloc). If there was something in
> the to_integer C code that messed up the process heap (by growing the
> heap by too small an amount) so that every now and then, data was
> written a tiny bit past the end of the heap without crashing Erlang,
> this could account for the behavior. The reason the numbers are always
> in the same places is because there are heap reallocations as the heap
> grows, and if the bug is consistent, it will always screw up in the same
> places.
>
> To test this theory, I started Erlang with the +h flag that sets the
> heap for each process to an initial size, to see if delaying heap
> allocation fixed the problem.
>
> It did fix the problem, sort of, well, it postponed it, as you will see.
>
> G:\misc>erl +h 10000
> Eshell V5.6 (abort with ^G)
> 1> nfail2:test(10000).
> 0 bad conversions: []
> 0
> 2> nfail2:test(100000).
> 2 bad conversions: [{5463,{134217728,",\n"}},{8964,{134217728,",\n"}}]
> 2
>
> Notice that the places in which the error occurs are now different.
>
> Hope this helps.
>
>
> 2008/9/1 Lev Walkin <vlm@REDACTED <mailto:vlm@REDACTED>>
>
>
> Hi,
>
>
> we all love string to integer conversion routines, such as
> string:to_integer/1 or erlang:list_to_integer/1.
>
> The functions serve us well and indeed provide us with advertised
> functionality almost every time.
>
> However, we've noticed some oddity on our production system which,
> after two weeks of jaw-dropping musings, has boiled down to a strange
> idempotence violation in the string:to_integer/1 implementation.
>
> To those of you impatient enough, please look into the code
> attached and try running it as nfail:test(10000) and go down from
> there.
>
> Here's that string:to_integer/1 function, having the following
> signature:
>
> string:to_integer(String) -> {Int,Rest} | {error,Reason}
>
> Here's a typical invocation resulting in a number and the
> rest of the unparsed string returned:
>
> 36> string:to_integer("134217728,\n").
> {134217728,",\n"}
> 37>
>
> What would happen if we did it once more?
>
> 37> string:to_integer("134217728,\n").
> {134217728,",\n"}
> 38>
>
> By this time we can be reasonably sure that string:to_integer/1 will
> return the same output given the same input. We have seen that this
> is indeed the case by testing it twice. Could it be that testing
> it N times would result in a bad behavior? Nah, unlikely, you say.
>
> If you haven't looked at the attached code, it is time to do so.
> In the code, we create a list of N results of the string:to_integer/1
> application, like this:
>
> iterate(0, Acc) -> Acc;
> iterate(N, Acc) ->
> iterate(N - 1,
> [case string:to_integer("134217728,\n") of
> {Int, _} -> Int
> end | Acc]).
>
> This code utilizes tail recursion with an accumulator list
> which gets prepended N times by string:to_integer/1 output,
> undoubtedly an integer. (Implementation with a map over lists:seq()
> output can do as well).
>
> So we spawn and run this iterate/2 function, appropriately checking
> that the list consists only of integers with value 134217728:
>
> test(N) ->
> {Self, Ref} = {self(), make_ref()},
> spawn_opt(fun()->
> L = iterate(N, []),
> % Filter out non-conforming entries
> BadList = [X || X <- L, X =/= 134217728],
> BadLen = length(BadList),
> % Here, BadLen should always be 0!
> io:format("~b bad conversions: ~p~n", [BadLen, BadList]),
> Self ! {done, Ref, BadLen}
> end,
> [link,{fullsweep_after,0}]),
> receive {done, Ref, Len} -> Len end.
>
> Based on the smoke tests above, this code should always result
> in something like this:
>
> 38> nfail:test(100).
> 0 bad conversions: []
> 0
> 39>
>
> But let's start testing it thorougly, as if not believing ourselves
> that such a simple function could possibly fail at times:
>
> [vlm@REDACTED:~]> erl
> Erlang (BEAM) emulator version 5.6.3 [source]
> [async-threads:0] [kernel-poll:false]
>
> Eshell V5.6.3 (abort with ^G)
> 1> c(nfail).
> {ok,nfail}
> 2> nfail:test(1).
> 0 bad conversions: []
> 0
> 3> nfail:test(2).
> 0 bad conversions: []
> 0
> 4> nfail:test(100).
> 0 bad conversions: []
> 0
> 5> nfail:test(1000).
> 2 bad conversions: [{134217728,",\n"},{134217728,",\n"}]
> 2
> 6> nfail:test(10000).
> 5 bad conversions: [{134217728,",\n"},
> {134217728,",\n"},
> {134217728,",\n"},
> {134217728,",\n"},
> {134217728,",\n"}]
> 5
> 7>
> ...
> 32> nfail:test(810).
> 0 bad conversions: []
> 0
> 33> nfail:test(811).
> 2 bad conversions: [{134217728,",\n"},{134217728,",\n"}]
> 2
> 34> nfail:test(810).
> 0 bad conversions: []
> 0
> 35> nfail:test(811).
> 2 bad conversions: [{134217728,",\n"},{134217728,",\n"}]
> 2
> 36>
>
>
> As you see, string:to_integer/1 consistently generates bad entries
> which contain {134217728,",\n"} instead of 134217728, but only
> when the number of invocation reaches about 1000 (811 in this
> particular sequence of steps).
>
> Perhaps this is a platform glitch? The above code was being executed
> on a ppc (G4) Mac OS X 10.5 with R12B-3 (32-bit) built from scratch.
> Here's what Sun sparc v9 with Solaris 10 thinks about that test case:
>
> [vlm@REDACTED:~]> erl
> Erlang (BEAM) emulator version 5.6.3 [source]
> [async-threads:0] [hipe] [kernel-poll:false]
>
> Eshell V5.6.3 (abort with ^G)
> 1> c(nfail).
> {ok,nfail}
> 2> nfail:test(100).
> 1 bad conversions: [{134217728,",\n"}]
> 1
> 3> nfail:test(1000).
> 3 bad conversions:
> [{134217728,",\n"},{134217728,",\n"},{134217728,",\n"}]
> 3
> 4> nfail:test(10).
> 0 bad conversions: []
> 0
> 5> nfail:test(50).
> 0 bad conversions: []
> 0
> 6> c(nfail, [native]).
> {ok,nfail}
> 7> nfail:test(500).
> 4 bad conversions: [{134217728,",\n"},
> {134217728,",\n"},
> {134217728,",\n"},
> {134217728,",\n"}]
> 4
> 8>
>
> And here's what 64-bit Intel box (am64, FreeBSD 6.3) thinks:
>
> [vlm@REDACTED ~]$ erl
> Erlang (BEAM) emulator version 5.6.3 [source] [64-bit]
> [async-threads:0] [hipe] [kernel-poll:false]
>
> Eshell V5.6.3 (abort with ^G)
> 1> c(nfail).
> {ok,nfail}
> 2> nfail:test(100).
> 0 bad conversions: []
> 0
> 3> nfail:test(1000).
> 0 bad conversions: []
> 0
> 4> nfail:test(10000).
> 0 bad conversions: []
> 0
> 5> nfail:test(100000).
> 0 bad conversions: []
> 0
> 6> nfail:test(1000000).
> 0 bad conversions: []
> 0
> 7>
>
> Oops, it went very well, suprisingly. Perhaps, it is a purely
> non-Intel chip problem? Let's try it on a non 64bit machine,
> such as Pentium D under Microsoft Windows Vista™ in 32-bit mode:
>
> Erlang (BEAM) emulator version 5.6.3 [smp:2] [async-threads:0]
>
> Eshell V5.6.3 (abort with ^G)
> 1> c('c:/tmp/nfail.erl').
> {ok,nfail}
> 2> nfail:test(100).
> 0 bad conversions: []
> 0
> 3> nfail:test(1000).
> 2 bad conversions: [{134217728,",\n"},{134217728,",\n"}]
> 2
> 4>
>
> Aha! See the pattern: 32-bit Erlang installations on many hardware
> platforms are having very similar problems. They are unable to
> consistently convert a string containing an integer into an integer
> value. Incidentally enough, the integer value for which Erlang
> starts to misbehave is 134217728, which is 2^27. Trying it with
> value "134217727" or lower does not create this idempotence problem.
> Trying R11B-5 under Windows Vista™ 32-bit does not exhibit such
> problem either, so it must be a specific issue to R12.
>
> Please advise.
>
>
> P.S. Thanks to Denis Titoruk and Vladimir Serov for investigating
> this issue and coming up with a short test case.
>
> --
> Lev Walkin
> vlm@REDACTED <mailto:vlm@REDACTED>
>
> -module(nfail).
> -export([test/1]).
>
> iterate(0, Acc) -> Acc;
> iterate(N, Acc) ->
> iterate(N - 1,
> [case string:to_integer("134217728,\n") of
> {Int, _} -> Int
> end | Acc]).
>
> test(N) ->
> {Self, Ref} = {self(), make_ref()},
> spawn_opt(fun()->
> L = iterate(N, []),
> BadList = [X || X <- L, X =/= 134217728],
> BadLen = length(BadList),
> io:format("~b bad conversions: ~p~n", [BadLen, BadList]),
> Self ! {done, Ref, BadLen}
> end,
> [link,{fullsweep_after,0}]),
> receive {done, Ref, Len} -> Len end.
>
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED <mailto:erlang-bugs@REDACTED>
> http://www.erlang.org/mailman/listinfo/erlang-bugs
>
>
>
>
> --
> For every expert there is an equal and opposite expert - Arthur C. Clarke
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED
> http://www.erlang.org/mailman/listinfo/erlang-bugs
More information about the erlang-bugs
mailing list