From luna@REDACTED Fri Dec 1 14:28:34 2006
From: luna@REDACTED (Daniel Luna)
Date: Fri, 1 Dec 2006 14:28:34 +0100 (CET)
Subject: [erlang-bugs] Bug in qlc
Message-ID:

I'm wasting a really good obfuscated Erlang contribution here...

See code below.

qlc "A" works if you remove ", true".

Both work if you remove disc_only_copies.

This error occurs in R11B-2, but not in R11B-1.

/Daniel Luna

%% -*- erlang-indent-level: 2 -*-
-module(qlc_bug).
-compile(export_all).
-include_lib("stdlib/include/qlc.hrl").

provoke_error()->
  application:set_env(mnesia,dir,"/tmp/"),
  mnesia:create_schema([node()]),
  ok = mnesia:start(),
  case mnesia:create_table(blah, [{disc_only_copies,[node()]}, {type,set}]) of
    {atomic,ok} -> ok;
    {aborted,{already_exists,blah}} -> ok
  end,
  A= qlc:q([F || F <-mnesia:table(blah),
                 (F band ((1 bsl 0)) =/= 0),
                 true]),
  {atomic,Result} = mnesia:transaction(fun() -> qlc:eval(A)end),
  case is_list(Result) of
    true -> io:format("List~n",[]);
    false -> io:format("This should not happen!~nResult = ~p~n",[Result])
  end,
  B = qlc:q(["" || I <- mnesia:table(blah),
                   any:function_call(I)]),
  {atomic,BR} = mnesia:transaction(fun()->qlc:eval(B)end),
  case is_list(BR) of
    true -> io:format("List~n",[]);
    false -> io:format("This should not happen!~nResult = ~p~n",[BR])
  end.

From tobbe@REDACTED Mon Dec 4 14:45:33 2006
From: tobbe@REDACTED (Torbjorn Tornkvist)
Date: Mon, 04 Dec 2006 14:45:33 +0100
Subject: [erlang-bugs] Bug in inet_res:gethostbyaddr/2
Message-ID:

From: otp_src_R11B-1/lib/kernel/src/inet_res.erl

Look at the call to inet:stop_timer/1:

gethostbyaddr(IP,Timeout) ->
    Timer = inet:start_timer(Timeout),
    Res = gethostbyaddr_tm(IP,Timer),
    inet:stop_timer(Timeout),
    Res.

I guess the argument 'Timeout' should be 'Timer' instead !?
--Tobbe

From christophe.romain@REDACTED Tue Dec 5 00:47:07 2006
From: christophe.romain@REDACTED (Christophe Romain)
Date: Tue, 5 Dec 2006 00:47:07 +0100
Subject: [erlang-bugs] --enable-hipe on pxa255
Message-ID:

I'm compiling R11B-2 on XScale-PXA255 rev 6 (v5l), gcc-4.1.1,
binutils-2.17.50, libc6-2.5
Target: arm-linux-gnueabi soft-float
uname -m -> armv5tel

when i configure with --enable-hipe, I could not compile erl_bif_info.
the problem is that erts/emulator/armv5tel-unknown-linux-gnu/
erl_atom_table.h does not contains the last 3 lines (don't ask me why):
#define am_fconv_constant make_atom(807)
#define am_inc_stack_0 make_atom(808)
#define am_arm make_atom(809)

finally, link stage did not pass.
the problem is that all erts/emulator/obj/armv5tel-unknown-linux-gnu/
opt/hybrid/hipe_arm*.o are not there.

the whole problem is I have ARCH=noarch defined into erts/emulator/
armv5tel-unknown-linux-gnu/Makefile instead of ARCH=arm

correction:
ARCH is defined by configure, into erts/configure.in we can see:
294 ARCH=noarch
295 case `uname -m` in
[...CUT...]
310 armv5b) ARCH=arm;;
311 armv5teb) ARCH=arm;;
312 esac

here is the point: armv5tel is not listed
may "armv5*) ARCH=arm;;" be a safe patch ?
and what about armv4l ? (does hipe support that architecture)

From mikpe@REDACTED Tue Dec 5 09:29:46 2006
From: mikpe@REDACTED (Mikael Pettersson)
Date: Tue, 5 Dec 2006 09:29:46 +0100 (MET)
Subject: [erlang-bugs] --enable-hipe on pxa255
Message-ID: <200612050829.kB58TkB7002127@harpo.it.uu.se>

On Tue, 5 Dec 2006 00:47:07 +0100, Christophe Romain wrote:
> I'm compiling R11B-2 on XScale-PXA255 rev 6 (v5l), gcc-4.1.1,
> binutils-2.17.50, libc6-2.5
> Target: arm-linux-gnueabi soft-float
> uname -m -> armv5tel
>
> when i configure with --enable-hipe, I could not compile erl_bif_info.
> the problem is that erts/emulator/armv5tel-unknown-linux-gnu/
> erl_atom_table.h does not contains the last 3 lines (don't ask me why):
> #define am_fconv_constant make_atom(807)
> #define am_inc_stack_0 make_atom(808)
> #define am_arm make_atom(809)

They're not there because ARCH=noarch (see below).

> finally, link stage did not pass.
> the problem is that all erts/emulator/obj/armv5tel-unknown-linux-gnu/
> opt/hybrid/hipe_arm*.o are not there.
>
> the whole problem is I have ARCH=noarch defined into erts/emulator/
> armv5tel-unknown-linux-gnu/Makefile instead of ARCH=arm
>
> correction:
> ARCH is defined by configure, into erts/configure.in we can see:
> 294 ARCH=noarch
> 295 case `uname -m` in
> [...CUT...]
> 310 armv5b) ARCH=arm;;
> 311 armv5teb) ARCH=arm;;
> 312 esac
>
> here is the point: armv5tel is not listed
> may "armv5*) ARCH=arm;;" be a safe patch ?

Not at the moment.

> and what about armv4l ? (does hipe support that architecture)

HiPE currently requires ARMv5b, i.e. ARMv5 (or better) in big-endian mode.

Supporting pre-ARMv5 would require generating worse code, something
I really don't want to do unconditionally. There could possibly be
a HiPE compiler target option for selecting ARMv4 output, but that
would also need object code format changes and new checks to verify
object code compatibility before loading it. ARMv4 is ancient so
why bother?

Supporting little-endian would require a mechanism for communicating
the endianness of the target to the compiler. Like ARMv4 it would also
need object code format changes and new compatibility checks in the
loader. I know many ARMs run in little-endian mode, so this change
has some merit; however all my ARMs are big-endian XScales so I wouldn't
be able to test it.
/Mikael

From christophe.romain@REDACTED Tue Dec 5 09:49:37 2006
From: christophe.romain@REDACTED (Christophe Romain)
Date: Tue, 5 Dec 2006 09:49:37 +0100
Subject: [erlang-bugs] --enable-hipe on pxa255
In-Reply-To: <200612050829.kB58TkB7002127@harpo.it.uu.se>
References: <200612050829.kB58TkB7002127@harpo.it.uu.se>
Message-ID: <412D4FF2-9B9E-406C-8A4A-D0683A346C85@process-one.net>

> ARMv4 is ancient so why bother?

this is just my curiosity, I don't use armv4 anyway.

> however all my ARMs are big-endian XScales so I wouldn't be able to
> test it.

i would !

From hans.bolinder@REDACTED Fri Dec 8 08:47:20 2006
From: hans.bolinder@REDACTED (Hans Bolinder)
Date: Fri, 8 Dec 2006 08:47:20 +0100
Subject: [erlang-bugs] Bug in qlc
In-Reply-To:
References:
Message-ID: <17785.6280.103619.45333@gargle.gargle.HOWL>

[Daniel Luna:]
> See code below.
>
> qlc "A" works if you remove ", true".
>
> Both work if you remove disc_only_copies.
>
> This error occurs in R11B-2, but not in R11B-1.

Thanks for the bug report.

It seems that a modification in R11B-2 exposed a bug that has been
present for quite some time. The bug was triggered by the empty Dets
table, but could also show up when no object matches a given match
specification.

The following patch ensures that dets:match{_object}() and
dets:select() never return {[],Cont} which should solve the problem at
hand.

*** /usr/local/otp/releases/otp_beam_solaris8_r11b_patched/lib/stdlib-1.14.2/src/dets.erl Thu Nov 16 17:40:35 2006
--- dets.erl Wed Dec 6 12:43:19 2006
***************
*** 718,723 ****
--- 718,725 ----
                  false ->
                      badarg
              end;
+         [] ->
+             chunk_match(NewState);
          Terms ->
              {Terms, NewState}
      end;

The patch erl_898_otp_beam is available at the Licencees' Area
(http://www.erlang.se/lic_area/index.shtml).
Best regards,

Hans Bolinder, Erlang/OTP

From mikpe@REDACTED Tue Dec 19 21:59:27 2006
From: mikpe@REDACTED (Mikael Pettersson)
Date: Tue, 19 Dec 2006 21:59:27 +0100
Subject: [erlang-bugs] [erlang-questions] R11B-2 on Solaris 10 kpoll problem
In-Reply-To: <45883747.7080102@hq.idt.net>
References: <45883747.7080102@hq.idt.net>
Message-ID: <17800.21167.204377.846614@alkaid.it.uu.se>

Serge Aleynikov writes:
> Hi,
>
> I am experiencing a problem running R11B-2 on Solaris 10 with kpoll enabled:
>
> $ gtar -zxf otp_src_R11B-2.tar.gz
> $ cd otp_src_R11B-2
> $ ./configure --prefix=/opt/erlang/R11B-2
> $ gmake
> $ sudo gmake install
> $ cd /opt/erlang/R11B-2/bin
> $ ./erl
> Erlang (BEAM) emulator version 5.5.2 [source] [async-threads:0] [hipe]
> [kernel-poll:false]
>
> Eshell V5.5.2 (abort with ^G)
> 1> q().
>
> $ ./erl +K true
> {error_logger,{{2006,12,19},{13,57,10}},"~s~n",["erts_poll_wait()
> failed: einval (22)\n"]}
> {error_logger,{{2006,12,19},{13,57,10}},"~s~n",["erts_poll_wait()
> failed: einval (22)\n"]}
> {error_logger,{{2006,12,19},{13,57,10}},"~s~n",["erts_poll_wait()
> failed: einval (22)\n"]}
> {error_logger,{{2006,12,19},{13,57,10}},"~s~n",["erts_poll_wait()
> failed: einval (22)\n"]}
> {error_logger,{{2006,12,19},{13,57,10}},"~s~n",["erts_poll_wait()
> failed: einval (22)\n"]}
> {error_logger,{{2006,12,19},{13,57,10}},"~s~n",["erts_poll_wait()
> failed: einval (22)\n"]}
> {error_logger,{{2006,12,19},{13,57,10}},"~s~n",["erts_poll_wait()
> failed: einval (22)\n"]}
> {error_logger,{{2006,12,19},{13,57,10}},"~s~n",["erts_poll_wait()
> failed: einval (22)\n"]}
> ^C
>
> $ uname -a
> SunOS devstorm10 5.10 Generic_118844-28 i86pc i386 i86pc
>
> Any suggestions?

No suggestion, only to confirm that I see the exact same issue on

SunOS xxxx.xx.xx.xx 5.10 Generic_118855-19 i86pc i386 i86pc

which is an amd64 box, and OTP was built in 64-bit mode.
/Mikael

From rickard.s.green@REDACTED Wed Dec 20 11:49:04 2006
From: rickard.s.green@REDACTED (Rickard Green)
Date: Wed, 20 Dec 2006 11:49:04 +0100
Subject: [erlang-bugs] [erlang-questions] R11B-2 on Solaris 10 kpoll problem
In-Reply-To: <45883747.7080102@hq.idt.net>
References: <45883747.7080102@hq.idt.net>
Message-ID: <45891520.1030806@ericsson.com>

It works fine on our Solaris 10 sparc machines (unfortunately we do not
have any Solaris 10/x86 machines yet) by default, but if I lower max
open files to 256 (or lower) I get the same problem. I googled a bit on
this and apparently Solaris 10 doesn't want the size of the result array
passed in the /dev/poll ioctl to be greater than OPEN_MAX. The size of
the result array in our case is 256. Hopefully I'll find the time to
look closer at this before we release r11b-3.

Increasing max files to something larger than 256 (ulimit -n) will
hopefully work as a workaround for you.

BR,
Rickard Green, Erlang/OTP

Serge Aleynikov wrote:
> Hi,
>
> I am experiencing a problem running R11B-2 on Solaris 10 with kpoll enabled:
>
> [...CUT...]
>
> Any suggestions?
>
> Serge
>

From matthew.reilly@REDACTED Wed Dec 20 23:48:27 2006
From: matthew.reilly@REDACTED (Matthew Reilly)
Date: Wed, 20 Dec 2006 14:48:27 -0800
Subject: [erlang-bugs] Buffer overflow in etrs_save_mb[]
Message-ID: <1166654907.1243.229.camel@matt.sipphone.com>

Erlang/OTP team,

There is a buffer overflow that can occur in the beam interpreter in
the erts_save_mb[] array. This can occur if a module contains a large
number of binary matches. This buffer overflow is known to exist for all
Erlang/OTP versions between R10B-6 and R11B-2. Sample code is attached
that causes a SEGV (at least on Linux systems).

The source of the overflow is the opcode "bs_save".

The code in BEAM that interprets this opcode is in
erts/emulator/beam/beam_emu.c:process_main():
    OpCase(bs_save_I): {
        Eterm* next;

        PreFetch(1, next);
        erts_save_mb[Arg(0)] = erts_mb;
        NextPF(1, next);
    }

This saves the current value of "erts_mb" into the array
"erts_save_mb[1024]" at an offset of "Arg(0)". This code does not do
any bounds checking on this array.
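
For illustration only, the missing check could look roughly like the
sketch below. It is not the ERTS source: match_buffer, SAVE_MB_SLOTS,
current_mb, saved_mb and checked_bs_save are hypothetical stand-ins for
the emulator's match-buffer type, the 1024-entry erts_save_mb[] array,
erts_mb and the bs_save handler, and it assumes that refusing an
out-of-range index with an error code is an acceptable reaction.

```c
#include <stddef.h>

/* Hypothetical stand-in for the size of erts_save_mb[] (1024 slots). */
#define SAVE_MB_SLOTS 1024

/* Hypothetical, simplified match-buffer state; the real structure in
 * ERTS has its own layout. */
typedef struct {
    const unsigned char *base;  /* start of the binary being matched */
    size_t offset;              /* current match position */
    size_t size;                /* total size of the binary */
} match_buffer;

static match_buffer current_mb;              /* plays the role of erts_mb */
static match_buffer saved_mb[SAVE_MB_SLOTS]; /* plays the role of erts_save_mb[] */

/* Save the current match state in slot `index`.
 * Returns 0 on success and -1 if `index` lies outside the save array,
 * in which case nothing is written. */
int checked_bs_save(unsigned long index)
{
    if (index >= SAVE_MB_SLOTS) {
        return -1;
    }
    saved_mb[index] = current_mb;
    return 0;
}
```

With a check of this kind, an instruction such as {bs_save,1040} would
be refused (checked_bs_save(1040) returns -1) rather than writing past
the end of the array.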
If a compiled ".beam" file has an opcode such as "{bs_save,1040}", then
the beam interpreter's memory will be corrupted.

The erlang byte-code compiler can easily generate arguments to "bs_save"
that are greater than 1023. If a module contains matching to static
binary values, the compiler generates code for each byte in the binaries
such as:
    ...
    {bs_save,1040}.
    {test,is_eq_exact,{f,1},[{x,12},{integer,21}]}.
    {bs_restore,1040}.
    ...
    {bs_save,1041}.
    {test,is_eq_exact,{f,1},[{x,13},{integer,182}]}.
    {bs_restore,1041}.
    ...

Even though the saved value is never referenced again, the compiler
increments the arg to "bs_save" and it does so without checking if the
argument exceeds the size of the "erts_save_mb[]" array.

I have not tried this code with HiPE, but since HiPE uses the function
erts/emulator/beam/erl_bits.c:erts_bs_save() (which also does not check
the bounds of erts_save_mb), it most likely has the same bug:
    void erts_bs_save(ERL_BITS_PROTO_1(int index))
    {
        erts_save_mb[index] = erts_mb;
    }

Most likely, to avoid this bug:

- the compiler should re-use args to "bs_save" (especially where it
  immediately does a "bs_restore") instead of just incrementing it for
  each byte.

- the beam runtime should check the offsets for "bs_save" at beam file
  load time to make sure it is < MAX_REG (perhaps in beam_validator?)

Thank you,
Matt Reilly
SIPphone Inc.

%%%%%%%%%%%% Begin buffer_overflow.erl %%%%%%%%%%%%%
%% Test with:
%% $ erlc buffer_overflow.erl
%% $ erl -s buffer_overflow crash
-module(buffer_overflow).

-export([
         crash/0
        ]).

m(<<196,202,66,56,160,185,35,130,13,204,80,154,111,117,132,155>>) -> 1; m(<<200,30,114,141,157,76,47,99,111,6,127,137,204,20,134,44>>) -> 2; m(<<236,203,200,126,75,92,226,254,40,48,143,217,242,167,186,243>>) -> 3; m(<<168,127,246,121,162,243,231,29,145,129,166,123,117,66,18,44>>) -> 4; m(<<228,218,59,127,187,206,35,69,215,119,43,6,116,163,24,213>>) -> 5; m(<<22,121,9,28,90,136,15,175,111,181,230,8,126,177,178,220>>) -> 6; m(<<143,20,228,95,206,234,22,122,90,54,222,221,75,234,37,67>>) -> 7; m(<<201,240,248,149,251,152,171,145,89,245,31,208,41,126,35,109>>) -> 8; m(<<69,196,140,206,46,45,127,189,234,26,252,81,199,198,173,38>>) -> 9; m(<<211,217,68,104,2,164,66,89,117,93,56,230,209,99,232,32>>) -> 10; m(<<101,18,189,67,217,202,166,224,44,153,11,10,130,101,45,202>>) -> 11; m(<<194,10,212,215,111,233,119,89,170,39,160,201,155,255,103,16>>) ->12; m(<<197,28,228,16,193,36,161,14,13,181,228,185,127,194,175,57>>) -> 13; m(<<170,179,35,137,34,188,194,90,111,96,110,181,37,255,220,86>>) -> 14; m(<<155,243,28,127,240,98,147,106,150,211,200,189,31,143,47,243>>) ->15; m(<<199,77,151,176,30,174,37,126,68,170,157,91,173,233,123,175>>) -> 16; m(<<112,239,223,46,201,176,134,7,151,149,196,66,99,107,85,251>>) -> 17; m(<<111,73,34,244,85,104,22,26,140,223,74,210,41,159,109,35>>) -> 18; m(<<31,14,61,173,153,144,131,69,247,67,159,143,250,189,255,196>>) -> 19; m(<<152,241,55,8,33,1,148,196,117,104,123,230,16,106,59,132>>) -> 20; m(<<60,89,220,4,142,136,80,36,59,232,7,154,92,116,208,121>>) -> 21; m(<<182,215,103,210,248,237,93,33,164,75,14,88,134,104,12,185>>) -> 22; m(<<55,105,60,252,116,128,73,228,93,135,184,199,216,185,170,205>>) ->23; m(<<31,241,222,119,64,5,248,218,19,244,41,67,136,28,101,95>>) -> 24; m(<<142,41,106,6,122,55,86,51,112,222,208,95,90,59,243,236>>) -> 25; m(<<78,115,44,237,52,99,208,109,224,202,154,21,182,21,54,119>>) -> 26; m(<<2,231,79,16,224,50,122,216,104,209,56,242,180,253,214,240>>) -> 27; m(<<51,231,95,240,157,214,1,187,230,159,53,16,57,21,33,137>>) -> 28; 
m(<<110,169,171,27,170,14,251,158,25,9,68,64,195,23,226,27>>) -> 29; m(<<52,23,60,179,143,7,248,157,219,235,194,172,145,40,48,63>>) -> 30; m(<<193,106,83,32,250,71,85,48,217,88,60,52,253,53,110,245>>) -> 31; m(<<99,100,211,240,244,149,182,171,157,207,141,59,92,110,11,1>>) -> 32; m(<<24,43,224,197,205,205,80,114,187,24,100,205,238,77,61,110>>) -> 33; m(<<227,105,133,61,247,102,250,68,225,237,15,246,19,245,99,189>>) -> 34; m(<<28,56,60,211,11,124,41,138,181,2,147,173,254,203,123,24>>) -> 35; m(<<25,202,20,231,234,99,40,164,46,14,177,61,88,94,76,34>>) -> 36; m(<<165,191,201,224,121,100,248,221,222,185,95,197,132,205,150,93>>)->37; m(<<165,119,27,206,147,226,0,195,111,124,217,223,208,229,222,170>>)->38; m(<<214,125,138,180,244,193,11,242,42,163,83,226,120,121,19,60>>) -> 39; m(<<214,69,146,14,57,95,237,173,123,187,237,14,202,63,226,224>>) -> 40; m(<<52,22,167,95,76,234,145,9,80,124,172,216,226,242,174,252>>) -> 41; m(<<161,208,198,232,63,2,115,39,216,70,16,99,244,172,88,166>>) -> 42; m(<<23,230,33,102,252,133,134,223,164,209,188,14,23,66,192,139>>) -> 43; m(<<247,23,113,99,200,51,223,244,179,143,200,210,135,47,30,198>>) -> 44; m(<<108,131,73,204,114,96,174,98,227,177,57,104,49,168,57,143>>) -> 45; m(<<217,212,244,149,232,117,162,224,117,161,164,166,225,185,119,15>>)->46; m(<<103,198,161,231,206,86,211,214,250,116,138,182,217,175,63,215>>)->47; m(<<100,46,146,239,183,148,33,115,72,129,181,62,30,27,24,182>>) -> 48; m(<<244,87,197,69,169,222,216,143,24,236,238,71,20,90,114,192>>) -> 49; m(<<192,199,199,109,48,189,61,202,239,201,111,64,39,91,220,10>>) -> 50; m(<<40,56,2,58,119,141,250,236,220,33,39,8,247,33,183,136>>) -> 51; m(<<154,17,88,21,77,250,66,202,221,189,6,148,164,233,189,200>>) -> 52; m(<<216,44,141,22,25,173,129,118,214,101,69,60,251,46,85,240>>) -> 53; m(<<166,132,236,238,231,111,197,34,119,50,134,168,149,188,132,54>>)->54; m(<<181,59,58,61,106,185,12,224,38,130,41,21,28,155,222,17>>) -> 55; m(<<159,97,64,142,58,251,99,62,80,205,241,178,13,230,244,102>>) -> 56; 
m(<<114,179,42,31,117,75,161,192,155,54,149,224,203,108,222,127>>) ->57; m(<<102,240,65,225,106,96,146,139,5,167,226,40,168,156,55,153>>) -> 58; m(<<9,63,101,224,128,162,149,248,7,107,28,87,34,164,106,162>>) -> 59; m(<<7,43,3,11,161,38,178,244,178,55,79,52,43,233,237,68>>) -> 60; m(<<127,57,248,49,127,189,177,152,142,244,198,40,235,160,37,145>>) ->61; m(<<68,246,131,168,65,99,179,82,58,254,87,194,224,8,188,140>>) -> 62; m(<<3,175,219,214,110,121,41,177,37,248,89,120,52,250,131,164>>) -> 63; m(<<234,93,47,28,70,8,35,46,7,211,170,61,153,142,81,53>>) -> 64; m(<<252,73,12,164,92,0,177,36,155,190,53,84,164,253,246,251>>) -> 65; m(<<50,149,199,106,203,244,202,174,211,60,54,177,181,252,44,177>>) ->66; m(<<115,91,144,180,86,129,37,237,108,63,103,136,25,182,224,88>>) -> 67; m(<<163,243,144,216,142,76,65,242,116,123,250,47,27,95,135,219>>) -> 68; m(<<20,191,166,187,20,135,94,69,187,160,40,162,30,211,128,70>>) -> 69; m(<<124,187,196,9,236,153,15,25,199,140,117,189,30,6,242,21>>) -> 70; m(<<226,196,32,217,40,212,191,140,224,255,46,193,155,55,21,20>>) -> 71; m(<<50,187,144,232,151,106,171,82,152,213,218,16,254,102,242,29>>) ->72; m(<<210,221,234,24,240,6,101,206,134,35,227,107,212,227,199,197>>) ->73; m(<<173,97,171,20,50,35,239,188,36,199,210,88,59,230,146,81>>) -> 74; m(<<208,155,244,21,68,163,54,90,70,201,7,126,187,94,53,195>>) -> 75; m(<<251,215,147,157,103,73,151,205,180,105,45,52,222,134,51,196>>) ->76; m(<<40,221,44,121,85,206,146,100,86,36,11,47,240,16,11,222>>) -> 77; m(<<53,244,168,212,101,230,225,237,192,95,61,138,182,88,197,81>>) -> 78; m(<<209,254,23,61,8,233,89,57,122,223,52,177,215,126,136,215>>) -> 79; m(<<240,51,171,55,195,2,1,247,63,20,36,73,208,55,2,141>>) -> 80; m(<<67,236,81,125,104,182,237,211,1,91,62,220,154,17,54,123>>) -> 81; m(<<151,120,213,210,25,197,8,11,154,106,23,190,240,41,51,28>>) -> 82; m(<<254,159,194,137,195,255,10,241,66,182,211,190,173,152,169,35>>)->83; m(<<104,211,10,149,148,114,139,195,154,162,75,233,75,49,157,33>>) -> 84; 
m(<<62,248,21,65,111,119,80,152,254,151,112,4,1,92,97,147>>) -> 85;
m(<<147,219,133,237,144,156,19,131,143,249,92,207,169,76,235,217>>)->86;
m(<<199,225,36,159,252,3,235,157,237,144,140,35,107,209,153,109>>) ->87;
m(<<42,56,164,169,49,108,73,229,168,51,81,124,69,211,16,112>>) -> 88;
m(<<118,71,150,107,115,67,194,144,72,103,50,82,228,144,247,54>>) -> 89;
m(<<134,19,152,94,196,158,184,247,87,174,100,57,232,121,187,42>>) -> 90;
m(<<84,34,154,191,207,165,100,158,112,3,184,61,212,117,82,148>>) -> 91;
m(<<146,204,34,117,50,209,126,86,224,121,2,178,84,223,173,16>>) -> 92;
m(<<152,220,232,61,165,123,3,149,225,99,70,124,157,174,82,27>>) -> 93;
m(<<244,185,236,48,173,159,104,248,155,41,99,151,134,203,98,239>>) ->94;
m(<<129,43,75,162,135,245,238,11,201,212,59,191,91,190,135,251>>) -> 95;
m(<<38,101,125,95,249,2,13,42,190,254,85,135,150,185,149,132>>) -> 96;
m(<<226,239,82,79,191,61,159,230,17,213,168,233,15,239,220,156>>) -> 97;
m(<<237,61,44,33,153,30,59,239,94,6,151,19,175,159,166,202>>) -> 98;
m(<<172,98,122,177,204,189,182,46,201,110,112,47,7,246,66,91>>) -> 99;
m(<<248,153,19,157,245,225,5,147,150,67,20,21,231,112,198,221>>) -> 100.

crash() ->
    lists:foreach(
      fun(N) ->
              io:format("Trying to match md5(~B)~n",[N]),
              BinaryToMatch = erlang:md5(integer_to_list(N)),

              %% The pattern matching on binaries for this call
              %% will cause buffer overflow.
              m(BinaryToMatch)
      end,
      lists:seq(1,100)
     ).
%%%%%%%%%%%% End buffer_overflow.erl %%%%%%%%%%%%%

-- 
Matthew Reilly
SIPphone Inc.
matthew.reilly@REDACTED
Gizmo Project name: matt

From bjorn@REDACTED Thu Dec 21 15:21:15 2006
From: bjorn@REDACTED (Bjorn Gustavsson)
Date: 21 Dec 2006 15:21:15 +0100
Subject: [erlang-bugs] Buffer overflow in etrs_save_mb[]
In-Reply-To: <1166654907.1243.229.camel@matt.sipphone.com>
References: <1166654907.1243.229.camel@matt.sipphone.com>
Message-ID:

Thanks for your bug report!

R11B has new instructions for bit syntax matching, so the problem
doesn't exist there, unless you load code compiled by an R10B compiler.

For R11B-3, I have changed the loader to reject bs_save/bs_restore
instructions with an index greater than 1023 if you load code
compiled by R10B. I have also changed beam_validator to reject those
instructions (in case it is used for validating R10B code).

/Bjorn

Matthew Reilly writes:

> Erlang/OTP team,
>
> There is a buffer overflow that can occur in the beam interpreter in
> the erts_save_mb[] array. This can occur if a module contains a large
> number of binary matches. This buffer overflow is known to exist for all
> Erlang/OTP versions between R10B-6 and R11B-2. Sample code is attached
> that causes a SEGV (at least on Linux systems).
>
> [...CUT...]

-- 
Björn Gustavsson, Erlang/OTP, Ericsson AB

From matthias@REDACTED Fri Dec 22 08:58:48 2006
From: matthias@REDACTED (Matthias Lang)
Date: Fri, 22 Dec 2006 08:58:48 +0100
Subject: [erlang-bugs] Buffer overflow in etrs_save_mb[]
In-Reply-To:
References: <1166654907.1243.229.camel@matt.sipphone.com>
Message-ID: <17803.36920.746993.371437@antilipe.corelatus.se>

Bjorn Gustavsson writes:

> For R11B-3, I have changed the loader to reject bs_save/bs_restore
> instruction with an index greater than 1023 if you load code
> compiled by R10B.

That left me wondering whether any of my existing code was a time bomb
or not. So I wrote a little program to check for such opcodes in
existing .beam files. It might be useful for others too.

1> check_bs:dir("/home/matthias").
Checking /home/matthias/buffer_overflow.beam
** exited: "beam file contains bs_save instruction with argument > 1023" **

Matthias

----------------------------------------------------------------------
%% Multiple versions of R11-B and R10B have a bug which results in
%% the emulator corrupting its memory and then, probably, segfaulting.
%%
%% See erlang-bugs 2006-12-21
%%
%% This module checks beam files to make sure they don't contain
%% code that triggers the bug. Exits if it finds one.
%%
-module(check_bs).
-export([file/1, files/1, dir/1]).

%% check one beam
file(Filename) ->
    io:fwrite("Checking ~s\n", [Filename]),
    Dis = beam_disasm:file(Filename),
    top_level(Dis),
    no_worries.

files(Filenames) ->
    lists:foreach(fun file/1, Filenames),
    no_worries.

%% check all beams in a given path
dir(Path) ->
    {ok, Files} = file:list_dir(Path),
    Beams = [Path ++ "/" ++ X || X <- Files, maeb_si(lists:reverse(X))],
    files(Beams).

maeb_si("maeb." ++ _) -> true;
maeb_si(_) -> false.

%%--------------------
top_level({beam_file, Chunks}) ->
    [Code] = [X || {code, X} <- Chunks],
    lists:foreach(fun function/1, Code).

function({function, _Name, _Arity, _, Opcodes}) ->
    lists:foreach(fun opcode/1, Opcodes).

opcode({bs_save, Arg}) when Arg < 1024 ->
    no_problem_mate;
opcode({bs_save, _Arg}) ->
    exit("beam file contains bs_save instruction with argument > 1023");
opcode(_) ->
    do_nothing.

%% eof

From dmitry.kargapolov@REDACTED Fri Dec 22 20:52:27 2006
From: dmitry.kargapolov@REDACTED (Dmitriy Kargapolov)
Date: Fri, 22 Dec 2006 14:52:27 -0500
Subject: [erlang-bugs] R11B-2 SMP Timer Race Condition Bug [Re: bug in timer:sleep/1 smp implementation (R11B-0)]
In-Reply-To: <17553.35499.634400.359802@alkaid.it.uu.se>
References: <44917D65.4040703@corp.idt.net> <17553.35499.634400.359802@alkaid.it.uu.se>
Message-ID: <458C377B.3050603@corp.idt.net>

Unfortunately I cannot create a standalone test for this bug, even
though I have come much closer to understanding the effect.
This bug appears only in a highly loaded system.

Recently I did manage to trace some points in the code and see at least
one scenario for the race condition bug.

 1. Thread A  erl_set_timer (time.c)            Lock Timing Wheel
 2. Thread A  insert_timer (time.c)             Insert Timer T1
 3. Thread A  erl_set_timer (time.c)            Unlock Timing Wheel
 4. Thread B  bump_timer_internal (time.c)      Lock Timing Wheel
 5. Thread A  cancel_timer (erl_process.c)      Cancel timer T1
 6. Thread B  bump_timer_internal (time.c)      Build list of Expired Timers
 7. Thread A  erl_cancel_timer (time.c)         Cancel timer T1: Waiting for Timing Wheel Lock
 8. Thread B  bump_timer_internal (time.c)      Unlock Timing Wheel
 9. Thread C  set_timer (erl_process.c)         New Timeout Request (T2)
10. Thread B  bump_timer_internal (time.c)      Call Expired Timers Callbacks
11. Thread B  free_ptimer (utils.c)             Timer T1 callback invokes free_ptimer()
12. Thread C  erts_create_smp_ptimer (utils.c)  Create Timer ErtsSmpPTimer for T2
13. Thread B  free_ptimer (utils.c)             Free ErtsSmpPTimer memory block
14. Thread C  erts_create_smp_ptimer (utils.c)  Allocate ErtsSmpPTimer for T2, block reused!
15. Thread C  erl_set_timer (time.c)            erl_set_timer invoked for T2
16. Thread C  erl_set_timer (time.c)            Lock Timing Wheel
17. Thread C  insert_timer (time.c)             Insert Timer T2
18. Thread C  erl_set_timer (time.c)            Unlock Timing Wheel
19. Thread A  erl_cancel_timer (time.c)         Lock Timing Wheel
20. Thread A  erl_cancel_timer (time.c)         Remove ex-T1 == T2 from the timing wheel
21. Thread A  erl_cancel_timer (time.c)         Unlock Timing Wheel

See also attached diagram.

Looks like one more mutex is required, to exclude release of the
ErtsSmpPTimer memory block by the timeout callback if a cancel request
was issued for the timer, and vice versa. The two points of control,
cancel timer and timer expiration, should not interfere.
This bug happens only in SMP mode, since there an additional timer
control structure, ErtsSmpPTimer, is used between the emulator and the
timing wheel.

Mikael Pettersson wrote:
> Dmitriy Kargapolov writes:
> >
> > When running erl with -smp +S 2 option, sometimes process gets stuck in
> > timer:sleep/1.
> > Process code looks like:
> >
> > some_receiver(State) ->
> >     NewState = receive
> >         % legal packet
> >         {some_keyword, Address, Port, Packet} ->
> >             State1 = handle_packet(Address, Port, Packet, State),
> >             timer:sleep(get_loop_delay()),
> >             State1;
> >         % unknown message
> >         _ ->
> >             State
> >     end,
> >     some_receiver(NewState).
> >
> > Delay value varies in range 1..999
> >
> > Since timer:sleep/1 implemented as:
> > sleep(T) ->
> >     receive
> >     after T -> ok
> >     end.
> > it seems to be problem with "after" in smp implementation in R11B-0
> >
> > I don't have more details yet but will continue testing.
> > My platform: 2.6.9-5.ELsmp #1 SMP i686 i686 i386 GNU/Linux
>
> Interesting. Please send us a small standalone module that exhibits
> the bug, and I'll see if I can reproduce it.
>
> /Mikael
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: RaceCond.pdf
Type: application/pdf
Size: 16115 bytes
Desc: not available
URL:

From toby@REDACTED Fri Dec 22 21:49:19 2006
From: toby@REDACTED (Toby Thain)
Date: Fri, 22 Dec 2006 15:49:19 -0500
Subject: [erlang-bugs] [erlang-questions] R11B-2 SMP Timer Race Condition Bug [Re: bug in timer:sleep/1 smp implementation (R11B-0)]
In-Reply-To: <458C3E6B.5090509@hq.idt.net>
References: <44917D65.4040703@corp.idt.net> <17553.35499.634400.359802@alkaid.it.uu.se> <458C377B.3050603@corp.idt.net> <458C3E6B.5090509@hq.idt.net>
Message-ID:

On 22-Dec-06, at 3:22 PM, Serge Aleynikov wrote:

> Additionally, I should say that we've been able to reproduce this bug on
> several Linux platforms ...
> It happens when all the CPUs in SMP mode are over 75% loaded. The bug
> doesn't happen immediately after starting a release, but after a period
> of 5 min to 3 hours, which makes it pretty hard to diagnose. The
> tracing method that we initially tried to use was to include printf
> statements in the emulator to stderr. However, this prevented the bug
> from showing up. Further it was changed to using SysV message queue to
> communicate trace to an external process that dumped the trace to a
> file. This allowed to gain further understanding of the problem, but as
> Dmitry indicated any attempt to reduce the code to a minimal example
> made the problem disappear.

Could DTrace on Solaris help?

--Toby

From rickard.s.green@REDACTED Sun Dec 24 01:36:25 2006
From: rickard.s.green@REDACTED (Rickard Green)
Date: Sun, 24 Dec 2006 01:36:25 +0100
Subject: [erlang-bugs] [erlang-questions] R11B-2 SMP Timer Race Condition Bug [Re: bug in timer:sleep/1 smp implementation (R11B-0)]
In-Reply-To: <458C3E6B.5090509@hq.idt.net>
References: <44917D65.4040703@corp.idt.net> <17553.35499.634400.359802@alkaid.it.uu.se> <458C377B.3050603@corp.idt.net> <458C3E6B.5090509@hq.idt.net>
Message-ID: <458DCB89.2090400@ericsson.com>

Thanks for your detailed bug report. I'll look at this as soon as possible.

BR,
Rickard Green, Erlang/OTP

Serge Aleynikov wrote:
> Additionally, I should say that we've been able to reproduce this bug on
> several Linux platforms (RH ES 4.0, Fedora Core 5.0, 32bit and 64bit) in
> R11B-0, R11B-1 and R11B-2. The bug (what appears to be a race
> condition) is seen only if the emulator is started in the SMP mode and
> results in the following construct blocking infinitely in the context of
> some Erlang process handing a message dispatching function:
>
> receive
> after N -> % Where N is between 1 and 999
>     ok
> end.
>
> It happens when all the CPUs in SMP mode are over 75% loaded. The bug
> doesn't happen immediately after starting a release, but after a period
> of 5 min to 3 hours, which makes it pretty hard to diagnose. The
> tracing method that we initially tried to use was to include printf
> statements in the emulator to stderr. However, this prevented the bug
> from showing up. Further it was changed to using SysV message queue to
> communicate trace to an external process that dumped the trace to a
> file. This allowed to gain further understanding of the problem, but as
> Dmitry indicated any attempt to reduce the code to a minimal example
> made the problem disappear.
>
> The emulator code is quite involved, but hopefully someone in the OTP
> team could come up with a recommendation of how/where to put a missing
> synchronization. If needed we can arrange for a remote SSH login to the
> system(s) where the problem is reproducible.
>
> Regards,
>
> Serge
>
> Dmitriy Kargapolov wrote:
>> Unfortunately I can not create standalone test for this bug, even when I
>> became much more close to understanding the effect.
>> This bug appears only in highly loaded system.
>>
>> Recently I did manage to trace some points in the code and see at least
>> one scenario for the race condition bug.
>>
>>  1. Thread A  erl_set_timer (time.c)            Lock Timing Wheel
>>  2. Thread A  insert_timer (time.c)             Insert Timer T1
>>  3. Thread A  erl_set_timer (time.c)            Unlock Timing Wheel
>>  4. Thread B  bump_timer_internal (time.c)      Lock Timing Wheel
>>  5. Thread A  cancel_timer (erl_process.c)      Cancel timer T1
>>  6. Thread B  bump_timer_internal (time.c)      Build list of Expired Timers
>>  7. Thread A  erl_cancel_timer (time.c)         Cancel timer T1: Waiting for Timing Wheel Lock
>>  8. Thread B  bump_timer_internal (time.c)      Unlock Timing Wheel
>>  9. Thread C  set_timer (erl_process.c)         New Timeout Request (T2)
>> 10. Thread B  bump_timer_internal (time.c)      Call Expired Timers Callbacks
>> 11. Thread B  free_ptimer (utils.c)             Timer T1 callback invokes free_ptimer()
>> 12. Thread C  erts_create_smp_ptimer (utils.c)  Create Timer ErtsSmpPTimer for T2
>> 13. Thread B  free_ptimer (utils.c)             Free ErtsSmpPTimer memory block
>> 14. Thread C  erts_create_smp_ptimer (utils.c)  Allocate ErtsSmpPTimer for T2, block reused!
>> 15. Thread C  erl_set_timer (time.c)            erl_set_timer invoked for T2
>> 16. Thread C  erl_set_timer (time.c)            Lock Timing Wheel
>> 17. Thread C  insert_timer (time.c)             Insert Timer T2
>> 18. Thread C  erl_set_timer (time.c)            Unlock Timing Wheel
>> 19. Thread A  erl_cancel_timer (time.c)         Lock Timing Wheel
>> 20. Thread A  erl_cancel_timer (time.c)         Remove ex-T1 == T2 from the timing wheel
>> 21. Thread A  erl_cancel_timer (time.c)         Unlock Timing Wheel
>>
>> See also attached diagram.
>>
>> Looks like one more mutex required, excluding release of ErtsSmpPTimer
>> memory block by timeout callback if cancel request was issued for the
>> timer and vise versa. The two point of control - cancel timer and timer
>> expiration should not interfere.
>> This bug happens only in SMP mode since there additional timer control
>> structure ErtsSmpPTimer is used between emulator and timing wheel.
>>
>> Mikael Pettersson wrote:
>>> Dmitriy Kargapolov writes:
>>> >
>>> > When running erl with -smp +S 2 option, sometimes process gets stuck in
>>> > timer:sleep/1.
>>> > Process code looks like:
>>> >
>>> > some_receiver(State) ->
>>> >     NewState = receive
>>> >         % legal packet
>>> >         {some_keyword, Address, Port, Packet} ->
>>> >             State1 = handle_packet(Address, Port, Packet, State),
>>> >             timer:sleep(get_loop_delay()),
>>> >             State1;
>>> >         % unknown message
>>> >         _ ->
>>> >             State
>>> >     end,
>>> >     some_receiver(NewState).
>>> >
>>> > Delay value varies in range 1..999
>>> >
>>> > Since timer:sleep/1 implemented as:
>>> > sleep(T) ->
>>> >     receive
>>> >     after T -> ok
>>> >     end.
>>> > it seems to be problem with "after" in smp implementation in R11B-0
>>> >
>>> > I don't have more details yet but will continue testing.
>>> > My platform: 2.6.9-5.ELsmp #1 SMP i686 i686 i386 GNU/Linux
>>>
>>> Interesting. Please send us a small standalone module that exhibits
>>> the bug, and I'll see if I can reproduce it.
>>>
>>> /Mikael
>>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://www.erlang.org/mailman/listinfo/erlang-questions
>

From ayrnieu@REDACTED Wed Dec 27 14:56:24 2006
From: ayrnieu@REDACTED (Julian Fondren)
Date: Wed, 27 Dec 2006 08:56:24 -0500
Subject: [erlang-bugs] Memory explosion in http_util:integer_to_hexlist/1
Message-ID:

inets-4.7.6/src/http_util.erl contains this code:

  integer_to_hexlist(Num)->
      integer_to_hexlist(Num, get_size(Num), []).

Which should probably be:

  integer_to_hexlist(Num) when integer(Num) ->
      integer_to_hexlist(Num, get_size(Num), []).

Thank you,
Julian

ps. do not do this: http_util:integer_to_hexlist([0]).

From rickard.s.green@REDACTED Wed Dec 27 18:54:29 2006
From: rickard.s.green@REDACTED (Rickard Green)
Date: Wed, 27 Dec 2006 18:54:29 +0100
Subject: [erlang-bugs] [erlang-questions] R11B-2 SMP Timer Race Condition Bug [Re: bug in timer:sleep/1 smp implementation (R11B-0)]
In-Reply-To: <458DCB89.2090400@ericsson.com>
References: <44917D65.4040703@corp.idt.net> <17553.35499.634400.359802@alkaid.it.uu.se> <458C377B.3050603@corp.idt.net> <458C3E6B.5090509@hq.idt.net> <458DCB89.2090400@ericsson.com>
Message-ID: <4592B355.4000901@ericsson.com>

The process lock plays an important role here. Unfortunately a faulty
optimization (blush) prevented the process lock from playing that role.
ptimer_timeout() has to acquire the process lock before looking at the
ptimer flags. I've attached a patch that should fix the problem.

$ tar -zxf otp_src_R11B-2.tar.gz
$ patch -p0 < ptimer.patch
patching file `otp_src_R11B-2/erts/emulator/beam/utils.c'

Please, report to us whether or not the problem went away.

Great work Dmitriy and Serge! Many thanks!

BR,
Rickard Green, Erlang/OTP

Rickard Green wrote:
> Thanks for your detailed bug report. I'll look at this as soon as possible.
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ptimer.patch
URL:

From chris.newcombe@REDACTED Thu Dec 28 17:09:18 2006
From: chris.newcombe@REDACTED (Chris Newcombe)
Date: Thu, 28 Dec 2006 08:09:18 -0800
Subject: [erlang-bugs] [erlang-questions] R11B-2 SMP Timer Race Condition Bug [Re: bug in timer:sleep/1 smp implementation (R11B-0)]
In-Reply-To: <4592B355.4000901@ericsson.com>
References: <44917D65.4040703@corp.idt.net> <17553.35499.634400.359802@alkaid.it.uu.se> <458C377B.3050603@corp.idt.net> <458C3E6B.5090509@hq.idt.net> <458DCB89.2090400@ericsson.com> <4592B355.4000901@ericsson.com>
Message-ID: <781dd98c0612280809y2ec6554gd9f8c5397377781f@mail.gmail.com>

Hi Rickard,

First of all, many thanks indeed for the very fast response time on
investigating and fixing issues like this! That level of
responsiveness really helps reassure new adopters of Erlang.

How risky is this patch? i.e. Should everyone apply it?

Is the patch ...

a) An experimental fix that needs testing by Serge and Dmitriy before
   others consider it.

b) A definite fix for a definite problem, and has been tested. But
   it may or may not be the problem that Serge and Dmitriy found.

regards,

Chris

On 12/27/06, Rickard Green wrote:
> The process lock plays an important role here. Unfortunately a faulty
> optimization (blush) prevented the process lock from playing that role.
> ptimer_timeout() has to acquire the process lock before looking at the
> ptimer flags. I've attached a patch that should fix the problem.
>
> $ tar -zxf otp_src_R11B-2.tar.gz
> $ patch -p0 < ptimer.patch
> patching file `otp_src_R11B-2/erts/emulator/beam/utils.c'
>
> Please, report to us whether or not the problem went away.
>
> Great work Dmitriy and Serge! Many thanks!
>
> BR,
> Rickard Green, Erlang/OTP
>
> Rickard Green wrote:
> > Thanks for your detailed bug report. I'll look at this as soon as possible.
> >>>>
> >>>> /Mikael
> >>>>
> >>> ------------------------------------------------------------------------
> >>>
> >>> _______________________________________________
> >>> erlang-questions mailing list
> >>> erlang-questions@REDACTED
> >>> http://www.erlang.org/mailman/listinfo/erlang-questions
> > _______________________________________________
> > erlang-bugs mailing list
> > erlang-bugs@REDACTED
> > http://www.erlang.org/mailman/listinfo/erlang-bugs
> >
>
>
>
> --- otp_src_R11B-2/erts/emulator/beam/utils.c  2006-11-06 14:51:50.000000000 +0100
> +++ otp_src_R11B-2.ptimer_patch/erts/emulator/beam/utils.c  2006-12-27 18:11:44.772758000 +0100
> @@ -2999,15 +2999,16 @@
>  static void
>  ptimer_timeout(ErtsSmpPTimer *ptimer)
>  {
> -    if (!(ptimer->timer.flags & ERTS_PTMR_FLG_CANCELLED)) {
>          if (is_internal_pid(ptimer->timer.id)) {
>              Process *p;
> -            p = erts_pid2proc(NULL,
> -                              0,
> -                              ptimer->timer.id,
> -                              ERTS_PROC_LOCK_MAIN|ERTS_PROC_LOCK_STATUS);
> +            p = erts_pid2proc_opt(NULL,
> +                                  0,
> +                                  ptimer->timer.id,
> +                                  ERTS_PROC_LOCK_MAIN|ERTS_PROC_LOCK_STATUS,
> +                                  ERTS_P2P_FLG_ALLOW_OTHER_X);
>              if (p) {
> -                if (!(ptimer->timer.flags & ERTS_PTMR_FLG_CANCELLED)) {
> +                if (!p->is_exiting
> +                    && !(ptimer->timer.flags & ERTS_PTMR_FLG_CANCELLED)) {
>                      ASSERT(*ptimer->timer.timer_ref == ptimer);
>                      *ptimer->timer.timer_ref = NULL;
>                      (*ptimer->timer.timeout_func)(p);
> @@ -3028,7 +3029,6 @@
>                  erts_smp_io_unlock();
>              }
>          }
> -    }
>      free_ptimer(ptimer);
>  }
>
>
>
>
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED
> http://www.erlang.org/mailman/listinfo/erlang-bugs
>
>
>

From rickard.s.green@REDACTED Fri Dec 29 14:33:46 2006
From: rickard.s.green@REDACTED (Rickard Green)
Date: Fri, 29 Dec 2006 14:33:46 +0100
Subject: [erlang-bugs] [erlang-questions] R11B-2 SMP Timer Race Condition Bug [Re: bug in timer:sleep/1 smp implementation (R11B-0)]
In-Reply-To: <781dd98c0612280809y2ec6554gd9f8c5397377781f@mail.gmail.com>
References: <44917D65.4040703@corp.idt.net> <17553.35499.634400.359802@alkaid.it.uu.se> <458C377B.3050603@corp.idt.net> <458C3E6B.5090509@hq.idt.net> <458DCB89.2090400@ericsson.com> <4592B355.4000901@ericsson.com> <781dd98c0612280809y2ec6554gd9f8c5397377781f@mail.gmail.com>
Message-ID: <4595193A.8040902@ericsson.com>

The scenario described by Serge and Dmitriy can happen due to this bug.
The fix has been tested and I am quite sure it will fix the described
problem. There could of course exist yet another bug causing the same
problem, but I don't think so. The results of Serge's and Dmitriy's
tests are of course interesting, but regardless of that, the patch fixes
a real bug. If you use the smp emulator, apply the patch.

BR,
Rickard Green, Erlang/OTP

Chris Newcombe wrote:
> Hi Rickard,
>
> First of all, many thanks indeed for the very fast response time on
> investigating and fixing issues like this! That level of
> responsiveness really helps reassure new adopters of Erlang.
>
> How risky is this patch? i.e. Should everyone apply it?
>
> Is the patch ...
>
> a) An experimental fix that needs testing by Serge and Dmitriy before
>    others consider it.
>
> b) A definite fix for a definite problem, and has been tested. But
>    it may or may not be the problem that Serge and Dmitriy found.
>
> regards,
>
> Chris
>
> On 12/27/06, Rickard Green wrote:
>> The process lock plays an important role here. Unfortunately a faulty
>> optimization (blush) prevented the process lock from playing that role.
>> ptimer_timeout() has to acquire the process lock before looking at the
>> ptimer flags. I've attached a patch that should fix the problem.
>>
>> $ tar -zxf otp_src_R11B-2.tar.gz
>> $ patch -p0 < ptimer.patch
>> patching file `otp_src_R11B-2/erts/emulator/beam/utils.c'
>>
>> Please, report to us whether or not the problem went away.
>>
>> Great work Dmitriy and Serge! Many thanks!
>>
>> BR,
>> Rickard Green, Erlang/OTP
>>
>> Rickard Green wrote:
>> > Thanks for your detailed bug report.
I'll look at this as soon as >> possible. >> > >> > BR, >> > Rickard Green, Erlang/OTP >> > >> > Serge Aleynikov wrote: >> >> Additionally, I should say that we've been able to reproduce this >> bug on >> >> several Linux platforms (RH ES 4.0, Fedora Core 5.0, 32bit and >> 64bit) in >> >> R11B-0, R11B-1 and R11B-2. The bug (what appears to be a race >> >> condition) is seen only if the emulator is started in the SMP mode and >> >> results in the following construct blocking infinitely in the >> context of >> >> some Erlang process handing a message dispatching function: >> >> >> >> receive >> >> after N -> % Where N is between 1 and 999 >> >> ok >> >> end. >> >> >> >> It happens when all the CPUs in SMP mode are over 75% loaded. The bug >> >> doesn't happen immediately after starting a release, but after a >> period >> >> of 5 min to 3 hours, which makes it pretty hard to diagnose. The >> >> tracing method that we initially tried to use was to include printf >> >> statements in the emulator to stderr. However, this prevented the bug >> >> from showing up. Further it was changed to using SysV message >> queue to >> >> communicate trace to an external process that dumped the trace to a >> >> file. This allowed to gain further understanding of the problem, >> but as >> >> Dmitry indicated any attempt to reduce the code to a minimal example >> >> made the problem disappear. >> >> >> >> The emulator code is quite involved, but hopefully someone in the OTP >> >> team could come up with a recommendation of how/where to put a missing >> >> synchronization. If needed we can arrange for a remote SSH login >> to the >> >> system(s) where the problem is reproducible. >> >> >> >> Regards, >> >> >> >> Serge >> >> >> >> Dmitriy Kargapolov wrote: >> >>> Unfortunately I can not create standalone test for this bug, even >> when I >> >>> became much more close to understanding the effect. >> >>> This bug appears only in highly loaded system. 
>> >>> >> >>> Recently I did manage to trace some points in the code and see at >> least >> >>> one scenario for the race condition bug. >> >>> >> >>> 1. Thread A erl_set_timer (time.c) Lock Timing Wheel >> >>> 2. Thread A insert_timer (time.c) Insert Timer T1 >> >>> 3. Thread A erl_set_timer (time.c) Unlock Timing Wheel >> >>> 4. Thread B bump_timer_internal (time.c) Lock Timing Wheel >> >>> 5. Thread A cancel_timer (erl_process.c) Cancel timer T1 >> >>> 6. Thread B bump_timer_internal (time.c) Build list of >> Expired >> >>> Timers >> >>> 7. Thread A erl_cancel_timer (time.c) Cancel timer T1: >> >>> Waiting for Timing Wheel Lock >> >>> 8. Thread B bump_timer_internal (time.c) Unlock Timing Wheel >> >>> 9. Thread C set_timer (erl_process.c) New Timeout >> Request (T2) >> >>> 10. Thread B bump_timer_internal (time.c) Call Expired Timers >> >>> Callbacks >> >>> 11. Thread B free_ptimer (utils.c) Timer T1 callback >> >>> invokes free_ptimer() >> >>> 12. Thread C erts_create_smp_ptimer (utils.c) Create Timer >> >>> ErtsSmpPTimer for T2 >> >>> 13. Thread B free_ptimer (utils.c) Free ErtsSmpPTimer >> >>> memory block >> >>> 14. Thread C erts_create_smp_ptimer (utils.c) Allocate >> ErtsSmpPTimer >> >>> for T2, block reused! >> >>> 15. Thread C erl_set_timer (time.c) erl_set_timer >> invoked >> >>> for T2 >> >>> 16. Thread C erl_set_timer (time.c) Lock Timing Wheel >> >>> 17. Thread C insert_timer (time.c) Insert Timer T2 >> >>> 18. Thread C erl_set_timer (time.c) Unlock Timing Wheel >> >>> 19. Thread A erl_cancel_timer (time.c) Lock Timing Wheel >> >>> 20. Thread A erl_cancel_timer (time.c) Remove ex-T1 == T2 >> >>> from the timing wheel >> >>> 21. Thread A erl_cancel_timer (time.c) Unlock Timing Wheel >> >>> >> >>> See also attached diagram. >> >>> >> >>> Looks like one more mutex required, excluding release of >> ErtsSmpPTimer >> >>> memory block by timeout callback if cancel request was issued for the >> >>> timer and vise versa. 
The two points of control, cancel timer and timer
>> >>> expiration, should not interfere.
>> >>> This bug happens only in SMP mode, since there an additional timer control
>> >>> structure, ErtsSmpPTimer, is used between the emulator and the timing wheel.
>> >>>
>> >>> Mikael Pettersson wrote:
>> >>>> Dmitriy Kargapolov writes:
>> >>>>  > When running erl with the -smp +S 2 option, sometimes a process gets
>> >>>>  > stuck in timer:sleep/1.
>> >>>>  > Process code looks like:
>> >>>>  >
>> >>>>  > some_receiver(State) ->
>> >>>>  >     NewState = receive
>> >>>>  >         % legal packet
>> >>>>  >         {some_keyword, Address, Port, Packet} ->
>> >>>>  >             State1 = handle_packet(Address, Port, Packet, State),
>> >>>>  >             timer:sleep(get_loop_delay()),
>> >>>>  >             State1;
>> >>>>  >         % unknown message
>> >>>>  >         _ ->
>> >>>>  >             State
>> >>>>  >     end,
>> >>>>  >     some_receiver(NewState).
>> >>>>  >
>> >>>>  > Delay value varies in range 1..999
>> >>>>  >
>> >>>>  > Since timer:sleep/1 is implemented as:
>> >>>>  > sleep(T) ->
>> >>>>  >     receive
>> >>>>  >     after T -> ok
>> >>>>  >     end.
>> >>>>  > it seems to be a problem with "after" in the smp implementation in R11B-0
>> >>>>  >
>> >>>>  > I don't have more details yet but will continue testing.
>> >>>>  > My platform: 2.6.9-5.ELsmp #1 SMP i686 i686 i386 GNU/Linux
>> >>>>
>> >>>> Interesting. Please send us a small standalone module that exhibits
>> >>>> the bug, and I'll see if I can reproduce it.
>> >>>>
>> >>>> /Mikael
>> >>>>
>> >>> ------------------------------------------------------------------------
>> >>>
>> >>> _______________________________________________
>> >>> erlang-questions mailing list
>> >>> erlang-questions@REDACTED
>> >>> http://www.erlang.org/mailman/listinfo/erlang-questions
>> > _______________________________________________
>> > erlang-bugs mailing list
>> > erlang-bugs@REDACTED
>> > http://www.erlang.org/mailman/listinfo/erlang-bugs
>> >
>>
>> --- otp_src_R11B-2/erts/emulator/beam/utils.c 2006-11-06 14:51:50.000000000 +0100
>> +++ otp_src_R11B-2.ptimer_patch/erts/emulator/beam/utils.c 2006-12-27 18:11:44.772758000 +0100
>> @@ -2999,15 +2999,16 @@
>>  static void
>>  ptimer_timeout(ErtsSmpPTimer *ptimer)
>>  {
>> -    if (!(ptimer->timer.flags & ERTS_PTMR_FLG_CANCELLED)) {
>>          if (is_internal_pid(ptimer->timer.id)) {
>>              Process *p;
>> -            p = erts_pid2proc(NULL,
>> -                              0,
>> -                              ptimer->timer.id,
>> -                              ERTS_PROC_LOCK_MAIN|ERTS_PROC_LOCK_STATUS);
>> +            p = erts_pid2proc_opt(NULL,
>> +                                  0,
>> +                                  ptimer->timer.id,
>> +                                  ERTS_PROC_LOCK_MAIN|ERTS_PROC_LOCK_STATUS,
>> +                                  ERTS_P2P_FLG_ALLOW_OTHER_X);
>>              if (p) {
>> -                if (!(ptimer->timer.flags & ERTS_PTMR_FLG_CANCELLED)) {
>> +                if (!p->is_exiting
>> +                    && !(ptimer->timer.flags & ERTS_PTMR_FLG_CANCELLED)) {
>>                      ASSERT(*ptimer->timer.timer_ref == ptimer);
>>                      *ptimer->timer.timer_ref = NULL;
>>                      (*ptimer->timer.timeout_func)(p);
>> @@ -3028,7 +3029,6 @@
>>                  erts_smp_io_unlock();
>>              }
>>          }
>> -    }
>>      free_ptimer(ptimer);
>>  }
>>
>>
>> _______________________________________________
>> erlang-bugs mailing list
>> erlang-bugs@REDACTED
>> http://www.erlang.org/mailman/listinfo/erlang-bugs
>>
>

From dmitry.kargapolov@REDACTED Fri Dec 29 17:15:50 2006
From: dmitry.kargapolov@REDACTED (Dmitriy Kargapolov)
Date: Fri, 29 Dec 2006 11:15:50 -0500
Subject: [erlang-bugs] [erlang-questions] R11B-2 SMP Timer Race Condition Bug [Re: bug in timer:sleep/1 smp implementation (R11B-0)]
In-Reply-To: <4595193A.8040902@ericsson.com>
References: <44917D65.4040703@corp.idt.net> <17553.35499.634400.359802@alkaid.it.uu.se> <458C377B.3050603@corp.idt.net> <458C3E6B.5090509@hq.idt.net> <458DCB89.2090400@ericsson.com> <4592B355.4000901@ericsson.com> <781dd98c0612280809y2ec6554gd9f8c5397377781f@mail.gmail.com> <4595193A.8040902@ericsson.com>
Message-ID: <45953F36.1050301@corp.idt.net>

The first test of the patch has completed successfully.
We will continue testing after the holiday, but so far the patch works fine.
Thank you very much for fixing the problem in such a short time!
BR and Happy New Year!

Rickard Green wrote:
> The scenario described by Serge and Dmitriy can happen due to this bug.
> The fix has been tested and I am quite sure it will fix the described
> problem. There could of course exist yet another bug causing the same
> problem, but I don't think so. The results of Serge's and Dmitriy's
> tests are of course interesting, but regardless of them the patch fixes a
> real bug. If you use the smp emulator, apply the patch.
>
> BR,
> Rickard Green, Erlang/OTP
>
> Chris Newcombe wrote:
>> Hi Rickard,
>>
>> First of all, many thanks indeed for the very fast response time on
>> investigating and fixing issues like this! That level of
>> responsiveness really helps reassure new adopters of Erlang.
>>
>> How risky is this patch? i.e. Should everyone apply it?
>>
>> Is the patch ...
>>
>> a) An experimental fix that needs testing by Serge and Dmitriy before
>> others consider it.
>>
>> b) A definite fix for a definite problem, and has been tested. But
>> it may or may not be the problem that Serge and Dmitriy found.
>>
>> regards,
>>
>> Chris
>>
>> On 12/27/06, Rickard Green wrote:
>>> The process lock plays an important role here. Unfortunately a faulty
>>> optimization (blush) prevented the process lock from playing that role.
>>> ptimer_timeout() has to acquire the process lock before looking at the
>>> ptimer flags.
I've attached a patch that should fix the problem.
>>>
>>> $ tar -zxf otp_src_R11B-2.tar.gz
>>> $ patch -p0 < ptimer.patch
>>> patching file `otp_src_R11B-2/erts/emulator/beam/utils.c'
>>>
>>> Please, report to us whether or not the problem went away.
>>>
>>> Great work Dmitriy and Serge! Many thanks!
>>>
>>> BR,
>>> Rickard Green, Erlang/OTP
>>>
>>> Rickard Green wrote:
>>> > Thanks for your detailed bug report. I'll look at this as soon as possible.
>>> >
>>> > BR,
>>> > Rickard Green, Erlang/OTP

From chris.newcombe@REDACTED Fri Dec 29 17:50:30 2006
From: chris.newcombe@REDACTED (Chris Newcombe)
Date:
Fri, 29 Dec 2006 08:50:30 -0800
Subject: [erlang-bugs] [erlang-questions] R11B-2 SMP Timer Race Condition Bug [Re: bug in timer:sleep/1 smp implementation (R11B-0)]
In-Reply-To: <4595193A.8040902@ericsson.com>
References: <44917D65.4040703@corp.idt.net> <17553.35499.634400.359802@alkaid.it.uu.se> <458C377B.3050603@corp.idt.net> <458C3E6B.5090509@hq.idt.net> <458DCB89.2090400@ericsson.com> <4592B355.4000901@ericsson.com> <781dd98c0612280809y2ec6554gd9f8c5397377781f@mail.gmail.com> <4595193A.8040902@ericsson.com>
Message-ID: <781dd98c0612290850y8a15767s832bfef04fc5aeff@mail.gmail.com>

Excellent -- many thanks again for fixing it so quickly.

Chris
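
For readers following the thread, here is a minimal sketch of the rule Rickard states for the fix: the timeout path takes the process lock first and only then looks at the cancelled flag. This is a simplified model and not the ERTS code. The names proc_t, ptimer_t, proc_init, ptimer_cancel and PTMR_FLG_CANCELLED are hypothetical stand-ins, a plain pthread mutex plays the part of the process lock, and having the cancel path set the flag under that same lock is an assumption of the sketch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for ERTS_PTMR_FLG_CANCELLED. */
#define PTMR_FLG_CANCELLED 1u

/* Hypothetical, heavily simplified process: one mutex stands in for
 * the process lock mentioned in the thread. */
typedef struct {
    pthread_mutex_t lock;
    bool is_exiting;
    int timeouts_delivered;   /* counts timeout callbacks that ran */
} proc_t;

/* Hypothetical, heavily simplified process timer. */
typedef struct {
    proc_t *owner;
    unsigned flags;
} ptimer_t;

void proc_init(proc_t *p)
{
    assert(p != NULL);
    pthread_mutex_init(&p->lock, NULL);
    p->is_exiting = false;
    p->timeouts_delivered = 0;
}

/* Cancel path (assumption of this sketch): the flag is set while
 * holding the owner's lock. */
void ptimer_cancel(ptimer_t *t)
{
    assert(t != NULL && t->owner != NULL);
    pthread_mutex_lock(&t->owner->lock);
    t->flags |= PTMR_FLG_CANCELLED;
    pthread_mutex_unlock(&t->owner->lock);
}

/* Timeout path: acquire the lock first, then inspect the flags.
 * Reading the flags before taking the lock is the kind of shortcut
 * the thread calls a faulty optimization. Returns true if the
 * timeout was delivered to the owner. */
bool ptimer_timeout(ptimer_t *t)
{
    bool fired = false;

    assert(t != NULL && t->owner != NULL);
    pthread_mutex_lock(&t->owner->lock);
    if (!t->owner->is_exiting && !(t->flags & PTMR_FLG_CANCELLED)) {
        t->owner->timeouts_delivered++;
        fired = true;
    }
    pthread_mutex_unlock(&t->owner->lock);
    return fired;
}
```

In this model a timer that was cancelled before ptimer_timeout() obtains the lock is never delivered, and neither is a timer whose owner is exiting, which mirrors the two conditions checked in the patched if statement.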