From jay@REDACTED Sun May  2 07:35:20 2010
From: jay@REDACTED (Jay)
Date: Sun, 02 May 2010 15:35:20 +1000
Subject: Bug in ssl_certificate.erl in R13B04
Message-ID: 

I came across the same issue with the latest Yaws (yaws-1.88) together
with R13B04. Access to HTTPS was totally denied. I had two choices: do a
quick hack to R13B04, or go back to R13B03. Of course, I chose the harder
way. I am quite sure this is not the right way to fix it, but it seems to
work for now:

+++ ssl_certificate.erl
@@ -147,6 +147,13 @@
 		    public_key:pkix_issuer_id(ErlCertCandidate, self);
 		false ->
 		    find_issuer(OtpCert, Key)
+	    end;
+	{Key, [{_Cert, ErlCertCandidate, not_encrypted}]} ->
+	    case public_key:pkix_is_issuer(OtpCert, ErlCertCandidate) of
+		true ->
+		    public_key:pkix_issuer_id(ErlCertCandidate, self);
+		false ->
+		    find_issuer(OtpCert, Key)
 	    end
     end.

Best Regards,
Jay

Sisutec
http://www.sisutec.com.au

From chetan.ahuja@REDACTED Mon May  3 22:29:04 2010
From: chetan.ahuja@REDACTED (Chetan Ahuja)
Date: Mon, 3 May 2010 13:29:04 -0700
Subject: infinite loop when beam.smp compiled with -O2 on debian lenny
Message-ID: 

Hi,

We hit a bug while running rabbitmq where the beam.smp process was stuck
in a tight loop in the erts_poll_info function. The process was eating up
100% of exactly one core (on a multi-core box) and rabbitmq was
dysfunctional. Unfortunately, I could not create a small test case to
reproduce this condition, but it would happen quite frequently while
rabbitmq was in operation.

The C code for the function didn't provide any hints on what could have
been spinning in that function (first time looking at this codebase
though). Finally, looking through the disassembly in gdb at the point
where our process was spinning, I saw the following lines in the
erts_poll_info_kp method:

0x00000000004f0fe9 : nopl 0x0(%rax)
0x00000000004f0ff0 : jmp 0x4f0fe9

(Similar assembly code can be seen when the KERNEL_POLL option is
disabled.)
Clearly the above will trivially spin forever any time we get into that
codepath. It looks suspiciously like some code got optimized out by the
compiler, leaving the crazy loop code.

So I compiled with -O1 and then with no optimization at all. With -O1, I
saw a weird jmp instruction jumping to its own address:

0x0000000000517102 : jmp 0x517102

With no optimization, none of those trivial spins existed, but I didn't
analyze the unoptimized code enough to say whether it can be proven to
have an infinite loop (i.e., whether the optimizing compiler is simply
doing its job vs. this being a compiler bug).

Anyway, this problem exists at least since the erlang-base_12.b.3-dfsg
debian package version and has been verified to exist in the github
version as of today.

Here's the gcc and debian version info:

$ gcc --version
gcc-4.3.real (Debian 4.3.2-1.1) 4.3.2
Copyright (C) 2008 Free Software Foundation, Inc.
$ cat /etc/debian_version
5.0

I'd be happy to provide any other info as needed.

Thanks
Chetan Ahuja

From dougedmunds@REDACTED Mon May  3 22:51:28 2010
From: dougedmunds@REDACTED (Doug Edmunds (gmail))
Date: Mon, 03 May 2010 13:51:28 -0700
Subject: Bug: process unexpectedly exits loop
Message-ID: <4BDF3750.7020200@gmail.com>

Hello,

I'm posting a module (conn3.erl) below. This module builds a hierarchical
tree of PIDs. There are two loops, one for entries, and another for the
position in the tree (called 'me'). Each 'entry' runs a copy of the
entry_loop. Each entry keeps track of its parent (one pid) and its
children (a list of 0 or more pids). Entries are not called directly.

The me process runs in the me_loop. It manages the entries, and moves up
and down in the tree, via: me ! add, me ! up, me ! down, me ! del,
me ! show.

The bug I've encountered is when trying to move down the tree when there
are multiple children. Here's the basic scenario. After running
conn3:start(), type:

me ! add.
me ! add.

to create two children. Now type:

me ! down.
Because there is more than one child, the code calls indexlist/3, which
returns a list of tuples: [{1, PID1}, {2, PID2}, ...]. Then the next line
in the 'down' clause prints that list. After that the user is supposed to
pick the child by entering the integer:

Input = get_user_input("Enter key: "),

But the pid exits the loop before it reaches that line. The 'me' pid is
still alive, but exits the loop. I get no error message. It fails both on
Windows XP and on Linux.

If someone can figure out what the problem is, much appreciated.

Doug Edmunds

--------------------
-module(conn3_full).
-compile(export_all).

%% usage: conn3_full:start().
%% then send messages to 'me' (see me_loop)

start() ->
    % process_flag(trap_exit, true),
    Me = spawn(?MODULE, me_loop, [[],[],[]]),
    register(me, Me),
    Top = spawn(?MODULE, entry_loop, [[],[],[]]),
    register(top, Top),
    me ! {first_time},
    top ! {first_time},
    % uncomment this next line to get to the problem faster
    % me ! add,
    me ! add,
    me ! show,
    ok.

me_loop(M,K,P) ->
    % io:format("--me_loop self(): ~p M: ~p K:~p P: ~p~n",[self(),M,K,P]),
    receive
        {first_time} ->
            io:format("--setting me to top: self(): ~p M: ~p K: ~p P: ~p ~n",
                      [self(), whereis(top), K, P]),
            NM = whereis(top),
            NK = K, NP = P;
        show ->
            io:format("--show self():~p M: ~p K: ~p P: ~p~n",[self(),M,K,P]),
            NM = M, NK = K, NP = P;
        add ->
            %% create an entry
            Pid = spawn(?MODULE, entry_loop, [[],[],M]),
            Pid ! {set_pid, Pid},
            %% update the entry that 'me' is copying
            M ! {p_add_kid, Pid},
            %% update 'me'
            K2 = [Pid|K],
            NM = M, NK = K2, NP = P;
        del ->
            case P of
                [] -> io:format("--At the top~n");
                _ ->
                    P ! {p_update_kids, M, K},
                    ok = connect_kids_to_P(K, P),
                    M ! die,
                    me ! up
            end,
            NM = M, NK = K, NP = P;
        down ->
            case length(K) of
                0 -> io:format("--No kids~n");
                1 ->
                    [Head|_] = K,
                    Head ! {self(), info_request};
                _ ->
                    Out = indexlist(1, K, []),
                    ok = io:format("~p~n", [Out]),
                    %%%%%% When more than one 'kid',
                    %%%%%% process drops out of loop here. BUG?
                    Input = get_user_input("Enter key: "),
                    {Int, Rest} = string:to_integer(Input),
                    case is_integer(Int) andalso Rest == [] of
                        true ->
                            Pick = pick_pid(Out, Int),
                            case is_pid(Pick) of
                                true ->
                                    Pick ! {self(), info_request};
                                _ ->
                                    io:format("that number is not on the list~n")
                            end;
                        _ ->
                            io:format("must enter an integer~n")
                    end
            end,
            NM = M, NK = K, NP = P;
        up ->
            case P of
                [] -> io:format("--At the top~n");
                _ -> P ! {self(), info_request}
            end,
            NM = M, NK = K, NP = P;
        {info_requested, M2, K2, P2} ->
            NM = M2, NK = K2, NP = P2;
        die ->
            io:format("~p died~n", [self()]),
            exit("killed"),
            NM = M, NK = K, NP = P;
        Anything ->
            io:format("--me_loop got this:~p~n", [Anything]),
            NM = M, NK = K, NP = P
    end,
    me_loop(NM, NK, NP).

entry_loop(M,K,P) ->
    % io:format("--entry_loop self(): ~p M: ~p K:~p P: ~p~n",[self(),M,K,P]),
    receive
        {first_time} ->
            io:format("--setting top self(): ~p M: ~p K: ~p P ~p ~n",
                      [self(), whereis(top), K, P]),
            NM = whereis(top),
            NK = K, NP = P;
        show ->
            io:format("--show self():~p M: ~p K: ~p P: ~p~n",[self(),M,K,P]),
            NM = M, NK = K, NP = P;
        {set_pid, Pid} ->
            NM = Pid, NK = K, NP = P;
        {From, info_request} ->
            From ! {info_requested, M, K, P},
            NM = M, NK = K, NP = P;
        {p_update_kids, Kid, GrandKidsList} ->
            K2 = lists:delete(Kid, K),
            K3 = lists:append(GrandKidsList, K2),
            %% still have to move me
            NM = M, NK = K3, NP = P;
        {kid_change_p, GrandP} ->
            P2 = GrandP,
            NM = M, NK = K, NP = P2;
        {p_add_kid, Pid} ->
            K2 = [Pid|K],
            NM = M, NK = K2, NP = P;
        % {tell_kids_about_Pid, Pid, Msg} ->
        %     Pidlist = [Pidx || Pidx <- K, is_pid(Pid), Pid /= Pidx],
        %     %%% exclude Pid
        %     %% io:format("--Pid list: ~p~n",[Pidlist]),
        %     ok = tell_list(Pidlist, Pid, Msg),
        %     NM = M, NK = K, NP = P;
        die ->
            io:format("~p died~n", [self()]),
            exit("killed"),
            NM = M, NK = K, NP = P;
        Anything ->
            io:format("--entry_loop Got this:~p~n", [Anything]),
            NM = M, NK = K, NP = P
    end,
    %% io:format("here i am~n"),
    entry_loop(NM, NK, NP).
indexlist(Start, [H|T], Out) ->
    NewOut = lists:append([{Start, H}], Out),
    Start2 = Start + 1,
    indexlist(Start2, T, NewOut);
indexlist(_, [], Out) ->
    lists:reverse(Out).

pick_pid(Out, Key) ->
    NewDict = dict:from_list(Out),
    case dict:is_key(Key, NewDict) of
        true -> dict:fetch(Key, NewDict);
        false -> "no such key"
    end.

get_user_input(Prompt) ->
    string:strip(           % remove spaces from front and back
      string:strip(         % remove line-feed from the end
        io:get_line(Prompt), right, $\n)).

connect_kids_to_P([], _) -> ok;
connect_kids_to_P(K, P) ->
    [H|T] = K,
    H ! {kid_change_p, P},
    connect_kids_to_P(T, P).

%%% not implemented
% tell_list([],_,_) -> ok;
% tell_list([H|T],X,Msg) -> H ! {Msg, X}, tell_list(T,X, Msg).

%%% macro-ish utility
b_alive(String) ->  % i.e. b_alive("<0.35.0>")
    is_process_alive(list_to_pid(String)).

From mikpe@REDACTED Mon May  3 23:54:31 2010
From: mikpe@REDACTED (Mikael Pettersson)
Date: Mon, 3 May 2010 23:54:31 +0200
Subject: [erlang-bugs] infinite loop when beam.smp compiled with -O2 on debian lenny
In-Reply-To: 
References: 
Message-ID: <19423.17943.80950.733236@pilspetsen.it.uu.se>

Chetan Ahuja writes:
 > Hi,
 >
 > We hit a bug while running rabbitmq where the beam.smp process was stuck
 > in a tight loop in the erts_poll_info method.
 > The process was eating up 100% of exactly one core (on a multi core box) and
 > rabbitmq was dysfunctional. Unfortunately
 > I could not create a small test case to reproduce this condition but it
 > would happen quite frequently while rabbitmq was in
 > operation.
 >
 > The C code for the function didn't provide any hints on what would have been
 > spinning in that function
 > (first time looking at this codebase though). Finally looking through the
 > disassembly in gdb, (at the point of where our process was spinning) I saw
 > the following lines in the
 > erts_poll_info_kp method:
 >
 > 0x00000000004f0fe9 : nopl 0x0(%rax)
 > 0x00000000004f0ff0 : jmp 0x4f0fe9
 >
 > (Similar assembly code can be seen when the KERNEL_POLL option is
 > disabled.)
 >
 > Clearly the above will trivially spin forever anytime we get into that
 > codepath. The above looks suspiciously like some code got optimized out
 > by the compiler leaving the crazy loop code.
 >
 > So I compiled with -O1 and then with no optimization at all. With -O1,
 > I saw a weird jmp instruction jumping to its own address:
 >
 > 0x0000000000517102 : jmp 0x517102
 >
 > With no optimization, none of those trivial spins existed but I didn't
 > analyze the unoptimized code enough to say whether it can be proven to
 > have an infinite loop (i.e., whether the optimizing compiler is simply
 > doing its job vs. this being a compiler bug).
 >
 > Anyway, this problem exists at least since the erlang-base_12.b.3-dfsg
 > debian package version and has been verified to exist in the github
 > version as of today.
 >
 > Here's the gcc and debian version info:
 > $ gcc --version
 > gcc-4.3.real (Debian 4.3.2-1.1) 4.3.2
 > Copyright (C) 2008 Free Software Foundation, Inc.

I looked at the procedure in question (not so easy to locate due to some
"creative" C preprocessor abuse), and noticed an obvious bug: there's a
loop over a linked list that forgets to actually advance the node pointer
to the next element. When optimizing, gcc will notice that the loop
doesn't terminate and omit the body of the loop (the calculations are
dead), which results in the type of object code shown above.

Thus, it's an Erlang VM bug, not a gcc miscompilation.

Try the patch below and let us know if it solves your problem.
/Mikael

--- otp_src_R13B03/erts/emulator/sys/common/erl_poll.c.~1~	2009-03-12 13:16:29.000000000 +0100
+++ otp_src_R13B03/erts/emulator/sys/common/erl_poll.c	2010-05-03 23:41:32.000000000 +0200
@@ -2404,6 +2404,7 @@ ERTS_POLL_EXPORT(erts_poll_info)(ErtsPol
 	while (urqbp) {
 	    size += sizeof(ErtsPollSetUpdateRequestsBlock);
 	    pending_updates += urqbp->len;
+	    urqbp = urqbp->next;
 	}
     }
 #endif

From sam@REDACTED Tue May  4 01:23:26 2010
From: sam@REDACTED (Sam Bobroff)
Date: Tue, 04 May 2010 09:23:26 +1000
Subject: [erlang-bugs] Bug: process unexpectedly exits loop
In-Reply-To: <4BDF3750.7020200@gmail.com>
References: <4BDF3750.7020200@gmail.com>
Message-ID: <4BDF5AEE.5040401@m5net.com>

Hi Doug,

On 4/05/10 6:51 AM, Doug Edmunds (gmail) wrote:
> Hello,
>
> I'm posting a module (conn3.erl) below.
> This module builds a hierarchical tree of PIDs.
> There are two loops, one for entries, and another
> for the position in the tree (called 'me').
[snip]

Getting a backtrace often helps. This is what I did:

$ erl
Erlang R13B03 (erts-5.7.4) [source] [smp:2:2] [rq:2] [async-threads:0] [kernel-poll:false]

Eshell V5.7.4  (abort with ^G)
1> conn3_full:start().
--setting me to top: self(): <0.34.0> M: <0.35.0> K: [] P: []
--setting top self(): <0.35.0> M: <0.35.0> K: [] P []
ok
--show self():<0.34.0> M: <0.35.0> K: [<0.37.0>,<0.36.0>] P: []
2> me ! down.
[{1,<0.37.0>},{2,<0.36.0>}]
down
3> {backtrace, BT} = process_info(whereis(me), backtrace).
{backtrace,<<"Program counter: 0x0079f3c8 (io:wait_io_mon_reply/2 + 28)\nCP: 0x00000000 (invalid)\narity = 0\n\n0x002f6cbc Ret"...>>}
4> io:fwrite("~s\n", [binary_to_list(BT)]).
Program counter: 0x0079f3c8 (io:wait_io_mon_reply/2 + 28)
CP: 0x00000000 (invalid)
arity = 0

0x002f6cbc Return addr 0x007a14c0 (conn3_full:get_user_input/1 + 20)
y(0)     #Ref<0.0.0.37>
y(1)     <0.25.0>

0x002f6cc8 Return addr 0x007a0bf4 (conn3_full:me_loop/3 + 676)

0x002f6ccc Return addr 0x001a1df4 ()
y(0)     []
y(1)     [{1,<0.37.0>},{2,<0.36.0>}]
y(2)     []
y(3)     [<0.37.0>,<0.36.0>]
y(4)     <0.35.0>
ok

I can see that "me" is still in its loop and that it's currently in
"io:wait_io_mon_reply". I don't know exactly what this function is, but
my guess would be it's something to do with the shell and io:get_line
(actually wait_io_mon_reply) fighting over the terminal input. If we try
again with -noshell it might be better, but then we won't be able to use
the shell to send messages to "me".

So, I modified the source to add "me ! down" in the set-up sequence at
line 16, and also uncommented the debug at the top of me_loop, and now I
get:

$ erl -noshell -run conn3_full
--me_loop self(): <0.29.0> M: [] K:[] P: []
--setting top self(): <0.30.0> M: <0.30.0> K: [] P []
--setting me to top: self(): <0.29.0> M: <0.30.0> K: [] P: []
--me_loop self(): <0.29.0> M: <0.30.0> K:[] P: []
--me_loop self(): <0.29.0> M: <0.30.0> K:[<0.31.0>] P: []
--me_loop self(): <0.29.0> M: <0.30.0> K:[<0.32.0>,<0.31.0>] P: []
--show self():<0.29.0> M: <0.30.0> K: [<0.32.0>,<0.31.0>] P: []
--me_loop self(): <0.29.0> M: <0.30.0> K:[<0.32.0>,<0.31.0>] P: []
[{1,<0.32.0>},{2,<0.31.0>}]
Enter key: 2
--me_loop self(): <0.29.0> M: <0.30.0> K:[<0.32.0>,<0.31.0>] P: []
--me_loop self(): <0.29.0> M: <0.31.0> K:[] P: <0.30.0>

I entered "2" at the prompt and the loop has continued :-)

Does that help?

Sam.

--
Sam Bobroff | sam@REDACTED | M5 Networks
Why does my email have those funny headers? Because I use PGP to sign my
email (and you should too!): that's how you know it's really from me.
See: http://en.wikipedia.org/wiki/Pretty_Good_Privacy

From chetan.ahuja@REDACTED Tue May  4 01:43:27 2010
From: chetan.ahuja@REDACTED (Chetan Ahuja)
Date: Mon, 3 May 2010 16:43:27 -0700
Subject: [erlang-bugs] infinite loop when beam.smp compiled with -O2 on debian lenny
In-Reply-To: <19423.17943.80950.733236@pilspetsen.it.uu.se>
References: <19423.17943.80950.733236@pilspetsen.it.uu.se>
Message-ID: 

Mikael,

Thanks a lot for that catch. I think that's it. Just did recompiles with
your patch (with -O2) and the body of the loop now shows up in the
generated code, and the trivial spin loop is gone.

I got blindsided by the optimizer completely eliminating the body of the
loop, due to which I couldn't even see urqbp on the stack at all! This
led me to the assumption that the surrounding macro
(ERTS_POLL_USE_UPDATE_REQUESTS_QUEUE) was perhaps undefined and that the
loop wasn't even compiled in. Yet another strike against coding C in
pre-processor macros.

Overall, it's a big relief to know that our standard install of gcc is
not generating such obviously buggy code. I look forward to seeing the
erts_poll_info fix in an upcoming git version.

Thanks a lot once again
Chetan

On Mon, May 3, 2010 at 2:54 PM, Mikael Pettersson wrote:
[snip]

From dougedmunds@REDACTED Tue May  4 06:41:26 2010
From: dougedmunds@REDACTED (Doug Edmunds (gmail))
Date: Mon, 03 May 2010 21:41:26 -0700
Subject: [erlang-bugs] Bug: process unexpectedly exits loop
In-Reply-To: <4BDF5AEE.5040401@m5net.com>
References: <4BDF3750.7020200@gmail.com> <4BDF5AEE.5040401@m5net.com>
Message-ID: <4BDFA576.6030605@gmail.com>

On 5/3/2010 4:23 PM, Sam Bobroff wrote:
> So, I modified the source to add "me ! down" in the set up sequence at
> line 16,

It only works the first time. Follow it with an interactive me ! up,
then me ! down, and you are right back at the problem.

The objective is to allow the 'me' process to traverse the entire tree,
and to add/delete branches. Being able to use me ! down (with a
selection) only one time isn't adequate to the goal.

I am trying to figure out a way to work around whatever conflict is
happening.

--
Doug Edmunds

From bgustavsson@REDACTED Tue May  4 08:27:26 2010
From: bgustavsson@REDACTED (Björn Gustavsson)
Date: Tue, 4 May 2010 08:27:26 +0200
Subject: [erlang-bugs] sys:get_status kills gen_servers registered globally with something other than an atom
In-Reply-To: <4BDA74F1.3050509@microforte.com>
References: <4BDA74F1.3050509@microforte.com>
Message-ID: 

2010/4/30 Paul Hampson :
> Using Erlang R13B04, I'm creating gen_servers with
> gen_server:start( { global, "Name" } ). This is valid according to the
> gen_server manpage, which states GlobalName is term().
> However, if I call sys:get_status({global, "Name"}) (or
> sys:get_status( global:whereis_name( "Name" ) ) to rule out the name
> lookup as an issue) the gen_server dies, with:
>
>     exception error: no true branch found when evaluating an if expression
>       in function  gen_server:format_status/2
>       in call from sys:get_status/5
>       in call from sys:do_cmd/6
>       in call from sys:handle_system_msg/8
>
> This is because of the following code in gen_server:format_status:
>
>     NameTag = if is_pid(Name) ->
>                      pid_to_list(Name);
>                  is_atom(Name) ->
>                      Name
>               end,
>     Header = lists:concat(["Status for generic server ", NameTag]),
>
> Which fails to handle that Name (which is stripped of {global,} in
> gen_server:name/1 or gen_server:get_proc_name/1) may be something other
> than an atom or pid.
>
> Interestingly, the comment above gen_server:start/3 indicates that the
> supplied server name is { global, atom() }, not { global, term() } as
> per the documentation.
>
> So either the documentation is wrong, or the gen_server
> implementation/comment is wrong.
>
> Bizarrely, I'm sure I was able to use sys:get_status against these same
> gen_servers a month ago, which would have been R13B03 or maybe a
> B13B03, but erlang/otp on github doesn't indicate any relevant changes.

The relevant change is in sys.erl in this commit:

http://github.com/erlang/otp/commit/88b530ea24977081020feb2123124063e58dfc12

The gen_server:format_status/2 function did not get called before that
change. Since that change introduces useful functionality, we don't plan
to revert it, but to fix the damage in R14.
One way to fix that problem would be simply to change the code for
calculating NameTag to:

    NameTag = if is_pid(Name) ->
                     pid_to_list(Name);
                 true ->
                     Name
              end,

--
Björn Gustavsson, Erlang/OTP, Ericsson AB

From mevans@REDACTED Tue May  4 16:12:04 2010
From: mevans@REDACTED (Evans, Matthew)
Date: Tue, 4 May 2010 10:12:04 -0400
Subject: pg2 is broken R13B04
Message-ID: 

Hi,

So after more tests I have seen that pg2 is definitely not working as
intended.

It appears that the root problem is how a new instantiation of pg2
within a cluster of Erlang nodes gets its data. The following sequence
of events occurs:

1) All nodes do a net_kernel:monitor_nodes(true) in the init function of
pg2.

2) The new instance of pg2 will send {new_pg2, node()} to all other
nodes in the pool.

3) The new instance of pg2 will send {nodeup, Node} to itself (for each
Node in nodes()).

What it appears is that when only 2 nodes are in the pool, things are
generally ok. However, the synchronization process gets muddied when
there are many members. Upon receipt of {new_pg2,Node} or {nodeup,Node},
each node literally goes through the table of pids in the ets pg2_table
and builds a list similar to:

[proxy_micro_cache,[<6325.319.0>,<6324.324.0>]]

This is dispatched to the new pg2 instance. The problem is every node
does that, so the new pg2 instance will end up with a table like:

[proxy_micro_cache,
 [<6437.319.0>,<6437.319.0>,<6437.319.0>,<6437.319.0>,
  <6437.319.0>,<6437.319.0>,<6436.324.0>,<6436.324.0>,
  <6436.324.0>,<6436.324.0>,<6436.324.0>,<6436.324.0>]]

Where there are many instances of each Pid, since each node has sent its
copy of the data, causing that process to be replicated many times (i.e.
the call to ets:update_counter in pg2:join_group)!

This problem is compounded further on nodes that join later, or when a
VM stops and is restarted. An additional problem arises when another
process in the new group sends pg2:join on its own.
In that case there is a timing window whereby the new instance could get
that new entry more than once.

My recommendation is:

1) Have a new pg2 join function called pg2:join_once. In this case a
process will never be permitted to have more than 1 join.

2) When a new node joins, one could either select only one node to get
its data from, or have all nodes in the system send the result of
ets:tab2list(pg2_table) to the new node, then have that data inserted
directly into its local ets table, as opposed to going through the
process of join_group (possibly with the additional step
erlang:monitor/2). In this way a process that has been registered more
than once would be inserted into the local ets table as a single
operation, as opposed to many times.

3) Possibly defer new requests to pg2:join on the new instance until
synchronization is complete.

I understand that gproc is on the way, but I suspect that pg2 does need
fixing.

Regards

Matt

From mevans@REDACTED Tue May  4 16:16:18 2010
From: mevans@REDACTED (Evans, Matthew)
Date: Tue, 4 May 2010 10:16:18 -0400
Subject: pg2 is broken R13B04
Message-ID: 

I should also add that we have had pg2 crash a VM on us, where we have
some groups with in excess of 1,000,000 members (when only 140 processes
have done a pg2:join - the join is done in the process's init function
so we know it's only sent once). Of course, what happens is:

1) A huge message is built and sent.

2) This message is created and processed in a list comprehension.

We have implemented a workaround by creating pg3 that only permits a
single join per process. We did this in pg2:join_group by modifying the
ets:update_counter UpdateOp from {2,+1} to {2,+1,1,1} (and similar logic
in pg2:leave_group).

Matt

From peterke@REDACTED Sun May  9 21:27:56 2010
From: peterke@REDACTED (Péter Szilágyi)
Date: Sun, 9 May 2010 22:27:56 +0300
Subject: epipe error on port, uncatchable exception
Message-ID: 

Hi,

I've been trying to get some basic port operations going, but sometimes
I get a very peculiar error: an epipe exception. The problem is that
according to the documentation this should never happen, yet it does,
and what's more, completely randomly. I can execute the same command and
one time it succeeds, another time it fails (let's say 1 in 10
failures). The even more interesting part is that I cannot catch the
exception.

I've written a very basic module to reproduce the error, which just
executes "ls -al" 1000 times (see below), passing in a small input data
(this is the reason for the crash). The exception below doesn't happen
on all machines (I'm using openSuSE 11.2 x64, with Erlang R13B04 (also
x64)). On an Ubuntu it ran just fine.

Now it may turn out that the OS is doing something strange causing the
broken pipes, BUT even so, I should be able to catch it.

Any feedback is appreciated,
Peter

portbug.erl:
----------
-module(portbug).
-compile(export_all).
crash_it() ->
    try
        lists:foreach(fun(_) -> do_something_portlike() end,
                      lists:seq(1, 1000))
    catch
        Class:Exception ->
            io:format("Caught: ~p:~p", [Class, Exception])
    end.

do_something_portlike() ->
    Command = "ls -al",
    Port = open_port({spawn, Command},
                     [stream, use_stdio, stderr_to_stdout, binary, eof]),
    Port ! {self(), {command, <<"some random data">>}},
    Port ! {self(), close}.
----------
(shell@REDACTED)232> portbug:crash_it().
exception exit: epipe**
----------

From vances@REDACTED Mon May 10 02:57:41 2010
From: vances@REDACTED (Vance Shipley)
Date: Sun, 9 May 2010 20:57:41 -0400
Subject: Inets httpd ignores debug options
Message-ID: <20100510005741.GF96961@h216-235-12-174.host.egate.net>

The init/1 callback in the supervisor simply ignores the documented
debug functions. That wasted a bit of my time.

--
	-Vance

From fritchie@REDACTED Sat May 15 23:01:16 2010
From: fritchie@REDACTED (Scott Lystig Fritchie)
Date: Sat, 15 May 2010 16:01:16 -0500
Subject: net_kernel hang, perhaps blocked by busy_dist_port race?
Message-ID: <63794.1273957276@snookles.snookles.com>

Hi, all. We've been bitten by a rather mysterious bug that has disrupted
Erlang message passing on roughly 10% of all nodes in a 100+ node
cluster. The same thing happened on 10 nodes within a 2-3 second time
window. No further communication with the affected nodes via Erlang
message passing is possible.

I'm wondering if there's a possible race condition when two nodes A and
Z are communicating with each other, like this:

1. Z makes a bunch of RPCs to A.
2. A starts sending RPC replies to Z.
3. Z decides to behave erratically, cause unknown.
4. A's TCP connection to Z becomes "busy", probably because Z cannot or
   will not read data on the A <-> Z TCP connection.
5. All processes on A that are trying to reply to Z are blocked and
   unscheduled; 'busy_dist_port' messages are generated for all of them.
6. The 'net_kernel' process on A is one of the procs blocked by the
   'busy_dist_port' events.
7.
   A's connection to Z is broken. The system message reported is:
   {nodedown_reason,connection_closed},{node_type,visible}
   ... and then A's 'net_kernel' process remains blocked forever? Or is
   alive but isn't working correctly?

Details below. Sorry I don't have a patch available.

-Scott

Environment
-----------

* Erlang/OTP R13B04, -smp auto +A 64 +K true
  - Patched to change "erts_de_busy_limit" from default 128KB to 4096KB
* Linux kernel, RedHat EL4 kernel IIRC of some flavor
* NTP configured and running correctly on all machines (to help
  correlate log file timestamps)
* Cluster of 50+ physical machines, 100+ Erlang VMs/nodes total
* All nodes are using the erlang:system_monitor() BIF for big heap, long
  garbage collection, and busy dist port events.
* All nodes report node up and down events via the BIF
  erlang:process_flag(monitor_nodes,true).

Sequence of events summary
--------------------------

1. One node in the cluster (checking the health of other nodes) makes
   several thousand gen_server RPC calls to various servers on all other
   nodes in somewhere between 1 and 5 second cycles (depending on what's
   being monitored). This node's name is 'app@REDACTED'.

2. The 'app@REDACTED' node hits a weird problem. We still can't figure
   out what happened, but it behaved like an extremely intermittent
   network partition that only affected boxZ.

3. Within 2-3 seconds, 10 nodes on the network become completely
   unresponsive and cannot recover:

   * The 'app@REDACTED' node cannot talk to them, but that's not
     surprising because 'app@REDACTED' is having its own problems.
   * All other nodes report net_tick_timeout errors.
   * All attempts such as "erl -sname tmp$$ -remsh app@REDACTED" to
     connect to the hosed node fail.

Sequence of events detail
-------------------------

* At time T, +- 1 second, there are multiple reports of the same net
  distribution port being blocked, e.g. #Port<0.213633546> on
  app@REDACTED:

  a.
{monitor,<0.6824.3617>,busy_dist_port,#Port<0.213633546>}
     This is from the VM, triggered by the erlang:system_monitor() BIF.
  b. sysmon_server: process <0.6824.3617> info: [{registered_name,foo},{initial_call,{proc_lib,init_p,5}},{current_function,{erlang,bif_return_trap,1}}]
     This is from the system_monitor event collector, which tries to find
     some helpful info about the process.

  All 10 machines register anywhere from 8 to 15 of these pairs of
  messages. For each machine, all complaints are about the same Erlang
  port #.

* Within T + 1 seconds, there's a report on the same port # that the
  net_kernel process has been blocked, e.g. on app@REDACTED:
  a. {monitor,<0.23.0>,busy_dist_port,#Port<0.213633546>}
  b. sysmon_server: process <0.23.0> info: [{registered_name,net_kernel},{initial_call,{proc_lib,init_p,5}},{current_function,{erlang,bif_return_trap,1}}]

  There is no direct evidence that the blocked ports, e.g.
  #Port<0.213633546> on app@REDACTED, are the ones used for communication
  with the app@REDACTED node, but it appears quite likely to be true.

* Within T + 3 seconds (and usually within T + 2 seconds), there's a
  report that app@REDACTED is down, e.g. on app@REDACTED:
  net_kernel: node app@REDACTED down info [{nodedown_reason,connection_closed},{node_type,visible}]

* 20 seconds later, all other nodes in the cluster drop their connections
  to these 9 nodes, due to the {nodedown_reason,net_tick_timeout} reason.

* No further communication via Erlang message passing is possible:
  existing nodes cannot reconnect, and new nodes (e.g.
  "erl -remsh app@REDACTED") cannot connect.

* We used "gcore" to snag core dumps from 4 of the 10 affected nodes. The
  GDB backtrace doesn't reveal much to my untrained eyes.

GNU gdb Fedora (6.8-27.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu"... Reading symbols from /lib64/libutil.so.1...done. Loaded symbols for /lib64/libutil.so.1 Reading symbols from /lib64/libdl.so.2...done. Loaded symbols for /lib64/libdl.so.2 Reading symbols from /lib64/libm.so.6...done. Loaded symbols for /lib64/libm.so.6 Reading symbols from /usr/lib64/libncurses.so.5...done. Loaded symbols for /usr/lib64/libncurses.so.5 Reading symbols from /lib64/libpthread.so.0...done. Loaded symbols for /lib64/libpthread.so.0 Reading symbols from /lib64/librt.so.1...done. Loaded symbols for /lib64/librt.so.1 Reading symbols from /lib64/libc.so.6...done. Loaded symbols for /lib64/libc.so.6 Reading symbols from /lib64/ld-linux-x86-64.so.2...done. Loaded symbols for /lib64/ld-linux-x86-64.so.2 Reading symbols from /ert/lib/erlang/lib/crypto-1.6.4/priv/lib/crypto_drv.so...done. Loaded symbols for /ert/lib/erlang/lib/crypto-1.6.4/priv/lib/crypto_drv.so Reading symbols from /ert/openssl/lib/libcrypto.so.0.9.8...done. Loaded symbols for /ert/openssl/lib/libcrypto.so.0.9.8 Core was generated by `/ert/lib/erlang/erts-5.7.5/bin/beam.smp'. 
[New process 18963] [New process 18964] [New process 18965] [New process 18966] [New process 18967] [New process 18968] [New process 18969] [New process 18970] [New process 18971] [New process 18972] [New process 18973] [New process 18974] [New process 18975] [New process 18976] [New process 18977] [New process 18978] [New process 18979] [New process 18980] [New process 18981] [New process 18982] [New process 18983] [New process 18984] [New process 18985] [New process 18986] [New process 18987] [New process 18988] [New process 18989] [New process 18990] [New process 18991] [New process 18992] [New process 18993] [New process 18994] [New process 18995] [New process 18996] [New process 18997] [New process 18998] [New process 18999] [New process 19000] [New process 19001] [New process 19002] [New process 19003] [New process 19004] [New process 19005] [New process 19006] [New process 19007] [New process 19008] [New process 19009] [New process 19010] [New process 19011] [New process 19012] [New process 19013] [New process 19014] [New process 19015] [New process 19016] [New process 19017] [New process 19018] [New process 19019] [New process 19020] [New process 19021] [New process 19022] [New process 19023] [New process 19024] [New process 19025] [New process 19026] [New process 19027] [New process 19028] [New process 19029] [New process 19030] [New process 19031] [New process 19032] [New process 19033] [New process 19034] [New process 19035] [New process 19036] [New process 19037] [New process 19038] [New process 19039] [New process 19040] [New process 19041] [New process 19042] [New process 19043] [New process 19044] [New process 19045] [New process 18956] #0 0x0000003383c0d2cb in read () from /lib64/libpthread.so.0 (gdb) thread apply all where Thread 84 (process 18956): #0 0x00000033830cc5e2 in select () from /lib64/libc.so.6 #1 0x000000000052a900 in erts_sys_main_thread () at sys/unix/sys.c:3019 #2 0x000000000044d1ef in erl_start (argc=35, argv=) at 
beam/erl_init.c:1330 #3 0x0000000000430429 in main (argc=0, argv=0x0) at sys/unix/erl_main.c:29 Thread 83 (process 19045): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000049b6e2 in sched_cnd_wait (no=16, rq=0x2b035e3bb3b0) at beam/erl_threads.h:632 #2 0x00000000004a1a93 in schedule (p=, calls=) at beam/erl_process.c:6026 #3 0x000000000050eb2d in process_main () at beam/beam_emu.c:1161 #4 0x000000000049f322 in sched_thread_func (vesdp=) at beam/erl_process.c:3060 #5 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #6 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #7 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 82 (process 19044): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000049b6e2 in sched_cnd_wait (no=15, rq=0x2b035e3bb1b0) at beam/erl_threads.h:632 #2 0x00000000004a1a93 in schedule (p=, calls=) at beam/erl_process.c:6026 #3 0x000000000050eb2d in process_main () at beam/beam_emu.c:1161 #4 0x000000000049f322 in sched_thread_func (vesdp=) at beam/erl_process.c:3060 #5 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #6 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #7 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 81 (process 19043): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000049b6e2 in sched_cnd_wait (no=14, rq=0x2b035e3bafb0) at beam/erl_threads.h:632 #2 0x00000000004a1a93 in schedule (p=, calls=) at beam/erl_process.c:6026 #3 0x000000000050eb2d in process_main () at beam/beam_emu.c:1161 #4 0x000000000049f322 in sched_thread_func (vesdp=) at beam/erl_process.c:3060 #5 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #6 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #7 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 80 (process 19042): #0 
0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000049b6e2 in sched_cnd_wait (no=13, rq=0x2b035e3badb0) at beam/erl_threads.h:632 #2 0x00000000004a1a93 in schedule (p=, calls=) at beam/erl_process.c:6026 #3 0x000000000050eb2d in process_main () at beam/beam_emu.c:1161 #4 0x000000000049f322 in sched_thread_func (vesdp=) at beam/erl_process.c:3060 #5 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #6 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #7 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 79 (process 19041): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000049b6e2 in sched_cnd_wait (no=12, rq=0x2b035e3babb0) at beam/erl_threads.h:632 #2 0x00000000004a1a93 in schedule (p=, calls=) at beam/erl_process.c:6026 #3 0x000000000050eb2d in process_main () at beam/beam_emu.c:1161 #4 0x000000000049f322 in sched_thread_func (vesdp=) at beam/erl_process.c:3060 #5 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #6 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #7 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 78 (process 19040): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000049b6e2 in sched_cnd_wait (no=11, rq=0x2b035e3ba9b0) at beam/erl_threads.h:632 #2 0x00000000004a1a93 in schedule (p=, calls=) at beam/erl_process.c:6026 #3 0x000000000050eb2d in process_main () at beam/beam_emu.c:1161 #4 0x000000000049f322 in sched_thread_func (vesdp=) at beam/erl_process.c:3060 #5 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #6 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #7 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 77 (process 19039): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000049b6e2 in sched_cnd_wait (no=10, 
rq=0x2b035e3ba7b0) at beam/erl_threads.h:632 #2 0x00000000004a1a93 in schedule (p=, calls=) at beam/erl_process.c:6026 #3 0x000000000050eb2d in process_main () at beam/beam_emu.c:1161 #4 0x000000000049f322 in sched_thread_func (vesdp=) at beam/erl_process.c:3060 #5 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #6 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #7 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 76 (process 19038): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000049b6e2 in sched_cnd_wait (no=9, rq=0x2b035e3ba5b0) at beam/erl_threads.h:632 #2 0x00000000004a1a93 in schedule (p=, calls=) at beam/erl_process.c:6026 #3 0x000000000050eb2d in process_main () at beam/beam_emu.c:1161 #4 0x000000000049f322 in sched_thread_func (vesdp=) at beam/erl_process.c:3060 #5 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #6 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #7 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 75 (process 19037): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000049b6e2 in sched_cnd_wait (no=8, rq=0x2b035e3ba3b0) at beam/erl_threads.h:632 #2 0x00000000004a1a93 in schedule (p=, calls=) at beam/erl_process.c:6026 #3 0x000000000050eb2d in process_main () at beam/beam_emu.c:1161 #4 0x000000000049f322 in sched_thread_func (vesdp=) at beam/erl_process.c:3060 #5 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #6 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #7 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 74 (process 19036): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000049b6e2 in sched_cnd_wait (no=7, rq=0x2b035e3ba1b0) at beam/erl_threads.h:632 #2 0x00000000004a1a93 in schedule (p=, calls=) at beam/erl_process.c:6026 #3 0x000000000050eb2d in 
process_main () at beam/beam_emu.c:1161 #4 0x000000000049f322 in sched_thread_func (vesdp=) at beam/erl_process.c:3060 #5 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #6 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #7 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 73 (process 19035): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000049b6e2 in sched_cnd_wait (no=6, rq=0x2b035e3b9fb0) at beam/erl_threads.h:632 #2 0x00000000004a1a93 in schedule (p=, calls=) at beam/erl_process.c:6026 #3 0x000000000050eb2d in process_main () at beam/beam_emu.c:1161 #4 0x000000000049f322 in sched_thread_func (vesdp=) at beam/erl_process.c:3060 #5 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #6 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #7 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 72 (process 19034): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000049b6e2 in sched_cnd_wait (no=5, rq=0x2b035e3b9db0) at beam/erl_threads.h:632 #2 0x00000000004a1a93 in schedule (p=, calls=) at beam/erl_process.c:6026 #3 0x000000000050eb2d in process_main () at beam/beam_emu.c:1161 #4 0x000000000049f322 in sched_thread_func (vesdp=) at beam/erl_process.c:3060 #5 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #6 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #7 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 71 (process 19033): #0 0x00000033830d3488 in epoll_wait () from /lib64/libc.so.6 #1 0x0000000000530f04 in erts_poll_wait_kp (ps=0x2b035e2f3d38, pr=0x44b25660, len=0x44b25e7c, utvp=) at sys/common/erl_poll.c:1907 #2 0x0000000000533bdb in erts_check_io_kp (do_wait=) at sys/common/erl_check_io.c:1156 #3 0x000000000049b896 in sched_sys_wait (no=4, rq=0x2b035e3b9bb0) at beam/erl_process.c:785 #4 0x00000000004a1dd2 in schedule (p=, calls=) at 
beam/erl_process.c:6020 #5 0x000000000050eb2d in process_main () at beam/beam_emu.c:1161 #6 0x000000000049f322 in sched_thread_func (vesdp=) at beam/erl_process.c:3060 #7 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #8 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #9 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 70 (process 19032): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000049b6e2 in sched_cnd_wait (no=3, rq=0x2b035e3b99b0) at beam/erl_threads.h:632 #2 0x00000000004a1a93 in schedule (p=, calls=) at beam/erl_process.c:6026 #3 0x000000000050eb2d in process_main () at beam/beam_emu.c:1161 #4 0x000000000049f322 in sched_thread_func (vesdp=) at beam/erl_process.c:3060 #5 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #6 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #7 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 69 (process 19031): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000049b6e2 in sched_cnd_wait (no=2, rq=0x2b035e3b97b0) at beam/erl_threads.h:632 #2 0x00000000004a1a93 in schedule (p=, calls=) at beam/erl_process.c:6026 #3 0x000000000050eb2d in process_main () at beam/beam_emu.c:1161 #4 0x000000000049f322 in sched_thread_func (vesdp=) at beam/erl_process.c:3060 #5 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #6 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #7 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 68 (process 19030): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000049b6e2 in sched_cnd_wait (no=1, rq=0x2b035e3b95b0) at beam/erl_threads.h:632 #2 0x00000000004a1a93 in schedule (p=, calls=) at beam/erl_process.c:6026 #3 0x000000000050eb2d in process_main () at beam/beam_emu.c:1161 #4 0x000000000049f322 in sched_thread_func (vesdp=) at 
beam/erl_process.c:3060 #5 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #6 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #7 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 67 (process 19029): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 66 (process 19028): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 65 (process 19027): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 64 (process 19026): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 63 (process 19025): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 
in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 62 (process 19024): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 61 (process 19023): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 60 (process 19022): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 59 (process 19021): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 58 (process 19020): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 57 
(process 19019): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 56 (process 19018): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 55 (process 19017): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 54 (process 19016): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 53 (process 19015): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 52 (process 19014): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 
0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 51 (process 19013): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 50 (process 19012): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 49 (process 19011): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 48 (process 19010): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 47 (process 19009): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at 
common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 46 (process 19008): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 45 (process 19007): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 44 (process 19006): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 43 (process 19005): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 42 (process 19004): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in 
clone () from /lib64/libc.so.6 Thread 41 (process 19003): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 40 (process 19002): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 39 (process 19001): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 38 (process 19000): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 37 (process 18999): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 36 (process 18998): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 
() from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 35 (process 18997): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 34 (process 18996): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 33 (process 18995): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 32 (process 18994): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 0x0000000000585064 in thr_wrapper (vtwd=) at common/ethread.c:475 #3 0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0 #4 0x00000033830d309d in clone () from /lib64/libc.so.6 Thread 31 (process 18993): #0 0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004e9bdf in async_main (arg=) at beam/erl_threads.h:632 #2 
[... truncated; Threads 4-30 (processes 18966-18992) all show the identical
backtrace below, blocked in pthread_cond_wait called from async_main. Only
one representative is reproduced here.]

Thread 30 (process 18992):
#0  0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000004e9bdf in async_main (arg=<value optimized out>) at beam/erl_threads.h:632
#2  0x0000000000585064 in thr_wrapper (vtwd=<value optimized out>) at common/ethread.c:475
#3  0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0
#4  0x00000033830d309d in clone () from /lib64/libc.so.6

Thread 3 (process 18965):
#0  0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000046df1c in sys_msg_dispatcher_func (unused=<value optimized out>) at beam/erl_threads.h:632
#2  0x0000000000585064 in thr_wrapper (vtwd=<value optimized out>) at common/ethread.c:475
#3  0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0
#4  0x00000033830d309d in clone () from /lib64/libc.so.6

Thread 2 (process 18964):
#0  0x0000003383c0a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000471ee4 in emergency_watchdog (unused=<value optimized out>) at beam/erl_threads.h:632
#2  0x0000000000585064 in thr_wrapper (vtwd=<value optimized out>) at common/ethread.c:475
#3  0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0
#4  0x00000033830d309d in clone () from /lib64/libc.so.6

Thread 1 (process 18963):
#0  0x0000003383c0d2cb in read () from /lib64/libpthread.so.0
#1  0x000000000052b51e in signal_dispatcher_thread_func (unused=<value optimized out>) at sys/unix/sys.c:2913
#2  0x0000000000585064 in thr_wrapper (vtwd=<value optimized out>) at common/ethread.c:475
#3  0x0000003383c06367 in start_thread () from /lib64/libpthread.so.0
#4  0x00000033830d309d in clone () from /lib64/libc.so.6

(gdb)

From fritchie@REDACTED Sun May 16 02:07:01 2010
From: fritchie@REDACTED (Scott Lystig Fritchie)
Date: Sat, 15 May 2010 19:07:01 -0500
Subject: [erlang-bugs] net_kernel hang, perhaps blocked by busy_dist_port race?
In-Reply-To: Message of "Sat, 15 May 2010 16:01:16 CDT." <63794.1273957276@snookles.snookles.com>
Message-ID: <72573.1273968421@snookles.snookles.com>

Following up on my previous message ... I've been able to duplicate this bug, whee!
I'm going to try to create a mostly-automatable recipe to make it easier for others to try to reproduce. -Scott Got msg {monitor,<0.21.0>,busy_dist_port,#Port<0.547>} Got msg {monitor,<0.40.0>,busy_dist_port,#Port<0.547>} Got msg {monitor,<0.40.0>,busy_dist_port,#Port<0.547>} Got msg {monitor,<0.21.0>,busy_dist_port,#Port<0.547>} Got msg {nodedown,goofus@REDACTED, [{nodedown_reason,connection_closed},{node_type,visible}]} User switch command --> s --> c Eshell V5.7.5 (abort with ^G) (bar@REDACTED)1> whereis(net_kernel). <0.21.0> (bar@REDACTED)2> process_info(whereis(net_kernel)). [{registered_name,net_kernel}, {current_function,{erlang,bif_return_trap,1}}, {initial_call,{proc_lib,init_p,5}}, {status,suspended}, {message_queue_len,147}, {messages,[tick,tick,tick,tick, {'EXIT',<0.57.0>,connection_closed}, tick,tick,tick,tick,tick,tick, {accept,<0.22.0>,#Port<0.549>,inet,tcp}, tick,tick,tick,tick,tick,tick,tick,tick,tick|...]}, {links,[<0.23.0>,<0.75.0>,<0.18.0>,<0.22.0>,#Port<0.62>]}, {dictionary,[{'$ancestors',[net_sup,kernel_sup,<0.9.0>]}, {longnames,false}, {'$initial_call',{net_kernel,init,1}}]}, {trap_exit,true}, {error_handler,error_handler}, {priority,max}, {group_leader,<0.8.0>}, {total_heap_size,1974}, {heap_size,1597}, {stack_size,12}, {reductions,502181}, {garbage_collection,[{min_bin_vheap_size,46368}, {min_heap_size,233}, {fullsweep_after,65535}, {minor_gcs,825}]}, {suspending,[]}] (bar@REDACTED)3> Bt1 = process_info(whereis(net_kernel), backtrace). <<"...">> (bar@REDACTED)6> io:format("~s\n", [element(2,Bt1)]). 
Program counter: 0x08243388 (unknown function)
CP: 0xb76cd7b4 (gen_server:reply/2 + 104)
arity = 1
{#Ref<6666.0.0.35773>,yes}

0xb48893ec Return addr 0xb76cf9f4 (gen_server:handle_msg/5 + 424)
y(0)     Catch 0xb76cd7b4 (gen_server:reply/2 + 104)

0xb48893f4 Return addr 0xb76a5258 (proc_lib:init_p_do_apply/3 + 28)
y(0)     net_kernel
y(1)     []
y(2)     net_kernel
y(3)     <0.18.0>
y(4)     []
y(5)     []
y(6)     {state,bar,'bar@REDACTED',shortnames,{tick,<0.23.0>,5000},7000,sys_dist,[{<0.75.0>,'foo@REDACTED'},{<0.57.0>,'goofus@REDACTED'}],[],[{listen,#Port<0.62>,<0.22.0>,{net_address,{{0,0,0,0},48326},"bb3",tcp,inet},inet_tcp_dist}],[],0,all}

0xb4889414 Return addr 0x0824852c ()
y(0)     Catch 0xb76a5268 (proc_lib:init_p_do_apply/3 + 44)
ok

From fritchie@REDACTED Sun May 16 04:51:59 2010
From: fritchie@REDACTED (Scott Lystig Fritchie)
Date: Sat, 15 May 2010 21:51:59 -0500
Subject: [erlang-bugs] net_kernel hang, perhaps blocked by busy_dist_port race?
In-Reply-To: Message of "Sat, 15 May 2010 16:01:16 CDT." <63794.1273957276@snookles.snookles.com>
Message-ID: <80213.1273978319@snookles.snookles.com>

New update: recipe to duplicate.

-Scott

This recipe works for:
    Erlang/OTP R13B04 on Linux kernel 2.6.27.41-170.2.117.fc10.i686
    Erlang/OTP R13B03 on same
    Erlang/OTP R13B02 on same
    Erlang/OTP R13B01 on same
    Erlang/OTP R12B-5 on same
    Erlang/OTP R11B-5 on same

The recipe requires a bit of luck and human intervention (pressing
Control-z at the right moment), but I can get the error to happen within
a few minutes' worth of trying.

Step #1, in terminal #1: Run the following command:

erl -sname foo1 -kernel net_ticktime 20 -eval 'register(foo, self()), [net_adm:ping(bar1@REDACTED) || _ <- lists:seq(1,100000)], erlang:display(done).'
Step #2, in terminal #2: Run the following command:

erl -sname bar1 -kernel net_ticktime 20 -eval 'F = fun(Ff) -> receive X -> io:format("Got msg ~p\n", [X]), Ff(Ff) end end, spawn(fun() -> io:format("I am: ~p\n", [self()]), erlang:system_monitor(self(), [busy_port, busy_dist_port]), net_kernel:monitor_nodes(true, [{node_type, visible}, nodedown_reason]), F(F) end), L1m = lists:seq(1,1000000), [{foo, foo1@REDACTED} ! {bar, L1m} || _ <- lists:seq(1,555000)], erlang:display(done).'

As soon as you start seeing these messages in terminal #2:

Got msg {monitor,<0.2.0>,busy_dist_port,#Port<0.98>}
Got msg {monitor,<0.2.0>,busy_dist_port,#Port<0.98>}
Got msg {monitor,<0.21.0>,busy_dist_port,#Port<0.98>}

... then you're ready for step #3.

NOTE: On my machine, pid <0.2.0> is the process that is executing the
code in the "-eval" flag, and pid <0.21.0> is the 'net_kernel' process.

NOTE: Using different releases of Erlang/OTP, the 'net_kernel' pid may
vary slightly, but it's in the 20's.

Step #3, in terminal #1: When you see the {monitor,<0.21.0>,...} message
in terminal #2, press Control-z in terminal #1. If that message is the
most recent/last message, wait for 20 seconds or more. You probably will
not see a '{nodedown,...}' message in terminal #2. If you weren't fast
or lucky enough, type "fg" here in terminal #1 and then press Control-z
again when you're feeling fast or lucky. You're sending big messages
over to the terminal #1 node, so if you let this run for too long,
you'll run out of memory over there and crash.

Step #4, in terminal #3:

erl -sname goofus1 -kernel net_ticktime 20 -remsh bar1@REDACTED

If you get an error in 10 seconds or less, congratulations!

Erlang R13B04 (erts-5.7.5) [source] [smp:2:2] [rq:2] [async-threads:0] [hipe] [kernel-poll:false]

*** ERROR: Shell process terminated! (^G to start new job) ***

Step #5, in terminal #2: Press Control-g, then type the following at the
prompts:

 --> s
 --> c 2

(bar1@REDACTED)1> process_info(whereis(net_kernel)).
(bar1@REDACTED)3> io:format("~s\n", [element(2,process_info(whereis(net_kernel), backtrace))]). In one of my attempts, I managed to get lucky and also managed to block the 'global_name_server' process, <0.12.0>. (bar1@REDACTED)2> process_info(pid(0,12,0)). [{registered_name,global_name_server}, {current_function,{erlang,bif_return_trap,1}}, {initial_call,{proc_lib,init_p,5}}, {status,suspended}, {message_queue_len,0}, {messages,[]}, {links,[<0.13.0>,<0.14.0>,<0.15.0>,<0.10.0>]}, {dictionary,[{{prot_vsn,foo1@REDACTED},5}, {{sync_tag_my,foo1@REDACTED},{1273,970221,319805}}, {'$ancestors',[kernel_sup,<0.9.0>]}, {{sync_tag_his,foo1@REDACTED},{1273,970221,329161}}, {'$initial_call',{global,init,1}}]}, {trap_exit,true}, {error_handler,error_handler}, {priority,normal}, {group_leader,<0.8.0>}, {total_heap_size,987}, {heap_size,610}, {stack_size,24}, {reductions,789}, {garbage_collection,[{min_bin_vheap_size,46368}, {min_heap_size,233}, {fullsweep_after,65535}, {minor_gcs,5}]}, {suspending,[]}] (bar1@REDACTED)4> io:format("~s\n", [element(2,process_info(pid(0,12,0), backtrace))]). 
Program counter: 0x08243388 (unknown function) CP: 0xb7704738 (global:do_monitor/1 + 172) arity = 1 #Ref<0.0.0.265> 0xb748ab40 Return addr 0xb76fe8ec (global:insert_lock/4 + 100) y(0) [] y(1) <4514.13.0> 0xb748ab4c Return addr 0xb76fe704 (global:handle_set_lock/3 + 236) y(0) [] y(1) [] y(2) {state,true,[],[],[{'foo1@REDACTED',{1273,970221,319805},<0.61.0>}],[],'nonode@REDACTED',<0.13.0>,<0.14.0>,<0.15.0>,no_trace,false} y(3) [{<0.13.0>,<0.13.0>,#Ref<0.0.0.259>}] y(4) <4514.13.0> y(5) {global,[<0.13.0>,<4514.13.0>]} y(6) [<0.13.0>,<4514.13.0>] y(7) global 0xb748ab70 Return addr 0xb76fa488 (global:handle_call/3 + 184) 0xb748ab74 Return addr 0xb769f8e4 (gen_server:handle_msg/5 + 152) 0xb748ab78 Return addr 0xb7675258 (proc_lib:init_p_do_apply/3 + 28) y(0) global y(1) {state,true,[],[],[{'foo1@REDACTED',{1273,970221,319805},<0.61.0>}],[],'nonode@REDACTED',<0.13.0>,<0.14.0>,<0.15.0>,no_trace,false} y(2) global_name_server y(3) <0.10.0> y(4) {set_lock,{global,[<0.13.0>,<4514.13.0>]}} y(5) {<4514.13.0>,{#Ref<4514.0.0.20846>,'bar1@REDACTED'}} y(6) Catch 0xb769f8e4 (gen_server:handle_msg/5 + 152) 0xb748ab98 Return addr 0x0824852c () y(0) Catch 0xb7675268 (proc_lib:init_p_do_apply/3 + 44) ok From fritchie@REDACTED Sun May 16 09:07:53 2010 From: fritchie@REDACTED (Scott Lystig Fritchie) Date: Sun, 16 May 2010 02:07:53 -0500 Subject: [erlang-bugs] net_kernel hang, perhaps blocked by busy_dist_port race? In-Reply-To: Message of "Sat, 15 May 2010 21:51:59 CDT." <80213.1273978319@snookles.snookles.com> Message-ID: <92494.1273993673@snookles.snookles.com> Scott Lystig Fritchie wrote: slf> New update: recipe to duplicate. Nothing like replying to myself again ... so, here's a kludge fix: Allow 'max' priority processes (such as 'net_kernel') to send messages (well, queue them really) on busy distribution ports. 
--- dist.c	2009-11-20 07:29:24.000000000 -0600
+++ dist.c.slf	2010-05-16 01:23:46.000000000 -0500
@@ -1496,7 +1496,7 @@
     dep->qsize += size_obuf(obuf);
     if (dep->qsize >= ERTS_DE_BUSY_LIMIT)
 	dep->qflgs |= ERTS_DE_QFLG_BUSY;
-    if (!force_busy && (dep->qflgs & ERTS_DE_QFLG_BUSY)) {
+    if (!force_busy && (dep->qflgs & ERTS_DE_QFLG_BUSY) && c_p->prio != PRIORITY_MAX) {
 	erts_smp_spin_unlock(&dep->qlock);
 	plp = erts_proclist_create(c_p);

It isn't really specific to net_kernel, but there aren't many processes
(within OTP, at least) that run at max priority and communicate with the
outside world, right? And the worst that could happen would be to have
the port's queue grow past the ERTS_DE_BUSY_LIMIT before a tick timeout
closed the connection (and thus freed the port's queued data), perhaps?

-Scott

From fritchie@REDACTED Mon May 17 08:34:41 2010
From: fritchie@REDACTED (Scott Lystig Fritchie)
Date: Mon, 17 May 2010 01:34:41 -0500
Subject: Why would a <3K heap take 300+ milliseconds to GC?
Message-ID: <59823.1274078081@snookles.snookles.com>

Hi, sorry, this is more of a something-looks-too-weird-to-be-good thing
and not an honest bug. We've witnessed a couple of processes on an
R13B04 node that started taking over 50 milliseconds for a single GC of
a less-than-3K heap ... then stretching to over 300 milliseconds for
(what should be) the same amount of garbage.

Things get weird the longer these two procs run. They gradually start
triggering 'long_gc' events, where the minimum threshold was 50ms. They
reach a plateau of roughly 1150-1350 reports/day/process for a few days,
then the number of reports/day/process goes exponential: 32K reports in
about half of one day, i.e. a GC performed for nearly every timer
message received. The worst single GC time is 349ms. The rest of the VM
was not very busy: less than 20% CPU used on average for an 8 core box.
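For context, 'long_gc' reports like the ones counted below come from
erlang:system_monitor/2. This is only a sketch of how such a monitor can
be set up, not our actual monitoring code; the module and function names
are made up, and the 50 ms threshold is the one mentioned above:

```erlang
%% Sketch only (gc_mon and its functions are assumed names): subscribe
%% to long_gc events over 50 ms and print each {monitor,...} report.
-module(gc_mon).
-export([start/0]).

start() ->
    spawn(fun() ->
                  %% Deliver a message here whenever any process's GC
                  %% takes longer than 50 milliseconds.
                  erlang:system_monitor(self(), [{long_gc, 50}]),
                  loop()
          end).

loop() ->
    receive
        {monitor, Pid, long_gc, Info} ->
            %% Info is a list like [{timeout,57},{heap_size,26},...]
            io:format("Got msg ~p~n", [{monitor, Pid, long_gc, Info}]),
            loop()
    end.
```

Any process whose GC exceeds the threshold then shows up as a
{monitor, Pid, long_gc, Info} message of the kind sampled below.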
On May 14th, the day that the long_gc reporting happened nearly
once/sec, each report lasting 150-270 milliseconds, average CPU
consumption increased only very slightly.

Does this ring a bell? It's really, really strange behavior which (we
think) culminated in some behavior unbecoming of a well-behaved virtual
machine. If this was a legit signal that something was going wrong, it's
worth pursuing IMHO. I can provide the code for the gen_server if it'd
be helpful.

-Scott

Distribution of long_gc times
=============================
Frequency  Range
---------  -----
     2543  50-100 milliseconds
    21003  100-199 milliseconds
    22704  200-299 milliseconds
      417  300-349 milliseconds

What the proc does
==================
The process receives a message once/second from an external timer that
tells it to net_adm:ping/1 a remote node and then rpc:call/4 to the
application controller to get the list of apps running on the remote
node. The proc's #state record is roughly 6 fields, contains no
binaries, and only 1 field is ever updated (with a boolean) on each
iteration. The application being monitored is never running, so the same
code path is taken (and thus (hopefully) the same amount of garbage)
each time.

Count of # of long_gc events per day, plus random sample of monitor message
===========================================================================
NOTE: There were two processes that were reporting these odd long_gc
events, <0.283.0> and <0.285.0>. These are the counts for only one of
those two procs.

foreach i ( ??
) ## directories per day, numbered 03-14 echo -n $i " " cat $i/*/*/* | egrep long_gc | egrep '0.283.0' | wc -l cat $i/*/*/* | egrep long_gc | tail -1 echo "" end 03 0 04 0 05 7 {monitor,<0.283.0>,long_gc,[{timeout,57},{old_heap_block_size,2584},{heap_block_size,987},{mbuf_size,0},{stack_size,17},{old_heap_size,1387},{heap_size,26}]} 06 341 {monitor,<0.285.0>,long_gc,[{timeout,52},{old_heap_block_size,2584},{heap_block_size,2584},{mbuf_size,0},{stack_size,27},{old_heap_size,79},{heap_size,32}]} 07 1195 {monitor,<0.283.0>,long_gc,[{timeout,95},{old_heap_block_size,2584},{heap_block_size,1597},{mbuf_size,0},{stack_size,17},{old_heap_size,1387},{heap_size,26}]} 08 1152 {monitor,<0.285.0>,long_gc,[{timeout,106},{old_heap_block_size,2584},{heap_block_size,2584},{mbuf_size,0},{stack_size,27},{old_heap_size,79},{heap_size,35}]} 09 1238 {monitor,<0.285.0>,long_gc,[{timeout,162},{old_heap_block_size,2584},{heap_block_size,987},{mbuf_size,0},{stack_size,32},{old_heap_size,79},{heap_size,57}]} 10 1324 {monitor,<0.283.0>,long_gc,[{timeout,121},{old_heap_block_size,2584},{heap_block_size,987},{mbuf_size,0},{stack_size,34},{old_heap_size,1387},{heap_size,36}]} 11 1332 {monitor,<0.283.0>,long_gc,[{timeout,150},{old_heap_block_size,2584},{heap_block_size,1597},{mbuf_size,0},{stack_size,32},{old_heap_size,1387},{heap_size,32}]} 12 1340 {monitor,<0.283.0>,long_gc,[{timeout,208},{old_heap_block_size,2584},{heap_block_size,377},{mbuf_size,0},{stack_size,14},{old_heap_size,1387},{heap_size,34}]} 13 6198 {monitor,<0.283.0>,long_gc,[{timeout,174},{old_heap_block_size,2584},{heap_block_size,987},{mbuf_size,0},{stack_size,25},{old_heap_size,1387},{heap_size,608}]} 14 32540 {monitor,<0.283.0>,long_gc,[{timeout,185},{old_heap_block_size,2584},{heap_block_size,987},{mbuf_size,0},{stack_size,25},{old_heap_size,1387},{heap_size,608}]} Stack backtrace of <0.283.0> ============================ =proc:<0.283.0> State: Scheduled Spawned as: proc_lib:init_p/5 Spawned by: <0.92.0> Started: Fri Mar 26 
13:26:55 2010 Message queue length: 1 Message queue: [check_status] Number of heap fragments: 0 Heap fragment data: 0 Link list: [] Dictionary: [{'$initial_call',{brick_clientmon,init,1}},{i_am_monitoring,{'down_app@REDACTED',gdss}},{'$ancestors',[brick_mon_sup,brick_admin_sup,brick_sup,<0.88.0>]}] Reductions: 472532540 Stack+heap: 987 OldHeap: 2584 Heap unused: 373 OldHeap unused: 2584 Stack dump: Program counter: 0x00002aaaaab9f2c8 (gen_server:loop/6 + 288) CP: 0x0000000000000000 (invalid) arity = 0 0x00002aaac7d6c120 Return addr 0x00002aaaace58d70 (proc_lib:init_p_do_apply/3 + 56) y(0) [] y(1) infinity y(2) brick_clientmon y(3) {state,'down_app@REDACTED',gdss,#Fun,#Fun,#Ref<0.0.0.2895>,false} y(4) <0.283.0> y(5) <0.92.0> 0x00002aaac7d6c158 Return addr 0x0000000000867be8 () y(0) Catch 0x00002aaaace58d90 (proc_lib:init_p_do_apply/3 + 88) From fritchie@REDACTED Wed May 19 19:42:56 2010 From: fritchie@REDACTED (Scott Lystig Fritchie) Date: Wed, 19 May 2010 12:42:56 -0500 Subject: [erlang-bugs] net_kernel hang, perhaps blocked by busy_dist_port race? In-Reply-To: Message of "Sun, 16 May 2010 02:07:53 CDT." <92494.1273993673@snookles.snookles.com> Message-ID: <38134.1274290976@snookles.snookles.com> {tap} {tap} Is this microphone on? {tap} So, another idea to avoid blocking net_kernel would be an Erlang-only fix: all handle_call() replies would be sent by spawning a new process that calls gen_server:reply(). However, in the recipe that I posted over the weekend, I've also managed to block one or two of global's processes: {registered_name,global_name_server} which is usually <0.12.0> under R13B04 and also <0.13.0> which also appears to be related to global. -Scott From hans.bolinder@REDACTED Fri May 21 08:40:04 2010 From: hans.bolinder@REDACTED (Hans Bolinder) Date: Fri, 21 May 2010 08:40:04 +0200 Subject: [erlang-bugs] net_kernel hang, perhaps blocked by busy_dist_port race? 
In-Reply-To: <80213.1273978319@snookles.snookles.com>
References: <63794.1273957276@snookles.snookles.com> <80213.1273978319@snookles.snookles.com>
Message-ID: <19446.10948.910671.108710@ornendil.du.uab.ericsson.se>

[Scott Lystig Fritchie:]
> New update: recipe to duplicate.

Great work. Much appreciated! We've been able to reproduce the scenario
you describe.

Best regards,
Hans Bolinder, Erlang/OTP team, Ericsson

From pguyot@REDACTED Fri May 21 10:37:07 2010
From: pguyot@REDACTED (Paul Guyot)
Date: Fri, 21 May 2010 10:37:07 +0200
Subject: Code path is not updated if no module is loaded in the appup
Message-ID: <2183FE2B-466E-4DA9-AF18-09F6FB0627EC@kallisys.net>

Hello,

With R13B04 (and earlier), we noticed that the code path is not updated
when installing a release if no module is to be added/deleted/reloaded
in the .appup file. This is a problem because a release can consist of
an update to non-code files only, such as MIBs, resources in the priv
directory, etc.

For example, if the appup is the following:

{"16",
 [{"15", [ ]} ],
 [{"15", [ ]} ]
}.

The generated relup is the following:

{"802",[{"801",[],[point_of_no_return]}],[{"801",[],[point_of_no_return]}]}.

application:which_applications returns that the application version "16"
is installed and running. However, code:priv_dir and code:get_path refer
to version "15" of the application.

Regards,
Paul

From fritchie@REDACTED Fri May 21 17:44:06 2010
From: fritchie@REDACTED (Scott Lystig Fritchie)
Date: Fri, 21 May 2010 10:44:06 -0500
Subject: [erlang-bugs] net_kernel hang, perhaps blocked by busy_dist_port race?
In-Reply-To: Message of "Fri, 21 May 2010 08:40:04 +0200." <19446.10948.910671.108710@ornendil.du.uab.ericsson.se>
Message-ID: <74337.1274456646@snookles.snookles.com>

Hans Bolinder wrote:
>> New update: recipe to duplicate.

hb> Great work. Much appreciated!
hb> We've been able to reproduce the scenario you describe.

Cool. Attached is another idea for a fix.
Instead of a VM fix, it patches net_kernel.erl to avoid direct replies by the 'net_kernel' process. It's perhaps better by not mucking with the VM, perhaps worse because it isn't clear if the same port blocking + process unscheduling for other processes such as 'global_group' could cause similar problems? -Scott -------------- next part -------------- --- /usr/local/src/erlang/otp_src_R13B04/lib/kernel/src/net_kernel.erl.orig 2009-11-20 07:29:33.000000000 -0600 +++ ./net_kernel.erl 2010-05-20 18:21:34.000000000 -0500 @@ -354,13 +354,13 @@ %% The response is delayed until the connection is up and %% running. %% -handle_call({connect, _, Node}, _From, State) when Node =:= node() -> - {reply, true, State}; +handle_call({connect, _, Node}, From, State) when Node =:= node() -> + async_reply({reply, true, State}, From); handle_call({connect, Type, Node}, From, State) -> verbose({connect, Type, Node}, 1, State), case ets:lookup(sys_dist, Node) of [Conn] when Conn#connection.state =:= up -> - {reply, true, State}; + async_reply({reply, true, State}, From); [Conn] when Conn#connection.state =:= pending -> Waiting = Conn#connection.waiting, ets:insert(sys_dist, Conn#connection{waiting = [From|Waiting]}), @@ -376,19 +376,19 @@ {noreply,State#state{conn_owners=Owners}}; _ -> ?connect_failure(Node, {setup_call, failed}), - {reply, false, State} + async_reply({reply, false, State}, From) end end; %% %% Close the connection to Node. %% -handle_call({disconnect, Node}, _From, State) when Node =:= node() -> - {reply, false, State}; -handle_call({disconnect, Node}, _From, State) -> +handle_call({disconnect, Node}, From, State) when Node =:= node() -> + async_reply({reply, false, State}, From); +handle_call({disconnect, Node}, From, State) -> verbose({disconnect, Node}, 1, State), {Reply, State1} = do_disconnect(Node, State), - {reply, Reply, State1}; + async_reply({reply, Reply, State1}, From); %% %% The spawn/4 BIF ends up here. @@ -411,39 +411,39 @@ %% %% Only allow certain nodes. 
%% -handle_call({allow, Nodes}, _From, State) -> +handle_call({allow, Nodes}, From, State) -> case all_atoms(Nodes) of true -> Allowed = State#state.allowed, - {reply,ok,State#state{allowed = Allowed ++ Nodes}}; + async_reply({reply,ok,State#state{allowed = Allowed ++ Nodes}}, From); false -> - {reply,error,State} + async_reply({reply,error,State}, From) end; %% %% authentication, used by auth. Simply works as this: %% if the message comes through, the other node IS authorized. %% -handle_call({is_auth, _Node}, _From, State) -> - {reply,yes,State}; +handle_call({is_auth, _Node}, From, State) -> + async_reply({reply,yes,State}, From); %% %% Not applicable any longer !? %% handle_call({apply,_Mod,_Fun,_Args}, {From,Tag}, State) when is_pid(From), node(From) =:= node() -> - gen_server:reply({From,Tag}, not_implemented), + async_gen_server_reply({From,Tag}, not_implemented), % Port = State#state.port, % catch apply(Mod,Fun,[Port|Args]), {noreply,State}; -handle_call(longnames, _From, State) -> - {reply, get(longnames), State}; +handle_call(longnames, From, State) -> + async_reply({reply, get(longnames), State}, From); -handle_call({update_publish_nodes, Ns}, _From, State) -> - {reply, ok, State#state{publish_on_nodes = Ns}}; +handle_call({update_publish_nodes, Ns}, From, State) -> + async_reply({reply, ok, State#state{publish_on_nodes = Ns}}, From); -handle_call({publish_on_node, Node}, _From, State) -> +handle_call({publish_on_node, Node}, From, State) -> NewState = case State#state.publish_on_nodes of undefined -> State#state{publish_on_nodes = @@ -457,11 +457,11 @@ Nodes -> lists:member(Node, Nodes) end, - {reply, Publish, NewState}; + async_reply({reply, Publish, NewState}, From); -handle_call({verbose, Level}, _From, State) -> - {reply, State#state.verbose, State#state{verbose = Level}}; +handle_call({verbose, Level}, From, State) -> + async_reply({reply, State#state.verbose, State#state{verbose = Level}}, From); %% %% Set new ticktime @@ -471,16 +471,16 @@ %% 
#tick_change{} record if the ticker process has been upgraded; %% otherwise, an integer or an atom. -handle_call(ticktime, _, #state{tick = #tick{time = T}} = State) -> - {reply, T, State}; -handle_call(ticktime, _, #state{tick = #tick_change{time = T}} = State) -> - {reply, {ongoing_change_to, T}, State}; +handle_call(ticktime, From, #state{tick = #tick{time = T}} = State) -> + async_reply({reply, T, State}, From); +handle_call(ticktime, From, #state{tick = #tick_change{time = T}} = State) -> + async_reply({reply, {ongoing_change_to, T}, State}, From); -handle_call({new_ticktime,T,_TP}, _, #state{tick = #tick{time = T}} = State) -> +handle_call({new_ticktime,T,_TP}, From, #state{tick = #tick{time = T}} = State) -> ?tckr_dbg(no_tick_change), - {reply, unchanged, State}; + async_reply({reply, unchanged, State}, From); -handle_call({new_ticktime,T,TP}, _, #state{tick = #tick{ticker = Tckr, +handle_call({new_ticktime,T,TP}, From, #state{tick = #tick{ticker = Tckr, time = OT}} = State) -> ?tckr_dbg(initiating_tick_change), start_aux_ticker(T, OT, TP), @@ -493,14 +493,14 @@ ?tckr_dbg(shorter_ticktime), shorter end, - {reply, change_initiated, State#state{tick = #tick_change{ticker = Tckr, + async_reply({reply, change_initiated, State#state{tick = #tick_change{ticker = Tckr, time = T, - how = How}}}; + how = How}}}, From); -handle_call({new_ticktime,_,_}, _, +handle_call({new_ticktime,_,_}, From, #state{tick = #tick_change{time = T}} = State) -> - {reply, {ongoing_change_to, T}, State}. + async_reply({reply, {ongoing_change_to, T}, State}, From). %% ------------------------------------------------------------ %% handle_cast. 
@@ -1079,11 +1079,11 @@ spawn_func(link,{From,Tag},M,F,A,Gleader) -> link(From), - gen_server:reply({From,Tag},self()), %% ahhh + async_gen_server_reply({From,Tag},self()), %% ahhh group_leader(Gleader,self()), apply(M,F,A); spawn_func(_,{From,Tag},M,F,A,Gleader) -> - gen_server:reply({From,Tag},self()), %% ahhh + async_gen_server_reply({From,Tag},self()), %% ahhh group_leader(Gleader,self()), apply(M,F,A). @@ -1409,7 +1409,7 @@ reply_waiting1(lists:reverse(Waiting), Rep). reply_waiting1([From|W], Rep) -> - gen_server:reply(From, Rep), + async_gen_server_reply(From, Rep), reply_waiting1(W, Rep); reply_waiting1([], _) -> ok. @@ -1511,3 +1511,10 @@ getnode(P) when is_pid(P) -> node(P); getnode(P) -> P. + +async_reply({reply, Msg, State}, From) -> + async_gen_server_reply(From, Msg), + {noreply, State}. + +async_gen_server_reply(From, Msg) -> + spawn(fun() -> gen_server:reply(From, Msg) end). From bob@REDACTED Wed May 26 02:00:39 2010 From: bob@REDACTED (Bob Ippolito) Date: Tue, 25 May 2010 17:00:39 -0700 Subject: R13B04 inet_res:resolve/4 inet_udp Port leak Message-ID: It appears that there may be an inet_udp Port leak in inet_res:resolve/4, our current workaround is to spawn a new process to call this function. We've noticed this primarily for a service that regularly does a UDP DNS query that fails (because the response is too big) and then we retry over TCP. This is what the state of the process looked like when it was leaking ports: (node@REDACTED)1> length(lists:filter(fun erlang:is_port/1, element(2, erlang:process_info(whereis(dns_gen_server), links)))). 577 (node@REDACTED)2> lists:usort([erlang:port_info(P, name) || P <- lists:filter(fun erlang:is_port/1, element(2, erlang:process_info(whereis(dns_gen_server), links)))]). [{name,"udp_inet"}] The code looked like this, before the workaround was implemented: %% @spec dns(string()) -> [string()] %% @doc Return the A records (IPv4 IPs) as strings for the given Host name. 
%% This may return an empty list if there are no A records for this Host name.
dns(Host) when is_list(Host) ->
    dns(Host, fun inet_res:resolve/4).

dns(Host, ResolveFun) ->
    case ResolveFun(Host, in, a, []) of
        {ok, Msg} ->
            ips_for_answers(Msg);
        {error, {nxdomain, _}} ->
            [];
        {error, timeout} ->
            %% retry with TCP
            case ResolveFun(Host, in, a, [{usevc, true}]) of
                {ok, Msg} ->
                    ips_for_answers(Msg);
                {error, {nxdomain, _}} ->
                    [];
                Error = {error, _} ->
                    Error
            end;
        Error = {error, _} ->
            Error
    end.

ips_for_answers(Msg) ->
    [inet_parse:ntoa(inet_dns:rr(Answer, data))
     || Answer <- inet_dns:msg(Msg, anlist)].

The workaround we used was to call it indirectly with this function; I couldn't find anything in OTP that did the same thing that didn't have local call optimizations.

%% @spec process_apply(atom(), atom(), [term()]) -> term()
%% @doc erlang:apply(M, F, A) in a temporary process and return the results.
process_apply(M,F,A) ->
    %% We can't just use rpc here because there's a local call optimization.
    Parent = self(),
    Fun = fun () ->
                  try
                      Parent ! {self(), erlang:apply(M, F, A)}
                  catch
                      Class:Reason ->
                          Stacktrace = erlang:get_stacktrace(),
                          Parent ! {self(), Class, Reason, Stacktrace}
                  end
          end,
    {Pid, Ref} = erlang:spawn_monitor(Fun),
    receive
        {Pid, Res} ->
            receive {'DOWN', Ref, process, Pid, _} -> ok end,
            Res;
        {Pid, Class, Reason, Stacktrace} ->
            receive {'DOWN', Ref, process, Pid, _} -> ok end,
            erlang:error(erlang:raise(Class, Reason, Stacktrace));
        {'DOWN', Ref, process, Pid, Reason} ->
            erlang:exit(Reason)
    end.

From zl9d97p02@REDACTED Wed May 26 02:56:30 2010
From: zl9d97p02@REDACTED (Simon Cornish)
Date: Tue, 25 May 2010 17:56:30 -0700
Subject: R12 emulator crashes with zero-length port_control binary
In-Reply-To: 
References: 
Message-ID: <25026-1274835390-951678@sneakemail.com>

If a linked-in driver returns 0 to a port_control call and PORT_CONTROL_FLAG_BINARY is set then the beam emulator will probably crash or otherwise misbehave.
Attached is a patch for those who are stuck on R12 and might get bitten by this. Tested on R12B-3, applies also to R12B-5. It's already fixed (in a different way) in R13+

/Simon

-------------- next part --------------
A non-text attachment was scrubbed...
Name: io.c.patch
Type: application/octet-stream
Size: 572 bytes
Desc: not available
URL: 

From raimo+erlang-bugs@REDACTED Wed May 26 11:06:26 2010
From: raimo+erlang-bugs@REDACTED (Raimo Niskanen)
Date: Wed, 26 May 2010 11:06:26 +0200
Subject: [erlang-bugs] R13B04 inet_res:resolve/4 inet_udp Port leak
In-Reply-To: 
References: 
Message-ID: <20100526090626.GA17931@erix.ericsson.se>

By reading the code it seems there is a bug when all nameservers return an answer that causes decode errors, or can not be contacted (enetunreach or econnrefused); then an UDP port (or maybe two; one inet and one inet6) is leaked since the inet_res:udp_close/1 is not called.

This should be fixed with:

diff --git a/lib/kernel/src/inet_res.erl b/lib/kernel/src/inet_res.erl
index 9b9e078..3d38a01 100644
--- a/lib/kernel/src/inet_res.erl
+++ b/lib/kernel/src/inet_res.erl
@@ -592,6 +592,7 @@ query_retries(_Q, _NSs, _Timer, Retry, Retry, S) ->
 query_retries(Q, NSs, Timer, Retry, I, S0) ->
     Num = length(NSs),
     if Num =:= 0 ->
+           udp_close(S),
            {error,timeout};
        true ->
            case query_nss(Q, NSs, Timer, Retry, I, S0, []) of

This "retry with TCP" trick of yours should really not be necessary since inet_res retries with TCP if it gets a truncated UDP answer. Have you got some other case when retrying with TCP is essential?

Or, does your DNS server produce a (valid?) result that triggers a debug bug in inet_res, causing the decode error, triggering the port leak bug, forcing you to retry with TCP?

On Tue, May 25, 2010 at 05:00:39PM -0700, Bob Ippolito wrote:
> It appears that there may be an inet_udp Port leak in
> inet_res:resolve/4, our current workaround is to spawn a new process
> to call this function.
-- / Raimo Niskanen, Erlang/OTP, Ericsson AB

From bob@REDACTED Wed May 26 17:02:48 2010
From: bob@REDACTED (Bob Ippolito)
Date: Wed, 26 May 2010 08:02:48 -0700
Subject: [erlang-bugs] R13B04 inet_res:resolve/4 inet_udp Port leak
In-Reply-To: <20100526090626.GA17931@erix.ericsson.se>
References: <20100526090626.GA17931@erix.ericsson.se>
Message-ID: 

Well, I'm not sure exactly which scenario is happening because I haven't looked at the packets yet, but the manual TCP retry is required.

mochi@REDACTED:~$ /mochi/opt/erlang-R13B04/bin/erl
Erlang R13B04 (erts-5.7.5) [source] [64-bit] [smp:8:8] [rq:8] [async-threads:4] [hipe] [kernel-poll:true]

Eshell V5.7.5  (abort with ^G)
1> lists:filter(fun erlang:is_port/1, element(2, erlang:process_info(self(), links))).
[]
2> inet_res:resolve("mochisvn.erl.mochimedia.net", in, a, []).
{error,timeout}
3> lists:filter(fun erlang:is_port/1, element(2, erlang:process_info(self(), links))).
[#Port<0.514>]
4> element(1, inet_res:resolve("mochisvn.erl.mochimedia.net", in, a, [usevc])).
ok
From bob@REDACTED Wed May 26 20:59:26 2010
From: bob@REDACTED (Bob Ippolito)
Date: Wed, 26 May 2010 11:59:26 -0700
Subject: [erlang-bugs] R13B04 inet_res:resolve/4 inet_udp Port leak
In-Reply-To: 
References: <20100526090626.GA17931@erix.ericsson.se>
Message-ID: 

Here's the DNS packet that is being received as a response to the query:

1> inet_dns:decode(<<0,1,131,128,0,1,0,60,0,0,0,0,8,109,111,99,104,105,115,118,110,
3,101,114,108,10,109,111,99,104,105,109,101,100,105,97,3,110,
101,116,0,0,1,0,1>>).
{error,fmt}
From zhangjr2009@REDACTED Wed May 26 22:36:45 2010
From: zhangjr2009@REDACTED (JR Zhang)
Date: Wed, 26 May 2010 22:36:45 +0200
Subject: Problem with function ethr_rwmutex_tryrlock
Message-ID: 

Hi list,

I think the fallback version of function ethr_rwmutex_tryrlock in erts/lib_src/common/ethread.c is not correct. This function should be similar to pthread_rwlock_tryrdlock. For pthread_rwlock_tryrdlock, the calling thread acquires the read lock if a writer does not hold the lock and there are no writers blocked on the lock. But as the following code shows, ethr_rwmutex_tryrlock doesn't get the lock when there are no waiting writers, and acquires the lock when there are waiting writers. Am I right?
ethr_rwmutex_tryrlock(ethr_rwmutex *rwmtx)
{
    int res;
#if ETHR_XCHK
    if (!rwmtx || rwmtx->initialized != ETHR_RWMUTEX_INITIALIZED) {
        ASSERT(0);
        return EINVAL;
    }
#endif
    res = ethr_mutex_trylock__(&rwmtx->mtx);
    if (res != 0)
        return res;
    if (!rwmtx->waiting_writers) {
        res = ethr_mutex_unlock__(&rwmtx->mtx);
        if (res == 0)
            return EBUSY;
        return res;
    }
    rwmtx->readers++;
    return ethr_mutex_unlock__(&rwmtx->mtx);
}

Best Regards,
Jianrong Zhang

From raimo+erlang-bugs@REDACTED Thu May 27 09:43:54 2010
From: raimo+erlang-bugs@REDACTED (Raimo Niskanen)
Date: Thu, 27 May 2010 09:43:54 +0200
Subject: [erlang-bugs] R13B04 inet_res:resolve/4 inet_udp Port leak
In-Reply-To: 
References: <20100526090626.GA17931@erix.ericsson.se>
Message-ID: <20100527074354.GA5584@erix.ericsson.se>

On Wed, May 26, 2010 at 11:59:26AM -0700, Bob Ippolito wrote:
> Here's the DNS packet that is being received as a response to the query:
>
> 1> inet_dns:decode(<<0,1,131,128,0,1,0,60,0,0,0,0,8,109,111,99,104,105,115,118,110,
> 3,101,114,108,10,109,111,99,104,105,109,101,100,105,97,3,110,
> 101,116,0,0,1,0,1>>).
> {error,fmt}

Thank you very much! I was about to give you detailed instructions about how to dig that up :-)

I have spotted the problem.

The DNS reply packet has got the TC (TrunCation) bit set and claims to contain 60 answer records, but actually contains zero. inet_dns expects to find 60 answer records if it says so. This is a hazy part of the DNS specifications and the resolver I tested truncation on did not do this kind of self-contradiction, but it _may_ be allowed by the specification...

I regard it as a bug (or at least need-to-fix-problem) in inet_dns since it should be real-world compatible not just specification compatible. It should allow record shortage in a section if the TC bit is set. I'll try to fix it in R14A.

Can you try my patch adding a missing udp_close(S) to see if it stops the leaking port problem? That is a more serious bug.
-- / Raimo Niskanen, Erlang/OTP, Ericsson AB

From raimo+erlang-bugs@REDACTED Thu May 27 12:02:48 2010
From: raimo+erlang-bugs@REDACTED (Raimo Niskanen)
Date: Thu, 27 May 2010 12:02:48 +0200
Subject: [erlang-bugs] R13B04 inet_res:resolve/4 inet_udp Port leak
In-Reply-To: <20100527074354.GA5584@erix.ericsson.se>
References: <20100526090626.GA17931@erix.ericsson.se> <20100527074354.GA5584@erix.ericsson.se>
Message-ID: <20100527100248.GA3917@erix.ericsson.se>

I have created a fix for these problems:

git fetch git://github.com/RaimoNiskanen/otp.git rn/resolver-leaking-ports

It will be included in 'pu'. Unfortunately, the second commit eliminates the bug trigger for what the first commit fixes. So to test if the bug fix is fixing the bug, one should apply the first commit only.
This is a hazy part of the > DNS specifications and the resolver I tested truncation on did > not do this kind of self-contradiction, but it _may_ be allowed > by the specification... > > I regard it as a bug (or at least need-to-fix-problem) in inet_dns > since it should be real-world compatible not just specification compatible. > It should allow record shortage in a section if the TC bit is set. > I'll try to fix it in R14A. > > Can you try my patch adding a missing udp_close(S) to see > if it stops the leaking port problem? That is a more serious bug. > > > > > On Wed, May 26, 2010 at 8:02 AM, Bob Ippolito wrote: > > > Well, I'm not sure exactly which scenario is happening because I > > > haven't looked at the packets yet, but the manual TCP retry is > > > required. > > > > > > mochi@REDACTED:~$ /mochi/opt/erlang-R13B04/bin/erl > > > Erlang R13B04 (erts-5.7.5) [source] [64-bit] [smp:8:8] [rq:8] > > > [async-threads:4] [hipe] [kernel-poll:true] > > > > > > Eshell V5.7.5 ?(abort with ^G) > > > 1> lists:filter(fun erlang:is_port/1, element(2, > > > erlang:process_info(self(), links))). > > > [] > > > 2> inet_res:resolve("mochisvn.erl.mochimedia.net", in, a, []). > > > {error,timeout} > > > 3> lists:filter(fun erlang:is_port/1, element(2, > > > erlang:process_info(self(), links))). > > > [#Port<0.514>] > > > 4> element(1, inet_res:resolve("mochisvn.erl.mochimedia.net", in, a, [usevc])). > > > ok > > > > > > > > > On Wed, May 26, 2010 at 2:06 AM, Raimo Niskanen > > > wrote: > > >> By reading the code it seems there is a bug when all nameservers > > >> return an answer that causes decode errors, or can not be > > >> contacted (enetunreach or econnrefused); then an > > >> UDP port (or maybe two; one inet and one inet6) is leaked > > >> since the inet_res:udp_close/1 is not called. 
> > >> > > >> This should be fixed with: > > >> > > >> diff --git a/lib/kernel/src/inet_res.erl b/lib/kernel/src/inet_res.erl > > >> index 9b9e078..3d38a01 100644 > > >> --- a/lib/kernel/src/inet_res.erl > > >> +++ b/lib/kernel/src/inet_res.erl > > >> @@ -592,6 +592,7 @@ query_retries(_Q, _NSs, _Timer, Retry, Retry, S) -> > > >> ?query_retries(Q, NSs, Timer, Retry, I, S0) -> > > >> ? ? Num = length(NSs), > > >> ? ? if Num =:= 0 -> > > >> + ? ? ? ? ? udp_close(S), > > >> ? ? ? ? ? ?{error,timeout}; > > >> ? ? ? ?true -> > > >> ? ? ? ? ? ?case query_nss(Q, NSs, Timer, Retry, I, S0, []) of > > >> > > >> This "retry with TCP" trick of yours should really not be necessary > > >> since inet_res retries with TCP if it gets a truncated UDP answer. > > >> Have you got some other case when retrying with TCP is essential? > > >> > > >> Or, does your DNS server produce a (valid?) result that > > >> triggers a debug bug in inet_res, causing the decode error, > > >> triggering the port leak bug, forcing you to retry with TCP? > > >> > > >> On Tue, May 25, 2010 at 05:00:39PM -0700, Bob Ippolito wrote: > > >>> It appears that there may be an inet_udp Port leak in > > >>> inet_res:resolve/4, our current workaround is to spawn a new process > > >>> to call this function. We've noticed this primarily for a service that > > >>> regularly does a UDP DNS query that fails (because the response is too > > >>> big) and then we retry over TCP. > > >>> > > >>> This is what the state of the process looked like when it was leaking ports: > > >>> > > >>> (node@REDACTED)1> length(lists:filter(fun erlang:is_port/1, element(2, > > >>> erlang:process_info(whereis(dns_gen_server), links)))). > > >>> 577 > > >>> (node@REDACTED)2> lists:usort([erlang:port_info(P, name) || P <- > > >>> lists:filter(fun erlang:is_port/1, element(2, > > >>> erlang:process_info(whereis(dns_gen_server), links)))]). 
> > >>> [{name,"udp_inet"}]
> > >>>
> > >>> The code looked like this, before the workaround was implemented:
> > >>>
> > >>> %% @spec dns(string()) -> [string()]
> > >>> %% @doc Return the A records (IPv4 IPs) as strings for the given Host name.
> > >>> %%      This may return an empty list if there are no A records for this Host name.
> > >>> dns(Host) when is_list(Host) ->
> > >>>     dns(Host, fun inet_res:resolve/4).
> > >>>
> > >>> dns(Host, ResolveFun) ->
> > >>>     case ResolveFun(Host, in, a, []) of
> > >>>         {ok, Msg} ->
> > >>>             ips_for_answers(Msg);
> > >>>         {error, {nxdomain, _}} ->
> > >>>             [];
> > >>>         {error, timeout} ->
> > >>>             %% retry with TCP
> > >>>             case ResolveFun(Host, in, a, [{usevc, true}]) of
> > >>>                 {ok, Msg} ->
> > >>>                     ips_for_answers(Msg);
> > >>>                 {error, {nxdomain, _}} ->
> > >>>                     [];
> > >>>                 Error = {error, _} ->
> > >>>                     Error
> > >>>             end;
> > >>>         Error = {error, _} ->
> > >>>             Error
> > >>>     end.
> > >>>
> > >>> ips_for_answers(Msg) ->
> > >>>     [inet_parse:ntoa(inet_dns:rr(Answer, data))
> > >>>      || Answer <- inet_dns:msg(Msg, anlist)].
> > >>>
> > >>> The workaround we used was to call it indirectly with this function; I
> > >>> couldn't find anything in OTP that did the same thing that didn't have
> > >>> local call optimizations.
> > >>>
> > >>> %% @spec process_apply(atom(), atom(), [term()]) -> term()
> > >>> %% @doc erlang:apply(M, F, A) in a temporary process and return the results.
> > >>> process_apply(M,F,A) ->
> > >>>     %% We can't just use rpc here because there's a local call optimization.
> > >>>     Parent = self(),
> > >>>     Fun = fun () ->
> > >>>                   try
> > >>>                       Parent ! {self(), erlang:apply(M, F, A)}
> > >>>                   catch
> > >>>                       Class:Reason ->
> > >>>                           Stacktrace = erlang:get_stacktrace(),
> > >>>                           Parent ! {self(), Class, Reason, Stacktrace}
> > >>>                   end
> > >>>           end,
> > >>>     {Pid, Ref} = erlang:spawn_monitor(Fun),
> > >>>     receive
> > >>>         {Pid, Res} ->
> > >>>             receive {'DOWN', Ref, process, Pid, _} -> ok end,
> > >>>             Res;
> > >>>         {Pid, Class, Reason, Stacktrace} ->
> > >>>             receive {'DOWN', Ref, process, Pid, _} -> ok end,
> > >>>             erlang:error(erlang:raise(Class, Reason, Stacktrace));
> > >>>         {'DOWN', Ref, process, Pid, Reason} ->
> > >>>             erlang:exit(Reason)
> > >>>     end.
> > >>>
> > >>> ________________________________________________________________
> > >>> erlang-bugs (at) erlang.org mailing list.
> > >>> See http://www.erlang.org/faq.html
> > >>> To unsubscribe; mailto:erlang-bugs-unsubscribe@REDACTED
> > >>
> > >> --
> > >>
> > >> / Raimo Niskanen, Erlang/OTP, Ericsson AB
> > >>
> > >
> >
> > ________________________________________________________________
> > erlang-bugs (at) erlang.org mailing list.
> > See http://www.erlang.org/faq.html
> > To unsubscribe; mailto:erlang-bugs-unsubscribe@REDACTED
> >
> > --
> > / Raimo Niskanen, Erlang/OTP, Ericsson AB
>
> ________________________________________________________________
> erlang-bugs (at) erlang.org mailing list.
> See http://www.erlang.org/faq.html
> To unsubscribe; mailto:erlang-bugs-unsubscribe@REDACTED

--

/ Raimo Niskanen, Erlang/OTP, Ericsson AB

From rickard@REDACTED Thu May 27 15:37:10 2010
From: rickard@REDACTED (Rickard Green)
Date: Thu, 27 May 2010 15:37:10 +0200
Subject: Problem with function ethr_rwmutex_tryrlock
Message-ID: <4BFE7586.8020102@erlang.org>

> Hi list,
>
> I think the fallback version of function ethr_rwmutex_tryrlock in
> erts/lib_src/common/ethread.c is not correct. This function should be
> similar to pthread_rwlock_tryrdlock.
> For pthread_rwlock_tryrdlock, the
> calling thread acquires the read lock if a writer does not hold the lock and
> there are no writers blocked on the lock. But as the following code shows,
> ethr_rwmutex_tryrlock doesn't get the lock when there is no waiting writer,
> and acquires the lock when there are waiting writers. Am I right?
>
> ethr_rwmutex_tryrlock(ethr_rwmutex *rwmtx)
> {
>     int res;
> #if ETHR_XCHK
>     if (!rwmtx || rwmtx->initialized != ETHR_RWMUTEX_INITIALIZED) {
>         ASSERT(0);
>         return EINVAL;
>     }
> #endif
>     res = ethr_mutex_trylock__(&rwmtx->mtx);
>     if (res != 0)
>         return res;
>     if (!rwmtx->waiting_writers) {
>         res = ethr_mutex_unlock__(&rwmtx->mtx);
>         if (res == 0)
>             return EBUSY;
>         return res;
>     }
>     rwmtx->readers++;
>     return ethr_mutex_unlock__(&rwmtx->mtx);
> }
>
> Best Regards,
> Jianrong Zhang

Yes, you are right.

    if (!rwmtx->waiting_writers) {

should be

    if (rwmtx->waiting_writers) {

Thanks! It will be fixed in the upcoming release.

Regards,
Rickard

--
Rickard Green, Erlang/OTP, Ericsson AB.

From bob@REDACTED Thu May 27 16:11:03 2010
From: bob@REDACTED (Bob Ippolito)
Date: Thu, 27 May 2010 07:11:03 -0700
Subject: [erlang-bugs] R13B04 inet_res:resolve/4 inet_udp Port leak
In-Reply-To: <20100527100248.GA3917@erix.ericsson.se>
References: <20100526090626.GA17931@erix.ericsson.se>
 <20100527074354.GA5584@erix.ericsson.se>
 <20100527100248.GA3917@erix.ericsson.se>
Message-ID: 

I can confirm that the first commit fixes the port leak bug.

On Thu, May 27, 2010 at 3:02 AM, Raimo Niskanen
 wrote:
> I have created a fix for these problems:
>     git fetch git://github.com/RaimoNiskanen/otp.git rn/resolver-leaking-ports
>
> It will be included in 'pu'.
>
> Unfortunately, the second commit eliminates the bug trigger
> for what the first commit fixes. So to test if the bug fix
> is fixing the bug, one should apply the first commit only.
> > On Thu, May 27, 2010 at 09:43:54AM +0200, Raimo Niskanen wrote: >> On Wed, May 26, 2010 at 11:59:26AM -0700, Bob Ippolito wrote: >> > Here's the DNS packet that is being received as a response to the query: >> > >> > 1> inet_dns:decode(<<0,1,131,128,0,1,0,60,0,0,0,0,8,109,111,99,104,105,115,118,110, >> > 3,101,114,108,10,109,111,99,104,105,109,101,100,105,97,3,110, >> > 101,116,0,0,1,0,1>>). >> > {error,fmt} >> >> Thank you very much! I was about to give you detailed instructions >> about how to dig that up :-) >> >> I have spotted the problem. >> >> The DNS reply packet has got the TC (TrunCation) bit set and claims to contain >> 60 answer records, but actually contains zero. inet_dns expects to find >> 60 answer records if it says so. This is a hazy part of the >> DNS specifications and the resolver I tested truncation on did >> not do this kind of self-contradiction, but it _may_ be allowed >> by the specification... >> >> I regard it as a bug (or at least need-to-fix-problem) in inet_dns >> since it should be real-world compatible not just specification compatible. >> It should allow record shortage in a section if the TC bit is set. >> I'll try to fix it in R14A. >> >> Can you try my patch adding a missing udp_close(S) to see >> if it stops the leaking port problem? That is a more serious bug. >> >> > >> > On Wed, May 26, 2010 at 8:02 AM, Bob Ippolito wrote: >> > > Well, I'm not sure exactly which scenario is happening because I >> > > haven't looked at the packets yet, but the manual TCP retry is >> > > required. >> > > >> > > mochi@REDACTED:~$ /mochi/opt/erlang-R13B04/bin/erl >> > > Erlang R13B04 (erts-5.7.5) [source] [64-bit] [smp:8:8] [rq:8] >> > > [async-threads:4] [hipe] [kernel-poll:true] >> > > >> > > Eshell V5.7.5 ?(abort with ^G) >> > > 1> lists:filter(fun erlang:is_port/1, element(2, >> > > erlang:process_info(self(), links))). >> > > [] >> > > 2> inet_res:resolve("mochisvn.erl.mochimedia.net", in, a, []). 
>> > > {error,timeout} >> > > 3> lists:filter(fun erlang:is_port/1, element(2, >> > > erlang:process_info(self(), links))). >> > > [#Port<0.514>] >> > > 4> element(1, inet_res:resolve("mochisvn.erl.mochimedia.net", in, a, [usevc])). >> > > ok >> > > >> > > >> > > On Wed, May 26, 2010 at 2:06 AM, Raimo Niskanen >> > > wrote: >> > >> By reading the code it seems there is a bug when all nameservers >> > >> return an answer that causes decode errors, or can not be >> > >> contacted (enetunreach or econnrefused); then an >> > >> UDP port (or maybe two; one inet and one inet6) is leaked >> > >> since the inet_res:udp_close/1 is not called. >> > >> >> > >> This should be fixed with: >> > >> >> > >> diff --git a/lib/kernel/src/inet_res.erl b/lib/kernel/src/inet_res.erl >> > >> index 9b9e078..3d38a01 100644 >> > >> --- a/lib/kernel/src/inet_res.erl >> > >> +++ b/lib/kernel/src/inet_res.erl >> > >> @@ -592,6 +592,7 @@ query_retries(_Q, _NSs, _Timer, Retry, Retry, S) -> >> > >> ?query_retries(Q, NSs, Timer, Retry, I, S0) -> >> > >> ? ? Num = length(NSs), >> > >> ? ? if Num =:= 0 -> >> > >> + ? ? ? ? ? udp_close(S), >> > >> ? ? ? ? ? ?{error,timeout}; >> > >> ? ? ? ?true -> >> > >> ? ? ? ? ? ?case query_nss(Q, NSs, Timer, Retry, I, S0, []) of >> > >> >> > >> This "retry with TCP" trick of yours should really not be necessary >> > >> since inet_res retries with TCP if it gets a truncated UDP answer. >> > >> Have you got some other case when retrying with TCP is essential? >> > >> >> > >> Or, does your DNS server produce a (valid?) result that >> > >> triggers a debug bug in inet_res, causing the decode error, >> > >> triggering the port leak bug, forcing you to retry with TCP? >> > >> >> > >> On Tue, May 25, 2010 at 05:00:39PM -0700, Bob Ippolito wrote: >> > >>> It appears that there may be an inet_udp Port leak in >> > >>> inet_res:resolve/4, our current workaround is to spawn a new process >> > >>> to call this function. 
We've noticed this primarily for a service that >> > >>> regularly does a UDP DNS query that fails (because the response is too >> > >>> big) and then we retry over TCP. >> > >>> >> > >>> This is what the state of the process looked like when it was leaking ports: >> > >>> >> > >>> (node@REDACTED)1> length(lists:filter(fun erlang:is_port/1, element(2, >> > >>> erlang:process_info(whereis(dns_gen_server), links)))). >> > >>> 577 >> > >>> (node@REDACTED)2> lists:usort([erlang:port_info(P, name) || P <- >> > >>> lists:filter(fun erlang:is_port/1, element(2, >> > >>> erlang:process_info(whereis(dns_gen_server), links)))]). >> > >>> [{name,"udp_inet"}] >> > >>> >> > >>> The code looked like this, before the workaround was implemented: >> > >>> >> > >>> %% @spec dns(string()) -> [string()] >> > >>> %% @doc Return the A records (IPv4 IPs) as strings for the given Host name. >> > >>> %% ? ? This may return an empty list if there no A records for this Host name. >> > >>> dns(Host) when is_list(Host) -> >> > >>> ? ? dns(Host, fun inet_res:resolve/4). >> > >>> >> > >>> dns(Host, ResolveFun) -> >> > >>> ? ? case ResolveFun(Host, in, a, []) of >> > >>> ? ? ? ? {ok, Msg} -> >> > >>> ? ? ? ? ? ? ips_for_answers(Msg); >> > >>> ? ? ? ? {error, {nxdomain, _}} -> >> > >>> ? ? ? ? ? ? []; >> > >>> ? ? ? ? {error, timeout} -> >> > >>> ? ? ? ? ? ? %% retry with TCP >> > >>> ? ? ? ? ? ? case ResolveFun(Host, in, a, [{usevc, true}]) of >> > >>> ? ? ? ? ? ? ? ? {ok, Msg} -> >> > >>> ? ? ? ? ? ? ? ? ? ? ips_for_answers(Msg); >> > >>> ? ? ? ? ? ? ? ? {error, {nxdomain, _}} -> >> > >>> ? ? ? ? ? ? ? ? ? ? []; >> > >>> ? ? ? ? ? ? ? ? Error = {error, _} -> >> > >>> ? ? ? ? ? ? ? ? ? ? Error >> > >>> ? ? ? ? ? ? end; >> > >>> ? ? ? ? Error = {error, _} -> >> > >>> ? ? ? ? ? ? Error >> > >>> ? ? end. >> > >>> >> > >>> ips_for_answers(Msg) -> >> > >>> ? ? [inet_parse:ntoa(inet_dns:rr(Answer, data)) >> > >>> ? ? ?|| Answer <- inet_dns:msg(Msg, anlist)]. 
>> > >>> >> > >>> The workaround we used was to call it indirectly with this function, I >> > >>> couldn't find anything in OTP that did the same thing that didn't have >> > >>> local call optimizations. >> > >>> >> > >>> %% @spec process_apply(atom(), atom(), [term()]) -> term() >> > >>> %% @doc erlang:apply(M, F, A) in a temporary process and return the results. >> > >>> process_apply(M,F,A) -> >> > >>> ? ? %% We can't just use rpc here because there's a local call optimization. >> > >>> ? ? Parent = self(), >> > >>> ? ? Fun = fun () -> >> > >>> ? ? ? ? ? ? ? ? ? try >> > >>> ? ? ? ? ? ? ? ? ? ? ? Parent ! {self(), erlang:apply(M, F, A)} >> > >>> ? ? ? ? ? ? ? ? ? catch >> > >>> ? ? ? ? ? ? ? ? ? ? ? Class:Reason -> >> > >>> ? ? ? ? ? ? ? ? ? ? ? ? ? Stacktrace = erlang:get_stacktrace(), >> > >>> ? ? ? ? ? ? ? ? ? ? ? ? ? Parent ! {self(), Class, Reason, Stacktrace} >> > >>> ? ? ? ? ? ? ? ? ? end >> > >>> ? ? ? ? ? end, >> > >>> ? ? {Pid, Ref} = erlang:spawn_monitor(Fun), >> > >>> ? ? receive >> > >>> ? ? ? ? {Pid, Res} -> >> > >>> ? ? ? ? ? ? receive {'DOWN', Ref, process, Pid, _} -> ok end, >> > >>> ? ? ? ? ? ? Res; >> > >>> ? ? ? ? {Pid, Class, Reason, Stacktrace} -> >> > >>> ? ? ? ? ? ? receive {'DOWN', Ref, process, Pid, _} -> ok end, >> > >>> ? ? ? ? ? ? erlang:error(erlang:raise(Class, Reason, Stacktrace)); >> > >>> ? ? ? ? {'DOWN', Ref, process, Pid, Reason} -> >> > >>> ? ? ? ? ? ? erlang:exit(Reason) >> > >>> ? ? end. >> > >>> >> > >>> ________________________________________________________________ >> > >>> erlang-bugs (at) erlang.org mailing list. >> > >>> See http://www.erlang.org/faq.html >> > >>> To unsubscribe; mailto:erlang-bugs-unsubscribe@REDACTED >> > >> >> > >> -- >> > >> >> > >> / Raimo Niskanen, Erlang/OTP, Ericsson AB >> > >> >> > > >> > >> > ________________________________________________________________ >> > erlang-bugs (at) erlang.org mailing list. 
>> > See http://www.erlang.org/faq.html >> > To unsubscribe; mailto:erlang-bugs-unsubscribe@REDACTED >> > >> >> -- >> >> / Raimo Niskanen, Erlang/OTP, Ericsson AB >> >> ________________________________________________________________ >> erlang-bugs (at) erlang.org mailing list. >> See http://www.erlang.org/faq.html >> To unsubscribe; mailto:erlang-bugs-unsubscribe@REDACTED > > -- > > / Raimo Niskanen, Erlang/OTP, Ericsson AB > From raimo+erlang-bugs@REDACTED Thu May 27 16:26:20 2010 From: raimo+erlang-bugs@REDACTED (Raimo Niskanen) Date: Thu, 27 May 2010 16:26:20 +0200 Subject: [erlang-bugs] R13B04 inet_res:resolve/4 inet_udp Port leak In-Reply-To: References: <20100526090626.GA17931@erix.ericsson.se> <20100527074354.GA5584@erix.ericsson.se> <20100527100248.GA3917@erix.ericsson.se> Message-ID: <20100527142620.GA15167@erix.ericsson.se> On Thu, May 27, 2010 at 07:11:03AM -0700, Bob Ippolito wrote: > I can confirm that the first commit fixes the port leak bug. Great! Then that branch should be complete. The second commit makes your DNS reply message below decode with the TC bit set, which should make inet_res retry with 'usevc' internally, obsoleting your wrapper (hopefully). > > On Thu, May 27, 2010 at 3:02 AM, Raimo Niskanen > wrote: > > I have created a fix for these problems: > > ? ?git fetch git://github.com/RaimoNiskanen/otp.git rn/resolver-leaking-ports > > > > It will be included in 'pu'. > > > > Unfortunately, the second commit eliminates the bug trigger > > for what the first commit fixes. So to test if the bug fix > > is fixing the bug, one should apply the first commit only. 
> > > > On Thu, May 27, 2010 at 09:43:54AM +0200, Raimo Niskanen wrote: > >> On Wed, May 26, 2010 at 11:59:26AM -0700, Bob Ippolito wrote: > >> > Here's the DNS packet that is being received as a response to the query: > >> > > >> > 1> inet_dns:decode(<<0,1,131,128,0,1,0,60,0,0,0,0,8,109,111,99,104,105,115,118,110, > >> > 3,101,114,108,10,109,111,99,104,105,109,101,100,105,97,3,110, > >> > 101,116,0,0,1,0,1>>). > >> > {error,fmt} > >> > >> Thank you very much! I was about to give you detailed instructions > >> about how to dig that up :-) > >> > >> I have spotted the problem. > >> > >> The DNS reply packet has got the TC (TrunCation) bit set and claims to contain > >> 60 answer records, but actually contains zero. inet_dns expects to find > >> 60 answer records if it says so. This is a hazy part of the > >> DNS specifications and the resolver I tested truncation on did > >> not do this kind of self-contradiction, but it _may_ be allowed > >> by the specification... > >> > >> I regard it as a bug (or at least need-to-fix-problem) in inet_dns > >> since it should be real-world compatible not just specification compatible. > >> It should allow record shortage in a section if the TC bit is set. > >> I'll try to fix it in R14A. > >> > >> Can you try my patch adding a missing udp_close(S) to see > >> if it stops the leaking port problem? That is a more serious bug. > >> > >> > > >> > On Wed, May 26, 2010 at 8:02 AM, Bob Ippolito wrote: > >> > > Well, I'm not sure exactly which scenario is happening because I > >> > > haven't looked at the packets yet, but the manual TCP retry is > >> > > required. > >> > > > >> > > mochi@REDACTED:~$ /mochi/opt/erlang-R13B04/bin/erl > >> > > Erlang R13B04 (erts-5.7.5) [source] [64-bit] [smp:8:8] [rq:8] > >> > > [async-threads:4] [hipe] [kernel-poll:true] > >> > > > >> > > Eshell V5.7.5 ?(abort with ^G) > >> > > 1> lists:filter(fun erlang:is_port/1, element(2, > >> > > erlang:process_info(self(), links))). 
> >> > > [] > >> > > 2> inet_res:resolve("mochisvn.erl.mochimedia.net", in, a, []). > >> > > {error,timeout} > >> > > 3> lists:filter(fun erlang:is_port/1, element(2, > >> > > erlang:process_info(self(), links))). > >> > > [#Port<0.514>] > >> > > 4> element(1, inet_res:resolve("mochisvn.erl.mochimedia.net", in, a, [usevc])). > >> > > ok > >> > > > >> > > > >> > > On Wed, May 26, 2010 at 2:06 AM, Raimo Niskanen > >> > > wrote: > >> > >> By reading the code it seems there is a bug when all nameservers > >> > >> return an answer that causes decode errors, or can not be > >> > >> contacted (enetunreach or econnrefused); then an > >> > >> UDP port (or maybe two; one inet and one inet6) is leaked > >> > >> since the inet_res:udp_close/1 is not called. > >> > >> > >> > >> This should be fixed with: > >> > >> > >> > >> diff --git a/lib/kernel/src/inet_res.erl b/lib/kernel/src/inet_res.erl > >> > >> index 9b9e078..3d38a01 100644 > >> > >> --- a/lib/kernel/src/inet_res.erl > >> > >> +++ b/lib/kernel/src/inet_res.erl > >> > >> @@ -592,6 +592,7 @@ query_retries(_Q, _NSs, _Timer, Retry, Retry, S) -> > >> > >> ?query_retries(Q, NSs, Timer, Retry, I, S0) -> > >> > >> ? ? Num = length(NSs), > >> > >> ? ? if Num =:= 0 -> > >> > >> + ? ? ? ? ? udp_close(S), > >> > >> ? ? ? ? ? ?{error,timeout}; > >> > >> ? ? ? ?true -> > >> > >> ? ? ? ? ? ?case query_nss(Q, NSs, Timer, Retry, I, S0, []) of > >> > >> > >> > >> This "retry with TCP" trick of yours should really not be necessary > >> > >> since inet_res retries with TCP if it gets a truncated UDP answer. > >> > >> Have you got some other case when retrying with TCP is essential? > >> > >> > >> > >> Or, does your DNS server produce a (valid?) result that > >> > >> triggers a debug bug in inet_res, causing the decode error, > >> > >> triggering the port leak bug, forcing you to retry with TCP? 
> >> > >> > >> > >> On Tue, May 25, 2010 at 05:00:39PM -0700, Bob Ippolito wrote: > >> > >>> It appears that there may be an inet_udp Port leak in > >> > >>> inet_res:resolve/4, our current workaround is to spawn a new process > >> > >>> to call this function. We've noticed this primarily for a service that > >> > >>> regularly does a UDP DNS query that fails (because the response is too > >> > >>> big) and then we retry over TCP. > >> > >>> > >> > >>> This is what the state of the process looked like when it was leaking ports: > >> > >>> > >> > >>> (node@REDACTED)1> length(lists:filter(fun erlang:is_port/1, element(2, > >> > >>> erlang:process_info(whereis(dns_gen_server), links)))). > >> > >>> 577 > >> > >>> (node@REDACTED)2> lists:usort([erlang:port_info(P, name) || P <- > >> > >>> lists:filter(fun erlang:is_port/1, element(2, > >> > >>> erlang:process_info(whereis(dns_gen_server), links)))]). > >> > >>> [{name,"udp_inet"}] > >> > >>> > >> > >>> The code looked like this, before the workaround was implemented: > >> > >>> > >> > >>> %% @spec dns(string()) -> [string()] > >> > >>> %% @doc Return the A records (IPv4 IPs) as strings for the given Host name. > >> > >>> %% ? ? This may return an empty list if there no A records for this Host name. > >> > >>> dns(Host) when is_list(Host) -> > >> > >>> ? ? dns(Host, fun inet_res:resolve/4). > >> > >>> > >> > >>> dns(Host, ResolveFun) -> > >> > >>> ? ? case ResolveFun(Host, in, a, []) of > >> > >>> ? ? ? ? {ok, Msg} -> > >> > >>> ? ? ? ? ? ? ips_for_answers(Msg); > >> > >>> ? ? ? ? {error, {nxdomain, _}} -> > >> > >>> ? ? ? ? ? ? []; > >> > >>> ? ? ? ? {error, timeout} -> > >> > >>> ? ? ? ? ? ? %% retry with TCP > >> > >>> ? ? ? ? ? ? case ResolveFun(Host, in, a, [{usevc, true}]) of > >> > >>> ? ? ? ? ? ? ? ? {ok, Msg} -> > >> > >>> ? ? ? ? ? ? ? ? ? ? ips_for_answers(Msg); > >> > >>> ? ? ? ? ? ? ? ? {error, {nxdomain, _}} -> > >> > >>> ? ? ? ? ? ? ? ? ? ? []; > >> > >>> ? ? ? ? ? ? ? ? Error = {error, _} -> > >> > >>> ? 
? ? ? ? ? ? ? ? ? Error > >> > >>> ? ? ? ? ? ? end; > >> > >>> ? ? ? ? Error = {error, _} -> > >> > >>> ? ? ? ? ? ? Error > >> > >>> ? ? end. > >> > >>> > >> > >>> ips_for_answers(Msg) -> > >> > >>> ? ? [inet_parse:ntoa(inet_dns:rr(Answer, data)) > >> > >>> ? ? ?|| Answer <- inet_dns:msg(Msg, anlist)]. > >> > >>> > >> > >>> The workaround we used was to call it indirectly with this function, I > >> > >>> couldn't find anything in OTP that did the same thing that didn't have > >> > >>> local call optimizations. > >> > >>> > >> > >>> %% @spec process_apply(atom(), atom(), [term()]) -> term() > >> > >>> %% @doc erlang:apply(M, F, A) in a temporary process and return the results. > >> > >>> process_apply(M,F,A) -> > >> > >>> ? ? %% We can't just use rpc here because there's a local call optimization. > >> > >>> ? ? Parent = self(), > >> > >>> ? ? Fun = fun () -> > >> > >>> ? ? ? ? ? ? ? ? ? try > >> > >>> ? ? ? ? ? ? ? ? ? ? ? Parent ! {self(), erlang:apply(M, F, A)} > >> > >>> ? ? ? ? ? ? ? ? ? catch > >> > >>> ? ? ? ? ? ? ? ? ? ? ? Class:Reason -> > >> > >>> ? ? ? ? ? ? ? ? ? ? ? ? ? Stacktrace = erlang:get_stacktrace(), > >> > >>> ? ? ? ? ? ? ? ? ? ? ? ? ? Parent ! {self(), Class, Reason, Stacktrace} > >> > >>> ? ? ? ? ? ? ? ? ? end > >> > >>> ? ? ? ? ? end, > >> > >>> ? ? {Pid, Ref} = erlang:spawn_monitor(Fun), > >> > >>> ? ? receive > >> > >>> ? ? ? ? {Pid, Res} -> > >> > >>> ? ? ? ? ? ? receive {'DOWN', Ref, process, Pid, _} -> ok end, > >> > >>> ? ? ? ? ? ? Res; > >> > >>> ? ? ? ? {Pid, Class, Reason, Stacktrace} -> > >> > >>> ? ? ? ? ? ? receive {'DOWN', Ref, process, Pid, _} -> ok end, > >> > >>> ? ? ? ? ? ? erlang:error(erlang:raise(Class, Reason, Stacktrace)); > >> > >>> ? ? ? ? {'DOWN', Ref, process, Pid, Reason} -> > >> > >>> ? ? ? ? ? ? erlang:exit(Reason) > >> > >>> ? ? end. > >> > >>> > >> > >>> ________________________________________________________________ > >> > >>> erlang-bugs (at) erlang.org mailing list. 
> >> > >>> See http://www.erlang.org/faq.html > >> > >>> To unsubscribe; mailto:erlang-bugs-unsubscribe@REDACTED > >> > >> > >> > >> -- > >> > >> > >> > >> / Raimo Niskanen, Erlang/OTP, Ericsson AB > >> > >> > >> > > > >> > > >> > ________________________________________________________________ > >> > erlang-bugs (at) erlang.org mailing list. > >> > See http://www.erlang.org/faq.html > >> > To unsubscribe; mailto:erlang-bugs-unsubscribe@REDACTED > >> > > >> > >> -- > >> > >> / Raimo Niskanen, Erlang/OTP, Ericsson AB > >> > >> ________________________________________________________________ > >> erlang-bugs (at) erlang.org mailing list. > >> See http://www.erlang.org/faq.html > >> To unsubscribe; mailto:erlang-bugs-unsubscribe@REDACTED > > > > -- > > > > / Raimo Niskanen, Erlang/OTP, Ericsson AB > > > > ________________________________________________________________ > erlang-bugs (at) erlang.org mailing list. > See http://www.erlang.org/faq.html > To unsubscribe; mailto:erlang-bugs-unsubscribe@REDACTED > -- / Raimo Niskanen, Erlang/OTP, Ericsson AB From vlm@REDACTED Sat May 29 02:49:11 2010 From: vlm@REDACTED (Lev Walkin) Date: Fri, 28 May 2010 17:49:11 -0700 Subject: http:request memory leak in R12B04 In-Reply-To: <87d3wf69a3.fsf@cronqvi.st> References: <484B8939-A4BD-4F95-986B-973C82F84D54@lionet.info> <87d3wf69a3.fsf@cronqvi.st> Message-ID: <34B378AB-7963-4819-8AA8-C81AD913648E@lionet.info> No idea. I haven't received any follow-up, and the available code base on git never changed in this respect, as far as I can tell. We are running with a custom version on our servers and thinking of dropping Erlang in favor of PHP. On May 28, 2010, at 1:50 PM, mats cronqvist wrote: > Lev Walkin writes: > > what's up with this? renders httpc unusable, as far as I can tell. > > mats > >> The R12B04 release brought a reliable memory leak to http:request >> that >> was never in there before. 
-- vlm From vlm@REDACTED Sat May 29 06:46:40 2010 From: vlm@REDACTED (Lev Walkin) Date: Fri, 28 May 2010 21:46:40 -0700 Subject: [erlang-bugs] Re: http:request memory leak in R12B04 In-Reply-To: <34B378AB-7963-4819-8AA8-C81AD913648E@lionet.info> References: <484B8939-A4BD-4F95-986B-973C82F84D54@lionet.info> <87d3wf69a3.fsf@cronqvi.st> <34B378AB-7963-4819-8AA8-C81AD913648E@lionet.info> Message-ID: I must correct myself here. Just checked the OTP source code and it turned out the bug was finally fixed in inets-5.3.2, roughly in accord with the patch we've submitted on March 22. Here's the proof: http://github.com/erlang/otp/commit/91c89d54d45989a85367f10d5902b9b508754a49 On May 28, 2010, at 5:49 PM, Lev Walkin wrote: > > No idea. I haven't received any follow-up, and the available code > base on git never changed in this respect, as far as I can tell. > > We are running with a custom version on our servers and thinking of > dropping Erlang in favor of PHP. > > > On May 28, 2010, at 1:50 PM, mats cronqvist wrote: > >> Lev Walkin writes: >> >> what's up with this? renders httpc unusable, as far as I can tell. >> >> mats >> >>> The R12B04 release brought a reliable memory leak to http:request >>> that >>> was never in there before. > > -- > vlm > > > ________________________________________________________________ > erlang-bugs (at) erlang.org mailing list. > See http://www.erlang.org/faq.html > To unsubscribe; mailto:erlang-bugs-unsubscribe@REDACTED > -- vlm From kostis@REDACTED Sat May 29 14:35:59 2010 From: kostis@REDACTED (Kostis Sagonas) Date: Sat, 29 May 2010 15:35:59 +0300 Subject: Inviso Message-ID: <4C010A2F.30004@cs.ntua.gr> Ulf Wiger suggested a cleanup of the 'inviso' application by tidier and I've already done that. I'll submit a patch via github early next week. 
Before starting, I've already noticed that dialyzer complains that:

  inviso_tool.erl:586: The pattern {'error', Reason} can never match the type {'ok',#ld{...}}

which refers to the init/1 function:

init(Config) ->
    case fetch_configuration(Config) of   % From conf-file and Config.
        {ok,#ld{}=LD} ->
            case start_inviso_at_c_node(LD) of
                ...
            end;
        {error,Reason} ->
            {stop,{error,{start_up,Reason}}}
    end.

due to the fact that fetch_configuration/1 returns {ok,...} in all its branches -- even in the error case:

fetch_configuration(Config) ->
    case fetch_config_filename(Config) of
        {ok,FName} ->                  % We are supposed to use a conf-file.
            case read_config_file(FName) of
                {ok,LD} ->             % Managed to open a file.
                    NewLD=read_config_list(LD,Config),
                    {ok,NewLD};
                {error,_Reason} ->     % Problem finding/opening file.
                    LD=read_config_list(#ld{},Config),
                    {ok,LD}
            end;
        false ->                       % No filename specified.
            LD=read_config_list(#ld{},Config),
            {ok,LD}
    end.

Dialyzer is right, but the question is how this should be fixed. Simply by removing the {error,Reason} clause from init/1, or by returning {error,Reason} in the error case of fetch_configuration/1?

Kostis

From kostis@REDACTED Sat May 29 20:41:25 2010
From: kostis@REDACTED (Kostis Sagonas)
Date: Sat, 29 May 2010 21:41:25 +0300
Subject: [erlang-bugs] Inviso
In-Reply-To: <4C010A2F.30004@cs.ntua.gr>
References: <4C010A2F.30004@cs.ntua.gr>
Message-ID: <4C015FD5.6070706@cs.ntua.gr>

Some more confusion in inviso. There is a record definition which reads:

%% The loopdata record.
-record(ld,{...
            session_state=passive,    % passive | tracing
            ...}).

leading one to believe that this field is to be assigned the values 'passive' or 'tracing'. Yet, on line 844 there is an assignment:

    LD#ld{session_state=passive_sessionstate(),
          nodes=NewNodesD,
          ....

The problem is that the definition of passive_sessionstate/0 reads (the comment is actually from the code - line 2962):

%% Returns the correct value indicating that the tool is not tracing.
passive_sessionstate() ->
    idle.
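One way to resolve the record/helper mismatch is sketched below, assuming the record comment (passive | tracing) is authoritative. The module and its reduced record are illustrative only, not the actual inviso code.

```erlang
%% Sketch only (not the actual inviso fix): make the helper agree with
%% the record comment, so session_state is always passive | tracing.
-module(ld_demo).
-export([new_ld/0]).

-record(ld, {session_state = passive}).   % passive | tracing

%% Was: idle. Aligned with the documented value set.
passive_sessionstate() ->
    passive.

%% Build a loopdata record in the non-tracing state.
new_ld() ->
    #ld{session_state = passive_sessionstate()}.
```

The alternative, of course, is to declare 'idle' as the real non-tracing value and fix the record comment instead; either way, one of the two places has to change so that any comparison against the field uses the value the helper actually produces.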
Does anybody know which are the values that this field can have? Kostis From raimo+erlang-bugs@REDACTED Mon May 31 14:46:13 2010 From: raimo+erlang-bugs@REDACTED (Raimo Niskanen) Date: Mon, 31 May 2010 14:46:13 +0200 Subject: [erlang-bugs] R13B04 inet_res:resolve/4 inet_udp Port leak In-Reply-To: <20100527142620.GA15167@erix.ericsson.se> References: <20100526090626.GA17931@erix.ericsson.se> <20100527074354.GA5584@erix.ericsson.se> <20100527100248.GA3917@erix.ericsson.se> <20100527142620.GA15167@erix.ericsson.se> Message-ID: <20100531124613.GA15756@erix.ericsson.se> On Thu, May 27, 2010 at 04:26:20PM +0200, Raimo Niskanen wrote: > On Thu, May 27, 2010 at 07:11:03AM -0700, Bob Ippolito wrote: > > I can confirm that the first commit fixes the port leak bug. > > Great! Then that branch should be complete. The second > commit makes your DNS reply message below decode with > the TC bit set, which should make inet_res retry with > 'usevc' internally, obsoleting your wrapper (hopefully). 4aa2ead3149d3727ec6ad67b653ff51c74405671 New commit. The previous only worked for your special case. What is not tested does not work... It will be included in 'pu'. > > > > > On Thu, May 27, 2010 at 3:02 AM, Raimo Niskanen > > wrote: > > > I have created a fix for these problems: > > > ? ?git fetch git://github.com/RaimoNiskanen/otp.git rn/resolver-leaking-ports > > > > > > It will be included in 'pu'. > > > > > > Unfortunately, the second commit eliminates the bug trigger > > > for what the first commit fixes. So to test if the bug fix > > > is fixing the bug, one should apply the first commit only. 
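For reference, the header fields of the truncated reply quoted in this thread can be pulled apart with a plain bit-syntax match. This is a toy decoder following the standard RFC 1035 header layout, not inet_dns; applied to the first 12 bytes of the posted packet it shows TC = 1 and ANCount = 60 even though no answer records follow the question section, matching Raimo's diagnosis.

```erlang
%% Toy DNS header decoder (RFC 1035 layout); not inet_dns.
-module(dns_hdr).
-export([header/1]).

%% Match the 12-byte DNS header at the front of a message binary and
%% return the fields relevant to the truncation discussion.
header(<<Id:16, _QR:1, _Opcode:4, _AA:1, TC:1, _RD:1,
         _RA:1, _Z:3, _RCode:4,
         QDCount:16, ANCount:16, NSCount:16, ARCount:16, _/binary>>) ->
    [{id, Id}, {tc, TC},
     {qdcount, QDCount}, {ancount, ANCount},
     {nscount, NSCount}, {arcount, ARCount}].
```

Feeding it the leading bytes of the reply Bob posted (`<<0,1,131,128,0,1,0,60,0,0,0,0>>`) yields `{tc,1}` and `{ancount,60}`: a self-contradictory header that a tolerant decoder should treat as "truncated, retry over TCP" rather than a format error.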
> > > > > > On Thu, May 27, 2010 at 09:43:54AM +0200, Raimo Niskanen wrote: > > >> On Wed, May 26, 2010 at 11:59:26AM -0700, Bob Ippolito wrote: > > >> > Here's the DNS packet that is being received as a response to the query: > > >> > > > >> > 1> inet_dns:decode(<<0,1,131,128,0,1,0,60,0,0,0,0,8,109,111,99,104,105,115,118,110, > > >> > 3,101,114,108,10,109,111,99,104,105,109,101,100,105,97,3,110, > > >> > 101,116,0,0,1,0,1>>). > > >> > {error,fmt} > > >> > > >> Thank you very much! I was about to give you detailed instructions > > >> about how to dig that up :-) > > >> > > >> I have spotted the problem. > > >> > > >> The DNS reply packet has got the TC (TrunCation) bit set and claims to contain > > >> 60 answer records, but actually contains zero. inet_dns expects to find > > >> 60 answer records if it says so. This is a hazy part of the > > >> DNS specifications and the resolver I tested truncation on did > > >> not do this kind of self-contradiction, but it _may_ be allowed > > >> by the specification... > > >> > > >> I regard it as a bug (or at least need-to-fix-problem) in inet_dns > > >> since it should be real-world compatible not just specification compatible. > > >> It should allow record shortage in a section if the TC bit is set. > > >> I'll try to fix it in R14A. > > >> > > >> Can you try my patch adding a missing udp_close(S) to see > > >> if it stops the leaking port problem? That is a more serious bug. > > >> > > >> > > > >> > On Wed, May 26, 2010 at 8:02 AM, Bob Ippolito wrote: > > >> > > Well, I'm not sure exactly which scenario is happening because I > > >> > > haven't looked at the packets yet, but the manual TCP retry is > > >> > > required. 
> > >> > >
> > >> > > mochi@REDACTED:~$ /mochi/opt/erlang-R13B04/bin/erl
> > >> > > Erlang R13B04 (erts-5.7.5) [source] [64-bit] [smp:8:8] [rq:8]
> > >> > > [async-threads:4] [hipe] [kernel-poll:true]
> > >> > >
> > >> > > Eshell V5.7.5  (abort with ^G)
> > >> > > 1> lists:filter(fun erlang:is_port/1, element(2,
> > >> > > erlang:process_info(self(), links))).
> > >> > > []
> > >> > > 2> inet_res:resolve("mochisvn.erl.mochimedia.net", in, a, []).
> > >> > > {error,timeout}
> > >> > > 3> lists:filter(fun erlang:is_port/1, element(2,
> > >> > > erlang:process_info(self(), links))).
> > >> > > [#Port<0.514>]
> > >> > > 4> element(1, inet_res:resolve("mochisvn.erl.mochimedia.net", in, a, [usevc])).
> > >> > > ok
> > >> > >
> > >> > > On Wed, May 26, 2010 at 2:06 AM, Raimo Niskanen wrote:
> > >> > >> By reading the code it seems there is a bug when all nameservers
> > >> > >> return an answer that causes decode errors, or can not be
> > >> > >> contacted (enetunreach or econnrefused); then a
> > >> > >> UDP port (or maybe two; one inet and one inet6) is leaked
> > >> > >> since inet_res:udp_close/1 is not called.
> > >> > >>
> > >> > >> This should be fixed with:
> > >> > >>
> > >> > >> diff --git a/lib/kernel/src/inet_res.erl b/lib/kernel/src/inet_res.erl
> > >> > >> index 9b9e078..3d38a01 100644
> > >> > >> --- a/lib/kernel/src/inet_res.erl
> > >> > >> +++ b/lib/kernel/src/inet_res.erl
> > >> > >> @@ -592,6 +592,7 @@ query_retries(_Q, _NSs, _Timer, Retry, Retry, S) ->
> > >> > >>  query_retries(Q, NSs, Timer, Retry, I, S0) ->
> > >> > >>      Num = length(NSs),
> > >> > >>      if Num =:= 0 ->
> > >> > >> +           udp_close(S),
> > >> > >>             {error,timeout};
> > >> > >>         true ->
> > >> > >>             case query_nss(Q, NSs, Timer, Retry, I, S0, []) of
> > >> > >>
> > >> > >> This "retry with TCP" trick of yours should really not be necessary
> > >> > >> since inet_res retries with TCP if it gets a truncated UDP answer.
> > >> > >> Have you got some other case when retrying with TCP is essential?
> > >> > >>
> > >> > >> Or, does your DNS server produce a (valid?) result that
> > >> > >> triggers a bug in inet_res, causing the decode error,
> > >> > >> triggering the port leak bug, forcing you to retry with TCP?
> > >> > >>
> > >> > >> On Tue, May 25, 2010 at 05:00:39PM -0700, Bob Ippolito wrote:
> > >> > >>> It appears that there may be an inet_udp Port leak in
> > >> > >>> inet_res:resolve/4; our current workaround is to spawn a new process
> > >> > >>> to call this function. We've noticed this primarily for a service that
> > >> > >>> regularly does a UDP DNS query that fails (because the response is too
> > >> > >>> big) and then we retry over TCP.
> > >> > >>>
> > >> > >>> This is what the state of the process looked like when it was leaking ports:
> > >> > >>>
> > >> > >>> (node@REDACTED)1> length(lists:filter(fun erlang:is_port/1, element(2,
> > >> > >>> erlang:process_info(whereis(dns_gen_server), links)))).
> > >> > >>> 577
> > >> > >>> (node@REDACTED)2> lists:usort([erlang:port_info(P, name) || P <-
> > >> > >>> lists:filter(fun erlang:is_port/1, element(2,
> > >> > >>> erlang:process_info(whereis(dns_gen_server), links)))]).
> > >> > >>> [{name,"udp_inet"}]
> > >> > >>>
> > >> > >>> The code looked like this, before the workaround was implemented:
> > >> > >>>
> > >> > >>> %% @spec dns(string()) -> [string()]
> > >> > >>> %% @doc Return the A records (IPv4 IPs) as strings for the given Host name.
> > >> > >>> %%      This may return an empty list if there are no A records for this Host name.
> > >> > >>> dns(Host) when is_list(Host) ->
> > >> > >>>     dns(Host, fun inet_res:resolve/4).
> > >> > >>>
> > >> > >>> dns(Host, ResolveFun) ->
> > >> > >>>     case ResolveFun(Host, in, a, []) of
> > >> > >>>         {ok, Msg} ->
> > >> > >>>             ips_for_answers(Msg);
> > >> > >>>         {error, {nxdomain, _}} ->
> > >> > >>>             [];
> > >> > >>>         {error, timeout} ->
> > >> > >>>             %% retry with TCP
> > >> > >>>             case ResolveFun(Host, in, a, [{usevc, true}]) of
> > >> > >>>                 {ok, Msg} ->
> > >> > >>>                     ips_for_answers(Msg);
> > >> > >>>                 {error, {nxdomain, _}} ->
> > >> > >>>                     [];
> > >> > >>>                 Error = {error, _} ->
> > >> > >>>                     Error
> > >> > >>>             end;
> > >> > >>>         Error = {error, _} ->
> > >> > >>>             Error
> > >> > >>>     end.
> > >> > >>>
> > >> > >>> ips_for_answers(Msg) ->
> > >> > >>>     [inet_parse:ntoa(inet_dns:rr(Answer, data))
> > >> > >>>      || Answer <- inet_dns:msg(Msg, anlist)].
> > >> > >>>
> > >> > >>> The workaround we used was to call it indirectly with this function; I
> > >> > >>> couldn't find anything in OTP that did the same thing that didn't have
> > >> > >>> local call optimizations.
> > >> > >>>
> > >> > >>> %% @spec process_apply(atom(), atom(), [term()]) -> term()
> > >> > >>> %% @doc erlang:apply(M, F, A) in a temporary process and return the results.
> > >> > >>> process_apply(M,F,A) ->
> > >> > >>>     %% We can't just use rpc here because there's a local call optimization.
> > >> > >>>     Parent = self(),
> > >> > >>>     Fun = fun () ->
> > >> > >>>                   try
> > >> > >>>                       Parent ! {self(), erlang:apply(M, F, A)}
> > >> > >>>                   catch
> > >> > >>>                       Class:Reason ->
> > >> > >>>                           Stacktrace = erlang:get_stacktrace(),
> > >> > >>>                           Parent ! {self(), Class, Reason, Stacktrace}
> > >> > >>>                   end
> > >> > >>>           end,
> > >> > >>>     {Pid, Ref} = erlang:spawn_monitor(Fun),
> > >> > >>>     receive
> > >> > >>>         {Pid, Res} ->
> > >> > >>>             receive {'DOWN', Ref, process, Pid, _} -> ok end,
> > >> > >>>             Res;
> > >> > >>>         {Pid, Class, Reason, Stacktrace} ->
> > >> > >>>             receive {'DOWN', Ref, process, Pid, _} -> ok end,
> > >> > >>>             erlang:error(erlang:raise(Class, Reason, Stacktrace));
> > >> > >>>         {'DOWN', Ref, process, Pid, Reason} ->
> > >> > >>>             erlang:exit(Reason)
> > >> > >>>     end.
> > >> > >>>
> > >> > >>> ________________________________________________________________
> > >> > >>> erlang-bugs (at) erlang.org mailing list.
> > >> > >>> See http://www.erlang.org/faq.html
> > >> > >>> To unsubscribe; mailto:erlang-bugs-unsubscribe@REDACTED
> > >> > >>
> > >> > >> --
> > >> > >> / Raimo Niskanen, Erlang/OTP, Ericsson AB

--

/ Raimo Niskanen, Erlang/OTP, Ericsson AB
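[Archive note: the self-contradiction Raimo diagnoses above is visible in the raw header of the packet Bob pasted. The following is a minimal sketch, in Python rather than Erlang, decoding the first 12 bytes by hand; it is not part of inet_dns, and the field layout assumed here is the standard RFC 1035 DNS header.]

```python
import struct

# First 12 bytes (the header) of the DNS reply quoted in the thread:
# <<0,1,131,128,0,1,0,60,0,0,0,0,...>>
header = bytes([0, 1, 131, 128, 0, 1, 0, 60, 0, 0, 0, 0])

# RFC 1035 header: ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT,
# each a 16-bit big-endian integer.
ident, flags, qdcount, ancount, nscount, arcount = struct.unpack("!6H", header)

tc = (flags >> 9) & 1  # TC (TrunCation) bit within the flags word

# TC is set, yet ANCOUNT claims 60 answer records while the packet
# carries none -- the case inet_dns rejected with {error,fmt}.
print("TC =", tc, "ANCOUNT =", ancount)
```

Running this prints `TC = 1 ANCOUNT = 60`, confirming that the reply advertises 60 answer records it does not contain, with truncation flagged.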