[erlang-questions] Erlang socket doesn't receive until the second setopts {active, once}

Wed Oct 19 23:44:09 CEST 2011

First I want to apologize for also posting this question to Stack
Overflow. I'm not sure if it's bad form to post a question to multiple
places, but I had already posted it to SO when it occurred to me that
I should just ask the mailing list directly.

Anyway

(Running Erlang R14B04, kernel 2.6.18-194, CentOS 5.5)

I have a very strange problem. I have the following code to listen and
process sockets:

%Opts used to make listen socket
-define(TCP_OPTS, [binary, {packet, raw}, {nodelay, true},
                   {reuseaddr, true}, {active, false},
                   {keepalive, true}]).

%Acceptor loop which spawns off sock processors when connections
%come in
accept_loop(Listen) ->
    case gen_tcp:accept(Listen) of
    {ok, Socket} ->
        Pid = spawn(fun()->?MODULE:process_sock(Socket) end),
        gen_tcp:controlling_process(Socket,Pid);
    {error,_} -> do_nothing
    end,
    ?MODULE:accept_loop(Listen).

%Probably not relevant
process_sock(Sock) ->
    case inet:peername(Sock) of
    {ok,{Ip,_Port}} ->
        case Ip of
        {172,16,_,_} -> Auth = true;
        _ -> Auth = lists:member(Ip,?PUB_IPS)
        end,
        ?MODULE:process_sock_loop(Sock,Auth);
    _ -> gen_tcp:close(Sock)
    end.

process_sock_loop(Sock,Auth) ->
    try inet:setopts(Sock,[{active,once}]) of
    ok ->
        receive
        {tcp_closed,_} ->
            ?MODULE:prepare_for_death(Sock,[]);
        {tcp_error,_,etimedout} ->
            ?MODULE:prepare_for_death(Sock,[]);

        %Not getting here
        {tcp,Sock,Data} ->
            ?MODULE:do_stuff(Sock,Data);

        {die,From} ->
            From!{ok,[]};
        _ ->
            ?MODULE:process_sock_loop(Sock,Auth)
        after 60000 ->
            ?MODULE:process_sock_loop(Sock,Auth)
        end;
    {error,_} ->
        ?MODULE:prepare_for_death(Sock,[])
    catch _:_ ->
        ?MODULE:prepare_for_death(Sock,[])
    end.

This whole setup normally works wonderfully, and has been working for
the past few months. The server operates as a message passing server
with long-held TCP connections, and it holds about 100k connections on
average. However, we're now trying to use the server more heavily.
We're making two long-held connections (in the future probably more)
to the Erlang server and sending a few hundred commands per second on
each of those connections. Each of those commands, in the common
case, spawns a new process which will probably make some kind of
read from mnesia and send some messages based on that.

The strangeness comes when we try to test those two command
connections. When we turn on the stream of commands, any new
connection has about a 50% chance of hanging. For instance, if I
connect using netcat and send the string "blahblahblah", the server
should immediately return an error. In doing this it won't make any
calls outside the process handling the socket (since all it's doing is
trying to parse the command, which fails because blahblahblah isn't a
command). But about 50% of the time (when the two command connections
are running) typing in blahblahblah results in the server just sitting
there for 60 seconds before returning that error.
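
The netcat test above could also be scripted so that the delay is
measured rather than eyeballed. What follows is only a minimal sketch:
the module name hang_probe, the function probe/4, and the local
stand-in server in demo/0 (which replies <<"error\n">> to anything)
are hypothetical and are not part of the real server.

```erlang
-module(hang_probe).
-export([probe/4, demo/0]).

%% Connect, send Payload, and wait up to TimeoutMs for the first reply.
%% Returns {ElapsedMs, Reply}, where Reply is whatever gen_tcp:recv/3
%% returned ({ok, Data} or {error, Reason}).
probe(Host, Port, Payload, TimeoutMs) ->
    {ok, Sock} = gen_tcp:connect(Host, Port,
                                 [binary, {packet, raw}, {active, false}]),
    {Micros, Reply} =
        timer:tc(fun() ->
                     ok = gen_tcp:send(Sock, Payload),
                     gen_tcp:recv(Sock, 0, TimeoutMs)
                 end),
    ok = gen_tcp:close(Sock),
    {Micros div 1000, Reply}.

%% Hypothetical stand-in for the real server: accepts one connection,
%% reads whatever arrives, and answers with an error line.
demo() ->
    {ok, Listen} = gen_tcp:listen(0, [binary, {packet, raw},
                                      {active, false}, {reuseaddr, true}]),
    {ok, Port} = inet:port(Listen),
    spawn_link(fun() ->
                   {ok, S} = gen_tcp:accept(Listen),
                   {ok, _} = gen_tcp:recv(S, 0),
                   ok = gen_tcp:send(S, <<"error\n">>),
                   gen_tcp:close(S)
               end),
    Result = probe({127,0,0,1}, Port, <<"blahblahblah">>, 5000),
    ok = gen_tcp:close(Listen),
    Result.
```

Against the real server, something like
hang_probe:probe(Host, Port, <<"blahblahblah">>, 70000) run in a loop
should show roughly half of the replies taking about 60000 ms while
the command connections are active, if the behaviour described above
holds.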

In trying to debug this I pulled up Wireshark. The TCP handshake
always completes immediately, and when the first packet from the
client (netcat) is sent the server ACKs it immediately, which tells me
that the kernel's TCP stack isn't the bottleneck. My only guess is
that the problem lies in the process_sock_loop function. It has a
receive which goes back to the top of the function after 60 seconds
and tries again to get more from the socket. My best guess is that the
following is happening:

- Connection is made, the new process moves on to process_sock_loop
- {active,once} is set
- The process sits in receive, but doesn't get the data even though
  it's there
- After 60 seconds the process goes back to the top of
  process_sock_loop
- {active,once} is set again
- This time the data comes through, things proceed as normal
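
One way the guess above might be checked is to record the socket's
state at the moment the 60 second timeout fires. The following is a
rough sketch only: the module sock_snapshot and the function
snapshot/1 are hypothetical names, and it assumes the socket is an
ordinary port so that erlang:port_info/2 applies. demo/0 just
exercises the helper on a loopback connection.

```erlang
-module(sock_snapshot).
-export([snapshot/1, demo/0]).

%% Hypothetical helper: collect the socket's active mode, the process
%% that currently owns the socket, and the caller's pid and mailbox
%% length, so they can be logged when the receive times out.
snapshot(Sock) ->
    {ok, [{active, Mode}]} = inet:getopts(Sock, [active]),
    {connected, Owner} = erlang:port_info(Sock, connected),
    {message_queue_len, QLen} =
        erlang:process_info(self(), message_queue_len),
    [{active, Mode}, {owner, Owner}, {self, self()}, {queue_len, QLen}].

%% Loopback demonstration: the mode is 'once' after setopts and drops
%% back to 'false' as soon as one {tcp, ...} message is delivered.
demo() ->
    {ok, Listen} = gen_tcp:listen(0, [binary, {active, false}]),
    {ok, Port} = inet:port(Listen),
    {ok, Client} = gen_tcp:connect({127,0,0,1}, Port,
                                   [binary, {active, false}]),
    {ok, Sock} = gen_tcp:accept(Listen),
    ok = inet:setopts(Sock, [{active, once}]),
    Before = snapshot(Sock),
    ok = gen_tcp:send(Client, <<"blahblahblah">>),
    receive
        {tcp, Sock, _Data} -> ok
    after 5000 ->
        exit(no_data)
    end,
    After = snapshot(Sock),
    ok = gen_tcp:close(Client),
    ok = gen_tcp:close(Sock),
    ok = gen_tcp:close(Listen),
    {Before, After}.
```

If snapshot(Sock) were logged in the "after 60000" branch of
process_sock_loop and showed {active,false} with an empty mailbox,
that would suggest the first {active,once} was reset somewhere before
any data was delivered. If it showed a different owner than self(),
that would point at the socket handoff instead.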

Why this would be I have no idea, and when we turn those two command
connections off everything goes back to normal and the problem goes
away.

Please let me know if there's any other information I could give that
might help!

 - Brian