[erlang-questions] Building a Non-blocking TCP server using OTP principles

Fri Aug 17 15:04:29 CEST 2007

Samuel,

Thanks for your input.  I reviewed your modifications and have to say 
that there are several problems with this approach.

1. Not doing asynchronous accept and relying on a separate process to 
accept connections may be *dangerous* if not handled properly as it 
introduces race conditions that could potentially block the server 
permanently.

Here's an important quote from the ACE book "C++ Network Programming Vol.1":

"When an acceptor socket is passed to select(), it's marked as "active" 
when a connection is received. Many servers use this event to indicate 
that it's OK to call accept() without blocking. Unfortunately, there's a 
race condition that stems from the asynchronous behavior of TCP/IP. In 
particular, after select() indicates an acceptor socket is active (but 
before accept() is called) a client can close its connection, whereupon 
accept() can block and potentially hang the entire application process. 
To avoid this problem, acceptor sockets should always be set into 
non-blocking mode when used with select()."

This applies to your changes indirectly.  Under the hood, the network 
driver still does an asynchronous accept, so the paragraph above 
doesn't apply at the driver level.  However, the acceptor process may 
fail between these two lines in init/1:

         {ok, Ref} = create_acceptor(Listen_socket),
         {ok, #state{listener = Listen_socket,
                     acceptor = Ref,
                     module   = Module}};

for various reasons.  Although the acceptor is linked to the listener, 
the {'EXIT', Pid, Reason} message is currently not handled (even though 
trap_exit is turned on), so the listener will wait forever for an 
acceptor that no longer exists.
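
To make the missing clause concrete, here is a minimal sketch of a listener that respawns its acceptor whenever it exits.  The module name respawn_listener, the info/1 helper and the accepted counter are made up for illustration and are not taken from the attached sources; the accepted socket is simply closed where a real server would start a connection handler.

```erlang
-module(respawn_listener).
-behaviour(gen_server).

-export([start_link/1, info/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-record(state, {listener, acceptor, accepted = 0}).

start_link(Port) ->
    gen_server:start_link(?MODULE, Port, []).

%% Hypothetical inspection helper: {Port, AcceptorPid, AcceptedCount}.
info(Server) ->
    gen_server:call(Server, info).

init(Port) ->
    %% Needed so that acceptor exits arrive as messages in handle_info/2.
    process_flag(trap_exit, true),
    {ok, Listener} = gen_tcp:listen(Port, [binary, {active, false},
                                           {reuseaddr, true}]),
    {ok, #state{listener = Listener, acceptor = spawn_acceptor(Listener)}}.

%% One linked process per accept: it hands the socket over and then exits.
spawn_acceptor(Listener) ->
    Self = self(),
    spawn_link(fun() ->
        {ok, Socket} = gen_tcp:accept(Listener),
        ok = gen_tcp:controlling_process(Socket, Self),
        ok = gen_server:call(Self, {accept, Socket})
    end).

handle_call({accept, Socket}, _From, #state{accepted = N} = State) ->
    %% Placeholder: a real server would start a connection handler here.
    ok = gen_tcp:close(Socket),
    {reply, ok, State#state{accepted = N + 1}};
handle_call(info, _From, State) ->
    #state{listener = L, acceptor = A, accepted = N} = State,
    {ok, Port} = inet:port(L),
    {reply, {Port, A, N}, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% The acceptor exited (normally after a hand-off, or abnormally):
%% start a fresh one so the listener never waits on a dead process.
handle_info({'EXIT', Pid, _Reason},
            #state{acceptor = Pid, listener = L} = State) ->
    {noreply, State#state{acceptor = spawn_acceptor(L)}};
handle_info(_Other, State) ->
    {noreply, State}.
```

Note that this respawns unconditionally, so on its own it does nothing about an acceptor that keeps dying for the same reason.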

The same can happen if the acceptor process dies anywhere in the middle 
of the F() function:

     F = fun() ->
                 {ok, Socket} = gen_tcp:accept(Listener),
                 gen_tcp:controlling_process(Socket, Self),
                 gen_server:call(Self, {accept, Socket})
         end,

As mentioned above, this can likely be fixed by handling the 
{'EXIT', Pid, Reason} message properly and respawning the acceptor when 
it arrives.  This, however, presents another challenge: if the system 
runs out of file descriptors, your listener process will be stuck in an 
unhappy mode of constantly respawning acceptors that die because of 
this line:

                 {ok, Socket} = gen_tcp:accept(Listener)

So you would need to track how many accept failures occurred in the 
last several seconds and do some intelligent recovery.  This would 
complicate the code quite a bit.
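
As a rough illustration of that bookkeeping, the sketch below counts acceptor failures in a sliding time window and tells the caller whether to respawn immediately or back off.  The module accept_throttle, its functions and the respawn/backoff atoms are hypothetical names invented for this example; the window length and failure limit are whatever the caller chooses.

```erlang
-module(accept_throttle).
-export([new/2, failure/2]).

%% Hypothetical helper: remembers the timestamps of recent acceptor failures.
-record(throttle, {window_ms, max, stamps = []}).

%% WindowMs: how far back to look.  Max: failures tolerated in that window.
new(WindowMs, Max) when WindowMs > 0, Max > 0 ->
    #throttle{window_ms = WindowMs, max = Max}.

%% Record a failure at NowMs (milliseconds from any monotonic clock).
%% Returns {respawn, T} while at or under the limit, {backoff, T} above it.
failure(NowMs, #throttle{window_ms = W, max = Max, stamps = S} = T) ->
    %% Forget failures that have fallen out of the window.
    Recent = [X || X <- S, NowMs - X < W],
    Stamps = [NowMs | Recent],
    T1 = T#throttle{stamps = Stamps},
    case length(Stamps) > Max of
        true  -> {backoff, T1};
        false -> {respawn, T1}
    end.
```

A listener could call accept_throttle:failure(erlang:monotonic_time(millisecond), T) from its {'EXIT', Pid, Reason} clause, spawn a new acceptor on respawn, and on backoff schedule a retry with erlang:send_after/3 instead of spinning.  What the right recovery is (waiting, shedding idle connections, alarming) remains a design decision this sketch does not answer.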

2. This new process is not OTP compliant - no supervisors know about it 
and it doesn't process debug and system messages as per "6.2 Special 
Processes" of Design Principles.  This means that you may have problems 
when you upgrade your system dynamically.
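
For reference, an acceptor written as a special process would look roughly like the sketch below, which uses proc_lib and sys as described in that chapter.  The module name special_acceptor, the {accepted, Pid, Socket} hand-off message and the 200 ms accept timeout are assumptions made for this example.  The timeout is a compromise: a process blocked in gen_tcp:accept/1 cannot service system messages, so this version polls.

```erlang
-module(special_acceptor).
-export([start_link/1, init/2]).
-export([system_continue/3, system_terminate/4, system_code_change/4]).

%% Start under a supervisor (or any parent) that owns the listen socket.
start_link(Listener) ->
    proc_lib:start_link(?MODULE, init, [self(), Listener]).

init(Parent, Listener) ->
    Deb = sys:debug_options([]),
    proc_lib:init_ack(Parent, {ok, self()}),
    loop(Parent, Deb, Listener).

loop(Parent, Deb, Listener) ->
    receive
        {system, From, Request} ->
            %% Suspend/resume, code change, debug: delegated to sys.
            sys:handle_system_msg(Request, From, Parent, ?MODULE, Deb, Listener);
        {'EXIT', Parent, Reason} ->
            %% Only delivered as a message if this process traps exits.
            exit(Reason);
        _Other ->
            loop(Parent, Deb, Listener)
    after 0 ->
        accept_once(Parent, Listener),
        loop(Parent, Deb, Listener)
    end.

%% Poll for one connection; pending connections wait in the listen backlog.
accept_once(Parent, Listener) ->
    case gen_tcp:accept(Listener, 200) of
        {ok, Socket} ->
            ok = gen_tcp:controlling_process(Socket, Parent),
            Parent ! {accepted, self(), Socket},   % hypothetical hand-off
            ok;
        {error, timeout} ->
            ok;
        {error, Reason} ->
            exit({accept_failed, Reason})
    end.

%% Callbacks required by sys:handle_system_msg/6.
system_continue(Parent, Deb, Listener) ->
    loop(Parent, Deb, Listener).

system_terminate(Reason, _Parent, _Deb, _Listener) ->
    exit(Reason).

system_code_change(Listener, _Module, _OldVsn, _Extra) ->
    {ok, Listener}.
```

With this in place the process can be listed in a supervisor's child specification and responds to sys:suspend/1 and sys:resume/1.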

These are some of the reasons I put together this tutorial: to show 
how to avoid such problems altogether.  :-)

I hope you will find this feedback useful.

Regards,

Serge

Samuel Tesla wrote:
> Serge,
> 
> I really got a lot from your guide on building TCP servers.  I really
> appreciate the work you put into it.  I think I've got an improvement that
> you may want to consider putting up on the website.
> 
> I wanted to read documentation for prim_inet:async_accept/2 so I could
> figure out what that -1 was for, and couldn't find any documentation.  So, I
> Googled and discovered that there is no documentation on purpose (
> http://www.trapexit.org/forum/viewtopic.php?p=29157).  Basically, it's not a
> guaranteed API between versions, whereas gen_tcp is.  So, I set out to see
> if I could use gen_tcp:accept/1 instead of prim_inet:async_accept/2, and I
> was successful.
> 
> I copied your source off the website and then made modifications.  I only
> had to change the listener and the FSM modules, and I've attached the
> altered source files.  The gist of what I did was spawn a linked process
> which does the accept, and then sends a call back to the main listener
> process.  The whole sequence has to be synchronous until the FSM gets
> into WAIT_FOR_DATA, or the socket will disconnect and you'll start
> getting posix errors.
> 
> There were a few other things I cleaned up or changed:
>  * You don't need to copy socket options, as accept/1 does that.
>  * You don't need to call gen_tcp:close/1 in terminate/2 as the listening
> socket will close when its controlling process exits.
>  * I set {packet, 0} as I was testing with a raw telnet session.
> 
> I hope you find this helpful!
> 
> -- Samuel
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: tcp_listener.erl
Type: application/octet-stream
Size: 5729 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20070817/f284b0fb/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tcp_echo_fsm.erl
Type: application/octet-stream
Size: 6086 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20070817/f284b0fb/attachment-0001.obj>