[erlang-questions] eep: New gen_stream module

Jay Nelson jay@REDACTED
Mon Dec 10 07:20:00 CET 2007


EEP: XXX
Title: gen_stream behaviour
Version: $Revision: 14 $
Last-Modified: $Date: 2007-12-10 07:17:01 +0200 (Mon, 10 Dec 2007) $
Author: Jay Nelson <jay at duomark.com>
Status: Draft
Type: Standards Track
Content-Type: text/plain
Created: 09-Dec-2007
Erlang-Version: R12B-2
Post-History: 09-Dec-2007


Abstract

     An optimized behaviour module is needed to simplify the handling
     of large streams of (typically binary) data for application
     developers.


Specification

     Module name:
         gen_stream

     Implementation:
         A gen_server which delivers "chunks" of the stream in an
         efficient manner so that line-oriented processing or the
         handling of streams much bigger than memory (possibly even
         infinite) may be absorbed by an application.

     Behaviour callbacks:
         start, start_link as in gen_server

         init(Args, Options) -> Same as gen_server plus list of Options:

             {stream, {file, path_to_file()} |
                      {binary, binary()} |
                      {behaviour, atom(), ExtraArgs}}

             {chunk_size, integer()}       returned sub-binary size,
                                           default is ~8K
             {chunks_per_proc, integer()}  num of internal chunks,
                                           default is 1
             {circular, false | true}      whether stream repeats,
                                           default is false
             {num_processes, integer()}    num_processes used,
                                           default is 1

         next_chunk(Server::pid()) -> binary() | end_of_stream
         pct_complete(Server::pid()) -> integer() | atom()
         stream_size(Server::pid()) -> integer() | atom()
         stream_pos(Server::pid()) -> integer()
         stop(Server::pid()) -> ok
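
     The option list above can be checked and filled in with its
     documented defaults before the server starts.  The following is
     a minimal sketch only: it assumes "~8K" means 8192 bytes, and
     the module name stream_opts_sketch and the function normalize/1
     are hypothetical names that are not part of this proposal.

```erlang
-module(stream_opts_sketch).
-export([normalize/1]).

%% Hypothetical helper (not part of the proposal): return the option
%% list with every documented default filled in, or an error when the
%% mandatory stream option is missing.
normalize(Options) when is_list(Options) ->
    case proplists:get_value(stream, Options) of
        undefined ->
            {error, missing_stream_option};
        Stream ->
            {ok, [{stream, Stream},
                  %% "~8K" is assumed here to mean 8192 bytes.
                  {chunk_size,
                   proplists:get_value(chunk_size, Options, 8192)},
                  {chunks_per_proc,
                   proplists:get_value(chunks_per_proc, Options, 1)},
                  {circular,
                   proplists:get_value(circular, Options, false)},
                  {num_processes,
                   proplists:get_value(num_processes, Options, 1)}]}
    end.
```

     For example, normalize([{stream, {binary, <<"abc">>}}]) yields a
     complete option list, while normalize([]) reports the missing
     stream option.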

     Usage:
         Client starts the gen_stream by providing at least a stream
         option.  The stream option indicates whether the source of the
         stream is a file, a binary or a behaviour module.  When using
         a socket, port or other source, the client needs to implement
         the behaviour to feed the buffers on demand.
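
         A client would typically drain the stream by asking for
         chunks until end_of_stream is returned.  The loop below is a
         sketch of that pattern, written against a fun so that it
         runs without gen_stream.  The module stream_fold_sketch,
         fold/3, demo/0 and the stand-in source process are all
         hypothetical; with the proposed API the fun would presumably
         wrap a call to next_chunk(Server).

```erlang
-module(stream_fold_sketch).
-export([fold/3, demo/0]).

%% Call NextChunk() until it returns end_of_stream, folding every
%% binary chunk into the accumulator with Fun(Chunk, Acc).
fold(NextChunk, Fun, Acc) ->
    case NextChunk() of
        end_of_stream ->
            Acc;
        Chunk when is_binary(Chunk) ->
            fold(NextChunk, Fun, Fun(Chunk, Acc))
    end.

%% Stand-in for a stream server: a process that hands out the chunks
%% of a fixed list, then end_of_stream.  It only imitates the
%% proposed next_chunk/1 contract for the purpose of this sketch.
demo() ->
    Source = spawn(fun() -> serve([<<"ab">>, <<"cd">>, <<"e">>]) end),
    Next = fun() ->
                   Source ! {next, self()},
                   receive {chunk, C} -> C end
           end,
    %% Count the total number of bytes seen in the stream.
    fold(Next, fun(Chunk, Total) -> Total + byte_size(Chunk) end, 0).

serve([]) ->
    receive {next, From} -> From ! {chunk, end_of_stream} end;
serve([C | Rest]) ->
    receive {next, From} -> From ! {chunk, C}, serve(Rest) end.
```

         Because the fold only ever holds one chunk at a time, the
         same loop applies to streams much larger than memory.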

Motivation

     There are many ways to get binary data into an Erlang node, but
     historically the recommendation has been to convert the data to
     a list before processing it.  There are many situations
     where leaving the binary data in its original form is preferable
     for space or conversion efficiency reasons (e.g., when merely
     filtering data in a relaying router process or when performing
     statistics on raw stream data).  Providing a gen_server idiom
     makes the default approach to processing a binary stream an
     abstraction that is closer to an application developer's view of
     the problem solution.

     The recent Wide Finder project [1] challenged the Erlang
     community by highlighting the slowness of standard I/O functions,
     forcing developers to use raw binary handling.  This approach
     seems to be a common need in web service applications, yet it is
     quite easy to do in a very inefficient manner.  Providing a
     reference implementation that exposes a simpler behaviour
     interface would increase the class of problems that Erlang can
     solve in the hands of beginning to intermediate developers.  It
     would also push implementers in the direction of an OTP-compliant
     application without sacrificing efficiency.

     In addition, there has been a call on the email list for a
     string_stream implementation so that a buffer of data (e.g., an
     SMTP message, HTTP request, HTML page, multi-record socket
     protocol packet, raw text database, comma-delimited file, etc.)
     could be treated as a stream of binary elements rather than a
     single block of data.

     Finally, testing systems often need a generative source of data
     that can be replayed or repeated in a precise manner to trigger a
     fault or to test a patch for it.  The circular binary stream
     allows infinite streams of generative data, and the behaviour
     stream allows a functionally generated stream of data to be
     emitted.


Rationale

     Two idioms are common when efficiently handling a binary data
     source, but few of their standard implementations are
     OTP-compliant:

         1) "Chunking" the data to smaller sub-binaries
         2) Buffering the chunks for efficient I/O

     A gen_server implementation seemed the most straightforward
     way to provide OTP-compliant chunking of a serial stream.  A
     behaviour was created so that streams could be computed and
     generated rather than requiring a pre-constructed file or
     binary as a source.
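
     The chunking idiom can be sketched with plain binary pattern
     matching.  This is an illustration under stated assumptions, not
     the reference implementation: the module chunk_sketch and the
     function chunks/2 are hypothetical names, and the whole source
     binary is assumed to be in memory already.

```erlang
-module(chunk_sketch).
-export([chunks/2]).

%% Split Bin into a list of binaries of at most Size bytes each.
%% The final chunk may be shorter than Size.
chunks(Bin, Size) when is_binary(Bin), is_integer(Size), Size > 0 ->
    chunks(Bin, Size, []).

chunks(<<>>, _Size, Acc) ->
    lists:reverse(Acc);
chunks(Bin, Size, Acc) when byte_size(Bin) =< Size ->
    %% Last (possibly short) chunk.
    lists:reverse([Bin | Acc]);
chunks(Bin, Size, Acc) ->
    %% Matching a fixed-size prefix avoids converting the data to a
    %% list; the remainder is carried forward as a binary.
    <<Chunk:Size/binary, Rest/binary>> = Bin,
    chunks(Rest, Size, [Chunk | Acc]).
```

     A server built on this idea would hand out one chunk per request
     rather than building the whole list, which is what makes
     buffering and very large sources practical.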


Reference Implementation

     A working version is available at the DuoMark Website [2].


References

     [1] Tim Bray's weblog
         http://www.tbray.org/ongoing/

     [2] http://www.duomark.com/erlang/proposals/gen_stream.html

Copyright

     This document is released to the public domain.



Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End:



