[erlang-questions] eep: New gen_stream module

Per Gustafsson per.gustafsson@REDACTED
Mon Dec 10 13:21:05 CET 2007


This is a very good idea, but there are some additional things that I 
would like to have, and it is unclear to me how some parts would work. 
My comments are inline in the EEP text.

Jay Nelson wrote:
> EEP: XXX
> Title: gen_stream behaviour
> Version: $Revision: 14 $
> Last-Modified: $Date: 2007-12-10 07:17:01 +0200 (Mon, 10 Dec 2007) $
> Author: Jay Nelson <jay at duomark.com>
> Status: Draft
> Type: Standards Track
> Content-Type: text/plain
> Created: 09-Dec-2007
> Erlang-Version: R12B-2
> Post-History: 09-Dec-2007
> 
> 
> Abstract
> 
>      An optimized behaviour module is needed to simplify the handling of
>      large streams of (typically binary) data for application developers.
> 
> 
> Specification
> 
>      Module name:
>          gen_stream
> 
>      Implementation:
>          A gen_server which delivers "chunks" of the stream in an
>          efficient manner so that line-oriented processing or the
>          handling of streams much bigger than memory (possibly even
>          infinite) may be absorbed by an application.
> 
>      Behaviour callbacks:

These are not really the behaviour callbacks but rather the interface 
to the gen_stream module. I was a little confused by this at first, 
but the code seems to indicate that the actual callbacks for a 
gen_stream behaviour are:

init/3,
terminate/1,
stream_length/0,
stream_length/1,
extract_block/3,
extract_split_block/4,
extract_final_block/3,
inc_progress/2

I guess that the EEP also needs to define what these functions should 
do, to make it possible to write gen_stream behaviours.


>          start, start_link as in gen_server
> 
>          init(Args, Options) -> Same as gen_server plus list of Options:
> 
>              {stream, {file, path_to_file()} |
>                       {binary, binary()} |
>                       {behaviour, atom(), ExtraArgs}}

I think it would be nice to add a fourth lightweight option:

{generator, fun(() -> {binary(), fun()} | end_of_stream)}

That is, a fun which returns either a binary together with a new fun 
that will produce the next chunk, or an end_of_stream marker. This 
might not fit with the OTP framework, though.
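
To make the generator idea concrete, here is a minimal sketch of such a 
fun built over an in-memory binary. The module name gen_sketch and the 
functions from_binary/2 and to_list/1 are made up for illustration and 
are not part of the proposed gen_stream API.

```erlang
-module(gen_sketch).
-export([from_binary/2, to_list/1]).

%% Build a generator over Bin that yields chunks of at most Size bytes.
%% A generator is a zero-arity fun returning {Chunk, NextGen} or the
%% atom end_of_stream, following the shape suggested above.
from_binary(Bin, Size) when is_binary(Bin), is_integer(Size), Size > 0 ->
    fun() -> next(Bin, Size) end.

next(<<>>, _Size) ->
    end_of_stream;
next(Bin, Size) when byte_size(Bin) =< Size ->
    %% Last (possibly short) chunk; the follow-up generator is exhausted.
    {Bin, fun() -> end_of_stream end};
next(Bin, Size) ->
    <<Chunk:Size/binary, Rest/binary>> = Bin,
    {Chunk, fun() -> next(Rest, Size) end}.

%% Drain a generator into a list of chunks (finite streams only).
to_list(Gen) ->
    case Gen() of
        end_of_stream -> [];
        {Chunk, Next} -> [Chunk | to_list(Next)]
    end.
```

Because each call returns the next fun explicitly, the consumer holds 
all the state and no server process is needed, which is what makes 
this option lightweight.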

> 
>              {chunk_size, integer()} returned sub-binary size, default is ~8K

It would be nice to have a chunk terminator, such as newline, rather 
than an explicit size. Or would this be implemented using a gen_stream 
behaviour?
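
As a rough sketch of what terminator-based chunking could look like on 
a buffered binary, assuming an OTP release that ships the binary module 
(it postdates this thread). The module line_sketch and its functions 
are hypothetical names, not anything in the reference implementation.

```erlang
-module(line_sketch).
-export([take_line/1, lines/1]).

%% Split off the first newline-terminated chunk of Bin.
%% Returns {Line, Rest} with the newline removed, or
%% {incomplete, Bin} when the buffer holds no newline yet.
take_line(Bin) when is_binary(Bin) ->
    case binary:split(Bin, <<"\n">>) of
        [Line, Rest] -> {Line, Rest};
        [_NoNewline] -> {incomplete, Bin}
    end.

%% Break a whole buffer into lines; a trailing fragment without a
%% newline is returned as the final element.
lines(Bin) ->
    case take_line(Bin) of
        {incomplete, <<>>} -> [];
        {incomplete, Tail} -> [Tail];
        {Line, Rest} -> [Line | lines(Rest)]
    end.
```

A server doing this would have to keep the incomplete tail and prepend 
it to the next block it reads, since a line can straddle two blocks.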

>              {chunks_per_proc, integer()} num of internal chunks, default is 1
>              {circular, false | true} whether stream repeats, default is false
>              {num_processes, integer()} num_processes used, default 1
> 

It is not clear to me what this means. Is this the number of processes 
which will communicate with the server or the number of processes that 
the server will spawn?

>          next_chunk(Server::pid()) -> binary() | end_of_stream


>          pct_complete(Server::pid()) -> integer() | atom()
>          stream_size(Server::pid()) -> integer() | atom()
Can these return any atom, or only specific ones, e.g. 'infinite' or 'error'?

>          stream_pos(Server::pid()) -> integer()
>          stop(Server::pid()) -> ok
> 
>      Usage:
>          Client starts the gen_stream by providing at least a stream
>          option.  The stream option indicates whether the source of the
>          stream is a file, a binary or a function.  When using a
>          socket, port or other source, the client needs to implement
>          the behavior to feed the buffers on demand.
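
For reference, a client loop over this interface might look roughly 
like the sketch below. It does not call gen_stream itself: Next is a 
zero-arity fun standing in for something like 
fun() -> gen_stream:next_chunk(Server) end, and start_stub/1 is a 
made-up stand-in server so the example is self-contained.

```erlang
-module(consume_sketch).
-export([fold_chunks/3, start_stub/1]).

%% Fold Fun over every chunk produced by Next until end_of_stream.
%% Next returns binary() | end_of_stream, like next_chunk/1 above.
fold_chunks(Fun, Acc, Next) ->
    case Next() of
        end_of_stream ->
            Acc;
        Chunk when is_binary(Chunk) ->
            fold_chunks(Fun, Fun(Chunk, Acc), Next)
    end.

%% Hypothetical stand-in for a stream server: a process that hands out
%% the given chunks one at a time, replies end_of_stream once, then
%% exits. Returns a fun usable as Next in fold_chunks/3.
start_stub(Chunks) ->
    Pid = spawn(fun() -> stub_loop(Chunks) end),
    fun() ->
        Pid ! {next, self()},
        receive {chunk, Pid, Reply} -> Reply end
    end.

stub_loop([]) ->
    receive {next, From} -> From ! {chunk, self(), end_of_stream} end;
stub_loop([Chunk | Rest]) ->
    receive {next, From} -> From ! {chunk, self(), Chunk} end,
    stub_loop(Rest).
```

The client never sees more than one chunk at a time, which is the 
property that would let a stream larger than memory be processed.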
> 
> Motivation
> 
>      There are many ways to get binary data into an erlang node,
>      however, historically it has been recommended that the data be
>      converted to a list and processed.  There are many situations
>      where leaving the binary data in its original form is preferable
>      for space or conversion efficiency reasons (e.g., when merely
>      filtering data in a relaying router process or when performing
>      statistics on raw stream data).  Providing a gen_server idiom
>      makes the default approach to processing a binary stream an
>      abstraction that is closer to an application developer's view of
>      the problem solution.
> 
>      The recent Wide Finder project [1] challenged the erlang community
>      by highlighting the slowness of standard I/O functions, forcing
>      developers to use raw binary handling.  This approach seems to be a
>      common need in web service applications, yet it is quite easy to do
>      in a very inefficient manner.  Providing a reference implementation
>      that exposes a simpler behaviour interface would increase the class
>      of problems that erlang can solve in the hands of beginning to
>      intermediate developers.  It would also push implementers in the
>      direction of an OTP compliant application without sacrificing
>      efficiency.
> 
>      In addition, there has been a call on the email list for a
>      string_stream implementation so that a buffer of data (e.g., an
>      SMTP message, HTTP request, HTML page, multi-record socket protocol
>      packet, raw text database, comma-delimited file, etc.) could be
>      treated as a stream of binary elements rather than a single block
>      of data.
> 
>      Finally, testing systems often need a generative source of data
>      that can be replayed or repeated in a precise manner to trigger a
>      fault or test a patch to same.  The circular binary stream allows
>      infinite streams of generative data, and the behaviour stream
>      allows a functionally generated stream of data to be emitted.
> 
> 
> Rationale
> 
>      There are a few common idioms that are used when efficiently
>      handling a binary data source:
> 
>          1) "Chunking" the data to smaller sub-binaries
>          2) Buffering the chunks for efficient I/O
>          3) Few of the standard idioms are OTP-compliant
> 
>      A gen_server implementation seemed the most straight-forward
>      method for making an OTP-compliant method for chunking a serial
>      stream.  A behaviour was created so that streams could be computed
>      and generated rather than requiring a pre-constructed file or
>      binary as a source.
> 
> 
> Reference Implementation
> 
>      A working version is available at the DuoMark Website [2].
> 
> 
> References
> 
>      [1] Tim Bray's weblog
>          http://www.tbray.org/ongoing/
> 
>      [2] http://www.duomark.com/erlang/proposals/gen_stream.html
> 
> Copyright
> 
>      This document is released to the public domain.
> 
> 
> 
> Local Variables:
> mode: indented-text
> indent-tabs-mode: nil
> sentence-end-double-space: t
> fill-column: 70
> coding: utf-8
> End:
> 



