[erlang-questions] eep: New gen_stream module
Jay Nelson
jay@REDACTED
Mon Dec 10 07:20:00 CET 2007
EEP: XXX
Title: gen_stream behaviour
Version: $Revision: 14 $
Last-Modified: $Date: 2007-12-10 07:17:01 +0200 (Mon, 10 Dec 2007) $
Author: Jay Nelson <jay at duomark.com>
Status: Draft
Type: Standards Track
Content-Type: text/plain
Created: 09-Dec-2007
Erlang-Version: R12B-2
Post-History: 09-Dec-2007
Abstract
An optimized behaviour module is needed to simplify the handling of
large streams of (typically binary) data for application
developers.
Specification
Module name:
gen_stream
Implementation:
A gen_server which delivers "chunks" of the stream in an efficient
manner so that line-oriented processing or the handling of streams
much bigger than memory (possibly even infinite) may be absorbed by
an application.
Behaviour callbacks:
start, start_link as in gen_server
init(Args, Options) -> Same as gen_server plus list of Options:
  {stream, {file, path_to_file()} |
           {binary, binary()} |
           {behaviour, atom(), ExtraArgs}}
  {chunk_size, integer()}       returned sub-binary size, default is ~8K
  {chunks_per_proc, integer()}  number of internal chunks, default is 1
  {circular, false | true}      whether stream repeats, default is false
  {num_processes, integer()}    number of processes used, default is 1
next_chunk(Server::pid()) -> binary() | end_of_stream
pct_complete(Server::pid()) -> integer() | atom()
stream_size(Server::pid()) -> integer() | atom()
stream_pos(Server::pid()) -> integer()
stop(Server::pid()) -> ok
Usage:
Client starts the gen_stream by providing at least a stream
option. The stream option indicates whether the source of the
stream is a file, a binary or a behaviour module (a functionally
generated stream). When using a socket, port or other source, the
client needs to implement the behaviour to feed the buffers on
demand.
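As a rough illustration of the usage described above, the sketch below
is a minimal stand-in, not the reference implementation. The module
name mini_stream, the helper count_bytes/1 and the single-argument
start_link/1 are assumptions made for this example only; it handles
just the {binary, binary()} source and the chunk_size option, and
ignores chunks_per_proc, circular and num_processes.

```erlang
-module(mini_stream).
-behaviour(gen_server).

%% Hypothetical stand-in for the proposed gen_stream API (binary source only).
-export([start_link/1, next_chunk/1, stream_pos/1, stop/1, count_bytes/1]).
-export([init/1, handle_call/3, handle_cast/2]).

-record(state, {bin, pos = 0, chunk_size}).

%% Options is a property list such as
%% [{stream, {binary, Bin}}, {chunk_size, 4}].
start_link(Options) ->
    gen_server:start_link(?MODULE, Options, []).

next_chunk(Server) -> gen_server:call(Server, next_chunk).
stream_pos(Server) -> gen_server:call(Server, stream_pos).
stop(Server)       -> gen_server:call(Server, stop).

%% Client-side loop: pull chunks until end_of_stream, summing their sizes.
count_bytes(Server) -> count_bytes(Server, 0).

count_bytes(Server, Acc) ->
    case next_chunk(Server) of
        end_of_stream ->
            Acc;
        Chunk when is_binary(Chunk) ->
            count_bytes(Server, Acc + byte_size(Chunk))
    end.

init(Options) ->
    {binary, Bin} = proplists:get_value(stream, Options),
    %% 8192 is an assumed reading of the "~8K" default.
    ChunkSize = proplists:get_value(chunk_size, Options, 8192),
    {ok, #state{bin = Bin, chunk_size = ChunkSize}}.

%% Nothing left to deliver.
handle_call(next_chunk, _From, #state{bin = Bin, pos = Pos} = S)
  when Pos >= byte_size(Bin) ->
    {reply, end_of_stream, S};
%% Deliver the next chunk; the last one may be shorter than chunk_size.
handle_call(next_chunk, _From,
            #state{bin = Bin, pos = Pos, chunk_size = Size} = S) ->
    Len = min(Size, byte_size(Bin) - Pos),
    Chunk = binary:part(Bin, Pos, Len),
    {reply, Chunk, S#state{pos = Pos + Len}};
handle_call(stream_pos, _From, #state{pos = Pos} = S) ->
    {reply, Pos, S};
handle_call(stop, _From, S) ->
    {stop, normal, ok, S}.

handle_cast(_Msg, S) ->
    {noreply, S}.
```

A client would start the server with a stream option, call
next_chunk/1 repeatedly until it returns end_of_stream, and then call
stop/1. A file or behaviour source would replace only the part of the
server that fills the buffer; the client loop would stay the same.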
Motivation
There are many ways to get binary data into an Erlang node;
however, historically it has been recommended that the data be
converted to a list and processed. There are many situations
where leaving the binary data in its original form is preferable
for space or conversion efficiency reasons (e.g., when merely
filtering data in a relaying router process or when computing
statistics on raw stream data). Providing a gen_server idiom
makes the default approach to processing a binary stream an
abstraction that is closer to an application developer's view of
the problem solution.
The recent Wide Finder project [1] challenged the Erlang community
by highlighting the slowness of standard I/O functions, forcing
developers to use raw binary handling. This approach seems to be a
common need in web service applications, yet it is quite easy to do
in a very inefficient manner. Providing a reference implementation
that exposes a simpler behaviour interface would increase the class
of problems that Erlang can solve in the hands of beginning to
intermediate developers. It would also push implementers in the
direction of an OTP-compliant application without sacrificing
efficiency.
In addition, there has been a call on the email list for a
string_stream implementation so that a buffer of data (e.g., an SMTP
message, HTTP request, HTML page, multi-record socket protocol
packet, raw text database, comma-delimited file, etc.) could be
treated as a stream of binary elements rather than a single block of
data.
Finally, testing systems often need a generative source of data that
can be replayed or repeated in a precise manner to trigger a fault or
to test a patch for it. The circular binary stream allows infinite
streams of generative data, and the behaviour stream allows a
functionally generated stream of data to be emitted.
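The idea of a circular stream can be sketched as a pure function that
wraps around to the start of the binary when it runs out of data. This
is only an illustration under assumed names (circular_sketch and
next_chunk/3 are hypothetical), and says nothing about how the
reference implementation buffers its data.

```erlang
-module(circular_sketch).
-export([next_chunk/3]).

%% Return {Chunk, NewPos}: exactly Size bytes read from Bin starting at
%% Pos, wrapping to the start of Bin whenever the end is reached.
%% Bin must be non-empty, otherwise there is nothing to repeat.
next_chunk(Bin, Pos, Size)
  when byte_size(Bin) > 0, Pos >= 0, Size >= 0 ->
    next_chunk(Bin, Pos rem byte_size(Bin), Size, []).

%% All requested bytes collected: join the pieces in order.
next_chunk(_Bin, Pos, 0, Acc) ->
    {iolist_to_binary(lists:reverse(Acc)), Pos};
%% Take as much as is available up to the end of Bin, then wrap.
next_chunk(Bin, Pos, Size, Acc) ->
    Len = min(Size, byte_size(Bin) - Pos),
    Part = binary:part(Bin, Pos, Len),
    next_chunk(Bin, (Pos + Len) rem byte_size(Bin), Size - Len, [Part | Acc]).
```

Because the position is carried explicitly, the same sequence of
chunks can be replayed by starting again from the same position, which
is the property a test harness needs.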
Rationale
There are a few common idioms that are used when efficiently
handling a binary data source:
1) "Chunking" the data to smaller sub-binaries
2) Buffering the chunks for efficient I/O
3) Few of the standard idioms are OTP-compliant
A gen_server implementation seemed the most straightforward way to
provide an OTP-compliant method for chunking a serial stream. A
behaviour was created so that streams could be computed and
generated rather than requiring a pre-constructed file or binary as
a source.
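The "chunking" idiom mentioned above can be sketched with plain binary
pattern matching. The module and function names here (chunk_sketch,
chunks/2) are hypothetical and chosen for illustration; the sketch
builds the whole list of chunks at once, whereas a stream server would
hand them out one at a time.

```erlang
-module(chunk_sketch).
-export([chunks/2]).

%% Split Bin into sub-binaries of at most Size bytes, preserving order.
%% The final chunk may be shorter than Size.
chunks(Bin, Size) when is_binary(Bin), is_integer(Size), Size > 0 ->
    chunks(Bin, Size, []).

chunks(<<>>, _Size, Acc) ->
    lists:reverse(Acc);
chunks(Bin, Size, Acc) ->
    case Bin of
        <<Chunk:Size/binary, Rest/binary>> ->
            chunks(Rest, Size, [Chunk | Acc]);
        Last ->
            %% Fewer than Size bytes remain: emit them as the last chunk.
            chunks(<<>>, Size, [Last | Acc])
    end.
```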
Reference Implementation
A working version is available at the DuoMark Website [2].
References
[1] Tim Bray's weblog
http://www.tbray.org/ongoing/
[2] http://www.duomark.com/erlang/proposals/gen_stream.html
Copyright
This document is released to the public domain.
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End: