Parsing infinite streams style

Fri Mar 5 14:16:07 CET 2004

There are two ways of doing this

	1) pass some complex data structure in the loop
	   and update it to reflect the state of the parsing	

	   loop(State) ->
	      receive
		 {From, Data} ->
		    case parse(Data, State) of
			{done,Result,State1}
			    do something with Result
			    loop(State1);
			{notEnoughData, State1} ->
			    loop(State1)

	     end

	2) Or you can pass the parsing function in the loop

	loop(F) ->
	    receive
		{From, Data} ->
		    case F(Data) of
			{done, Result, F1} ->
				do something with result
				loop(F1);
			{notEnoughData, F1} ->
				loop(F1)
		    end;
		...
	    end.

F is a parsing fun.

I prefer 2) why - it's more general in 1) the parse routine is hard-wired
into the loop. In 2) its a variable. 

In 1) the parse function has two arguments. In 2) it always has one argument

Let me make this clearer with an example.

Suppose we have a infinite stream

like this

begin <Int> end
begin <Int> end

but this can be broken into arbitrary chucks - we can write a reentrant parser
like this:

get_begin("begin " ++ T) -> get_int(T, 0);
get_begin(S)             -> {more, fun(Data) -> get_begin(S ++ D) end}.

get_int(" " ++ T, N) -> get_end(T, N);
get_int([H|T], N)    -> get_int(T, N*10-H-$0);
get_int([], N)       -> {more, fun(S) -> get_int(S, N)}.

get_end("end" ++ T, N) -> {done, N, fun(I) -> get_begin(I)};
get_end(T, N)          -> {more, fun(S) -> get_end(T++S, N) end}.

This way avoids having to restart the parsing at the beginning 
every time.

  Having said  all of this I'd  be rather careful  - the "continuation
passing with funs" method is elegant BUT a bugger to debug if you get
it wrong - continuation passing with explicit data structures is also a
bit of a pain.

  Often "append the new data to  the old and restart at the beginning"
is the  fastest method and  the easiest to  program. The best  case is
when you  design the  protocol yourself -  if you design  the protocol
with  an easy  to  recognize "stop  symbol"  (or use  a prefix  length
counter) then it's probably much quicker to pre-process the data until
you get all you need and then do the parsing.

  Building a  closure to restart from  might cost more  than doing the
data append.

  What I'd  do is write the  simplest "append and  restart method" and
only rewrite it if I needed to.

/Joe

On Thu, 4 Mar 2004, Edmund Dengler wrote:

> Hi all!
> 
> A bit of continuation on my last question concerning style. I have an
> external executable that will be feeding binary data to my Erlang
> processes via a port. While I am parsing the data, I may need more bytes
> to continue (as the source is an "infinite" stream of bytes). What is the
> accepted methodology for doing this:
> 
> (1) Parse what I have, if I don't have enough, cause an error to occur,
> and return to the place I started from (or at least, return as much as I
> could parse, and the remainder). Basically, I am parsing the next "chunk"
> of data, and returning to the start point. Ie:
> 
>   loop
>     receive more bytes (& append to ones we currently have)
>     while parse next chunk is successful
>       call processing function
> 
> If I run out of bytes during parsing, I keep adding onto the list I have.
> _But_, I spend a lot of work reparsing what I have already done. To do
> this efficiently, I would need some kind of continutation mechanism to say
> "here are more bytes, continue where you left off", which I don't believe
> Erlang has.
> 
> (2) Have some kind of lazy semantics of "get more bytes, I need it now".
> 
> If I do (2), I obviously need to pass along functions to call as I match
> each grouping (basically, "process the stuff we have matched so far, for
> each chunk, call the processing function") along with some mechanism to
> fetch more bytes. It also means that at every stage of the parse, I need
> to check to see if I have enough bytes, and if I don't, call the "get more
> bytes, and try again", complicating my code (rather than the simple
> "don't have enough, fail" model of (1)).
> 
> I guess I could start to build a framework ala the FSM or ASN stuff, that
> would do this wrapping for me, though it seems it would be a lot of work
> and in the end would be the correct way (and obviously would allow me to
> specify a DSL that makes specifying the patterns better; definitely the
> approach I would take if using Scheme or Lisp).
> 
> Is there a current style/methodology I should be looking at to do the
> above? What is the "Erlang way"?
> 
> Thanks!
> Ed
>