Streaming Input

Joe Armstrong (AL/EAB) <>
Tue Mar 1 09:29:07 CET 2005


	This is slightly more difficult (not a lot):
you're now talking about re-entrant parsers.

	The general form of these is as follows:

	Assume you have some parsing function F which parses
a binary B. Calling F(B) returns either {done, P, F'}, where P is a parse
and F' is a new parsing function, or {more, F'}, where F' is a new
parsing function to call with the next chunk of data.

	You could call it like this:

	Socket = ...
	F = init(),
	loop(Socket, F).

where

	loop(Socket, F) ->
	   receive
		{tcp,Socket,Bin} ->
		    case F(Bin) of
			{done,Parse, F1} ->
			    ... do something with Parse ...
			    loop(Socket, F1);
			{more, F1} ->
			    loop(Socket, F1)
		    end;
		...
	
Now you have to define F.

Let's give an example. Suppose the input consists of items that are either
delimited by "begin" ... "end" symbols, or written as #N followed by N bytes
(where N is a single byte).

init() -> fun(B) -> top(binary_to_list(B)) end.

top("begin" ++ T) -> collect_body(T, []);
top([$#,N|T])     -> collect_bytes(T, N, []);
top(X)            -> {more, fun(B) -> top(X ++ binary_to_list(B)) end}.

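collect_body("end" ++ T, L) ->
    %% return a new parsing function, since loop/2 expects {done, Parse, F'}
    {done, lists:reverse(L), fun(B) -> top(T ++ binary_to_list(B)) end};
collect_body([H|T], L) -> collect_body(T, [H|L]);
collect_body([], L) -> {more, fun(B) -> collect_body(binary_to_list(B), L) end}.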

etc.
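
For reference, here is one way the "etc." might be filled in. This is only a sketch under assumptions: the module name stream_parse and the helper resume/1 are made up for illustration, and collect_bytes/3 is a guess at what was intended (collect exactly N bytes). It also adds one clause that is not in the code above: collect_body/2 holds back a trailing "e" or "en" so that an "end" marker split across two chunks is still recognised. Note that leftover input after a completed item is only looked at when the next chunk arrives.

```erlang
-module(stream_parse).
-export([init/0]).

%% init/0 returns the first parsing function F.
%% Every F takes a binary and returns {done, Parse, F1} or {more, F1}.
init() -> fun(B) -> top(binary_to_list(B)) end.

%% top/1 decides what kind of item starts here.
%% Anything that is not (yet) recognisable just waits for more input.
top("begin" ++ T) -> collect_body(T, []);
top([$#, N | T])  -> collect_bytes(T, N, []);
top(X)            -> {more, fun(B) -> top(X ++ binary_to_list(B)) end}.

%% Collect characters until "end". A trailing "e" or "en" might be the
%% start of "end", so keep it and retry when more data arrives.
collect_body("end" ++ T, L) ->
    {done, lists:reverse(L), resume(T)};
collect_body(Rest, L) when Rest =:= []; Rest =:= "e"; Rest =:= "en" ->
    {more, fun(B) -> collect_body(Rest ++ binary_to_list(B), L) end};
collect_body([H | T], L) ->
    collect_body(T, [H | L]).

%% Collect exactly N bytes (assumed meaning of "#N").
collect_bytes(T, 0, Acc) ->
    {done, lists:reverse(Acc), resume(T)};
collect_bytes([H | T], N, Acc) ->
    collect_bytes(T, N - 1, [H | Acc]);
collect_bytes([], N, Acc) ->
    {more, fun(B) -> collect_bytes(binary_to_list(B), N, Acc) end}.

%% Hypothetical helper: restart at top/1 with the unconsumed input T.
resume(T) -> fun(B) -> top(T ++ binary_to_list(B)) end.
```

Feeding this the chunks <<"beg">>, <<"inhel">> and <<"loend">> in turn should give {more, _}, {more, _} and then {done, "hello", _}.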

This is pretty simple code. The point to note is the last clause of every group.
	
collect_body([], L) is called when we "run out of stuff to parse" - what do we want to do
then?

Answer: wait for "More" data and then call collect_body(More, L) - that's just what the
last clause says:

	collect_body([], L) ->
		{more, fun(B) -> collect_body(binary_to_list(B), L) end}.

This kind of code is very easy if you just "follow the pattern" and don't think (TM).
BTW you have to get the code right first time - debugging this is not easy if you
make a silly mistake :-)
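
One thing that may make such mistakes easier to find is to drive the parsing function by hand with a list of chunks, no socket involved. The sketch below is an assumption-laden illustration, not part of the code above: the module name feed_demo, the function feed/2, and the toy length-prefixed parser (one byte N followed by N bytes) are all hypothetical. The only thing it relies on is the {done, Parse, F'} / {more, F'} convention.

```erlang
-module(feed_demo).
-export([init/0, feed/2]).

%% feed(F, Chunks) -> {Parses, F1}
%% Push each chunk through the current parsing function and
%% collect every completed parse, in order.
feed(F, Chunks) -> feed(F, Chunks, []).

feed(F, [], Acc) ->
    {lists:reverse(Acc), F};
feed(F, [Chunk | Rest], Acc) ->
    case F(Chunk) of
        {done, Parse, F1} -> feed(F1, Rest, [Parse | Acc]);
        {more, F1}        -> feed(F1, Rest, Acc)
    end.

%% Toy parser (hypothetical): one length byte N, then N payload bytes.
init() -> fun(B) -> len(B) end.

len(<<N, T/binary>>) -> bytes(T, N, <<>>);
len(<<>>)            -> {more, fun(B) -> len(B) end}.

bytes(Bin, N, Acc) when byte_size(Bin) >= N ->
    %% enough data: split off what we need, keep the rest for later
    <<Want:N/binary, Rest/binary>> = Bin,
    {done, <<Acc/binary, Want/binary>>,
     fun(B) -> len(<<Rest/binary, B/binary>>) end};
bytes(Bin, N, Acc) ->
    %% not enough: remember what we have and how much is still missing
    {more, fun(B) -> bytes(B, N - byte_size(Bin), <<Acc/binary, Bin/binary>>) end}.
```

For example, feed_demo:feed(feed_demo:init(), [<<3,"ab">>, <<"c">>, <<2,"xy">>]) should return {[<<"abc">>, <<"xy">>], F} for some new parsing function F. Trying awkward chunk boundaries this way is cheaper than discovering them on a live connection.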

Cheers

/Joe



> -----Original Message-----
> From: orbitz [mailto:]
> Sent: den 28 februari 2005 22:55
> To: Joe Armstrong (AL/EAB)
> Cc: 
> Subject: Re: Streaming Input
> 
> 
> I'm not sure that'll work in my situation necessarily.  In this protocol,
> only some objects have a size specification, and others don't.  The ones
> that don't can be variable size; the protocol uses a prefix/suffix to say
> when decoding should start and end.  Also, I don't know how much I need
> until I've identified what type it is and started extracting it, and since
> some types are variable in size and the protocol uses a suffix to tell me
> when to stop decoding that type, I can't figure out how much I need.
> Perhaps my original idea of figuring out what type it is and then sending
> it to a special extract function for that type is no good?  It seems
> simpler that way since I don't need to keep track of state, but more prone
> to issues since I need to go back to this waiting function every time I
> run out of data but haven't finished decoding my object.
> 
> Thanks
> 
> Joe Armstrong (AL/EAB) wrote:
> 
> >   use binaries - that's what they are for
> >
> >   First write something like this:
> >
> >	extract(BinIn, Need, BinAcc) ->
> >	    Got = size(BinIn),
> >	    if
> >		Got > Need ->
> >		    {Before, After} = split_binary(BinIn, Need),
> >		    Result = concat_binary([BinAcc, Before]),
> >		    {done, Result, After};
> >		Got == Need ->
> >		    Result = concat_binary([BinAcc, BinIn]),
> >		    {done, Result, <<>>};
> >		Got < Need ->
> >		    BinAcc1 = concat_binary([BinAcc, BinIn]),
> >		    {more, Need - Got, BinAcc1}
> >	    end.
> >
> >	<aside>
> >	 Organising the code like this should make it pretty clear
> >	 what's going on, i.e. write the "if" clearly with three branches
> >
> >		if
> >			Got > Need ->
> >				%% too much data - have to split it
> >				...
> >			Got == Need ->
> >				%% exactly right no need to split
> >				...
> >			Got < Need ->
> >				%% not enough. no need to split
> >				...
> >		end
> >	</aside>
> >
> >	Here, in extract(BinIn, Need, BinAcc):
> >		BinIn and BinAcc are binaries
> >		Need is the required block length
> >
> >	If size(BinIn) >= Need we return {done, Bin, After}, where Bin is
> >	the data you need and After is whatever is left over (the block is
> >	split into two chunks if there is too much). Otherwise we return
> >	{more, Need-Got, BinAcc1}, where BinAcc1 is BinAcc with BinIn appended.
> >
> >	BinAcc is a binary accumulator containing all the data received so far.
> >
> >	Then just arrange some code to call this
> >
> >	Cheers
> >
> >	/Joe
> >
> >
> >
> >
> >
> >  
> >
> >>-----Original Message-----
> >>From: 
> >>[mailto:]On Behalf Of orbitz
> >>Sent: den 27 februari 2005 07:49
> >>To: 
> >>Subject: Streaming Input
> >>
> >>
> >>I am working with a protocol where the size of the following block is
> >>told to me so I can just convert the next N bytes to, say, a string.
> >>The problem is though, I'm trying to write this so it handles a stream
> >>properly, so the binary I have could contain all N bytes that I need, or
> >>something less than N. So at first I tried:
> >>
> >>extract_string(Tail, 0, Res) ->
> >>  {ok, {string, Res}, Tail};
> >>extract_string(<<H, Tail/binary>>, Length, Res) ->
> >>  extract_string(Tail, Length - 1, lists:append(Res, [H]));
> >>extract_string(<<>>, Length, Res) ->
> >>  case dispatch_message() of
> >>    {decode, _, Data} ->
> >>      extract_string(Data, Length, Res)
> >>  end.
> >>
> >>When the binary is empty but I still need more data it waits for more.
> >>I don't know if this is the proper idiom (it seems gross to me but I am
> >>unsure of how to do it otherwise).  This is incredibly slow though.
> >>With a long string that I need to extract it takes a lot of CPU and far
> >>too long.  So I decided to do:
> >>
> >>extract_string(Data, Length, _) ->
> >>  <<String:Length/binary, Tail/binary>> = Data,
> >>  {ok, {string, binary_to_list(String)}, Tail}.
> >>
> >>In terms of CPU and time this is much much better, but if I don't have
> >>all N bytes it won't work.  Any suggestions?
> >>
> >>    
> >>
> >
> >
> >
> >  
> >
> 


