[erlang-questions] matching binary against a file

Fri Jun 22 10:19:47 CEST 2007

--- Jeff Rogers <dvrsn@REDACTED> wrote:

> I'm just learning erlang and I'm writing a module to read lucene
> indexes.  One problem I keep running into is that the lucene file format
> appears to be explicitly designed to only be readable one byte at a
> time, which makes binary match operations painful.  One example of this
> is reading UTF-8 encoded data (I'm not worrying about modified utf-8
> just yet).
> 
> Here is what I'm currently using to read a specified count of utf-8
> encoded characters from an in-memory binary:
> 
> read_utf8(C,D) -> read_utf8(C,<<>>,D).
> read_utf8(0,S,Data) -> [S,Data];
> read_utf8(Cnt,S,<< 2#1110:4, C:20, Data/binary >>) ->
>      read_utf8(Cnt-1,<<S/binary,C>>,Data);
> read_utf8(Cnt,S,<< 2#110:3, C:13, Data/binary >>) ->
>      read_utf8(Cnt-1,<<S/binary,C>>,Data);
> read_utf8(Cnt,S,<< 0:1, C:7, Data/binary >>) ->
>      read_utf8(Cnt-1,<<S/binary,C>>,Data).
> 
> Now this is all fine as long as I first read the entire string into
> memory, but because you can't know beforehand how many bytes will be
> read you have to overestimate and then handle whatever is left over.
> Reading the entire multi-megabyte file into a single binary object
> strikes me as inefficient (and may not even be possible in all cases).
>
> So, is there any way to do a binary match against file data directly
> (like you could do with mmap)?  Or am I just taking the wrong approach
> here?

As far as I know, you're stuck with either reading the
whole thing into memory, or reading blocks of data and
writing a more complex conversion function that
handles partial UTF-8 characters, i.e. byte sequences
that are split across a block boundary.

However, note that building ever-larger binaries in a
loop may be very expensive. An operation like
<<S/binary,C>> may mean copying S and C into a new
binary in each iteration (cost = size(S) plus one byte
for C), which makes the total cost quadratic in the
length of the result, since the accumulated S grows with
each iteration. A better approach is to accumulate the
data in a list and convert the final result into a
binary once.
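
To make the list-accumulation idea concrete, here is a
small sketch. It is not from the original code: the
module name acc_sketch and the function keep_ascii are
made up for illustration, and it has not been
benchmarked. It keeps only the 7-bit bytes of a binary,
collecting them in a list and building the result
binary once at the end.

```erlang
-module(acc_sketch).
-export([keep_ascii/1]).

%% Hypothetical example: keep only the 7-bit (ASCII) bytes of Bin.
keep_ascii(Bin) -> keep_ascii(Bin, []).

%% Consing onto the accumulator is constant time per byte, so nothing
%% is copied while looping.
keep_ascii(<<0:1, C:7, Rest/binary>>, Acc) ->
    keep_ascii(Rest, [C | Acc]);
%% Any byte with the high bit set is skipped.
keep_ascii(<<_, Rest/binary>>, Acc) ->
    keep_ascii(Rest, Acc);
%% The accumulator is in reverse order; convert to a binary only once.
keep_ascii(<<>>, Acc) ->
    list_to_binary(lists:reverse(Acc)).
```

The single list_to_binary call at the end replaces the
per-iteration <<S/binary,C>> append.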

I modified your code a bit to address these concerns,
but have, in the tradition of these things, not tested
it; I hope the principles are clear at least. Also,
production code would be a bit more careful.

The first function accumulates the output in a list,
but is otherwise intended to be identical to your
code. The second function, partial_utf8, is basically
the same as the first, but reads more data when it
runs out. 

I hope Yahoo doesn't mess up the line breaks too
badly.

Best,
Thomas

-----------------------------
-module(read_utf).
-compile(export_all).

read_utf8(N,D) -> read_utf8(N, [], D).

read_utf8_bin_out(N, D) ->
    {Str, Rest} = read_utf8(N, D),
    {list_to_binary(Str), Rest}.

%% returns a list of N UTF characters + remaining data
%%
%% - alt: accumulate data as [S,C], then do list_to_binary as final step
%%   (which is less clear, but removes the lists:reverse)

read_utf8(N, S, Data) when N > 0 ->
    NxtN = N-1,
    case Data of
	<< 2#1110:4, C:20, Rest/binary >> ->
	    read_utf8(NxtN, [C|S], Rest);
	<< 2#110:3, C:13, Rest/binary >> ->
	    read_utf8(NxtN, [C|S], Rest);
	<< 0:1, C:7, Rest/binary >> ->
	    read_utf8(NxtN, [C|S], Rest)
    end;
read_utf8(0,S,Data) -> {lists:reverse(S),Data}.

%% This version reads N UTF-8 chars from FD if possible. Some copying is
%% done when Data and MoreData are concatenated.

partial_utf8(N, File) ->
    %% open in binary mode so that file:read/2 returns binaries, not lists
    {ok, FD} = file:open(File, [read, binary]),
    Res = (catch partial_utf8(N, [], <<>>, FD)),
    file:close(FD),
    Res.

-define(blocksize, 4096).

partial_utf8(N, S, Data, FD) when N > 0 ->
    NxtN = N-1,
    case Data of
	<< 2#1110:4, C:20, Rest/binary >> ->
	    partial_utf8(NxtN, [C|S], Rest, FD);
	<< 2#110:3, C:13, Rest/binary >> ->
	    partial_utf8(NxtN, [C|S], Rest, FD);
	<< 0:1, C:7, Rest/binary >> ->
	    partial_utf8(NxtN, [C|S], Rest, FD);
	_ ->
	    case file:read(FD, ?blocksize) of
		{ok, MoreData} ->
		    %% retry call with more data (uses N, not NxtN)
		    partial_utf8(N, S, <<Data/binary, MoreData/binary>>, FD);
		_Err ->
		    exit(malformed_utf8)
	    end

    end;
partial_utf8(0, S, Data, _FD) -> {lists:reverse(S), Data}.
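
One caveat on the bit patterns above: they follow the
original code, so C still includes the 2#10 marker bits
of the continuation bytes and is not the actual code
point. A minimal sketch of a decoder that strips those
bits follows, assuming only 1- to 3-byte sequences as in
the code above. The module name utf8_sketch and the
function decode_char are made up for illustration, and,
like the rest, it does no further validation (overlong
forms, surrogates).

```erlang
-module(utf8_sketch).
-export([decode_char/1]).

%% Hypothetical helper: returns {CodePoint, Rest}, or 'nomatch' if Data
%% does not start with a complete, well-formed 1-, 2- or 3-byte
%% sequence (it may be truncated or simply invalid).

%% 0xxxxxxx
decode_char(<<0:1, C:7, Rest/binary>>) ->
    {C, Rest};
%% 110xxxxx 10xxxxxx
decode_char(<<2#110:3, A:5, 2#10:2, B:6, Rest/binary>>) ->
    {(A bsl 6) bor B, Rest};
%% 1110xxxx 10xxxxxx 10xxxxxx
decode_char(<<2#1110:4, A:4, 2#10:2, B:6, 2#10:2, C:6, Rest/binary>>) ->
    {(A bsl 12) bor (B bsl 6) bor C, Rest};
decode_char(_) ->
    nomatch.
```

Note that code points above 255 decoded this way cannot
be passed through list_to_binary, so the result would
have to stay a list of integers or be re-encoded.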
