[erlang-questions] matching binary against a file
Thomas Lindgren
thomasl_erlang@REDACTED
Fri Jun 22 10:19:47 CEST 2007
--- Jeff Rogers <dvrsn@REDACTED> wrote:
> I'm just learning erlang and I'm writing a module to
> read lucene
> indexes. One problem I keep running into is that
> the lucene file format
> appears to be explicitly designed to only be
> readable one byte at a
> time, which makes binary match operations painful.
> One example of this
> is reading UTF-8 encoded data (I'm not worrying
> about modified utf-8
> just yet).
>
> Here is what I'm currently using to read a specified
> count of utf-8
> encoded characters from an in-memory binary:
>
> read_utf8(C,D) -> read_utf8(C,<<>>,D).
> read_utf8(0,S,Data) -> [S,Data];
> read_utf8(Cnt,S,<< 2#1110:4, C:20, Data/binary >>)
> ->
> read_utf8(Cnt-1,<<S/binary,C>>,Data);
> read_utf8(Cnt,S,<< 2#110:3, C:13, Data/binary >>) ->
> read_utf8(Cnt-1,<<S/binary,C>>,Data);
> read_utf8(Cnt,S,<< 0:1, C:7, Data/binary >>) ->
> read_utf8(Cnt-1,<<S/binary,C>>,Data).
>
> Now this is all fine as long as I first read the
> entire string into
> memory, but because you can't know beforehand how
> many bytes will be
> read you have to overestimate and then handle
> whatever is left over.
> Reading the entire multi-megabyte file into a single
> binary object
> strikes me as inefficient (and may not even be
> possible in all cases).
>
> So, is there any way to do a binary match against
> file data directly
> (like you could do with mmap)? Or am I just taking
> the wrong approach here?
As far as I know, you're stuck with either reading the
whole thing into memory, or reading blocks of data and
writing a more complex conversion function that
handles partial UTF characters.
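A minimal sketch of the block-reading approach, assuming an Erlang/OTP release whose bit syntax supports the utf8 segment type. The module name, function names and block size are made up for illustration, and the file descriptor is assumed to be opened in binary mode. Only the few undecoded bytes left at a block boundary are copied when more data is appended.

```erlang
-module(utf8_blocks).
-export([take/2, read_chars/2]).

-define(BLOCK, 4096).

%% Decode up to N code points from Bin.
%% Returns {Chars, Rest, Missing}: Missing is how many characters could
%% not be decoded because Rest does not start with a complete, valid
%% UTF-8 sequence.
take(N, Bin) -> take(N, Bin, []).

take(0, Rest, Acc) -> {lists:reverse(Acc), Rest, 0};
take(N, <<C/utf8, Rest/binary>>, Acc) -> take(N - 1, Rest, [C | Acc]);
take(N, Rest, Acc) -> {lists:reverse(Acc), Rest, N}.

%% Read N code points from FD (assumed to be opened with [read, binary]).
%% Returns {ok, Chars, Leftover} or {error, Reason}.
read_chars(FD, N) -> read_chars(FD, N, <<>>, []).

read_chars(FD, N, Buf, Acc) ->
    case take(N, Buf) of
        {Chars, Rest, 0} ->
            {ok, lists:append(lists:reverse([Chars | Acc])), Rest};
        {Chars, Rest, Missing} when byte_size(Rest) < 4 ->
            %% Rest may be the start of a character cut off at the block
            %% boundary (a UTF-8 sequence is at most 4 bytes): read more.
            case file:read(FD, ?BLOCK) of
                {ok, More} ->
                    read_chars(FD, Missing, <<Rest/binary, More/binary>>,
                               [Chars | Acc]);
                eof ->
                    {error, unexpected_eof};
                {error, Reason} ->
                    {error, Reason}
            end;
        {_Chars, _Rest, _Missing} ->
            %% Four or more bytes available and still no match.
            {error, malformed_utf8}
    end.
```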
However, note that building ever-larger binaries in a
loop may be very expensive. An operation like
<<S/binary,C>> may mean copying S and C into a new
binary in each iteration (cost = size(S)+size(C)),
which turns the algorithm quadratic or worse as the
accumulated S grows with each iteration. A better
approach is to accumulate the data in a list and
convert the final result into a binary once.
I modified your code a bit to address these concerns,
but have, in the tradition of these things, not tested
it; I hope the principles are clear at least. Also,
production code would be a bit more careful.
The first function accumulates the output in a list,
but is otherwise intended to be identical to your
code. The second function, partial_utf8, is basically
the same as the first, but reads more data when it
runs out.
I hope Yahoo doesn't mess up the line breaks too
badly.
Best,
Thomas
-----------------------------
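-module(read_utf).
-compile(export_all).

read_utf8(N, D) -> read_utf8(N, [], D).

%% Note: list_to_binary/1 only accepts elements in 0..255, so this
%% fails with badarg when Str holds a multi-byte character as
%% extracted by read_utf8/3.
read_utf8_bin_out(N, D) ->
    {Str, Rest} = read_utf8(N, D),
    {list_to_binary(Str), Rest}.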
%% returns a list of N UTF characters + remaining data
%%
%% - alt: accumulate data as [S, C], then do list_to_binary as final step
%%   (which is less clear, but removes the lists:reverse)
%%
%% As in the original, C is the raw bit field after the lead marker
%% (continuation-byte tag bits included), not a decoded code point.
read_utf8(N, S, Data) when N > 0 ->
    NxtN = N - 1,
    case Data of
        << 2#1110:4, C:20, Rest/binary >> ->
            read_utf8(NxtN, [C|S], Rest);
        << 2#110:3, C:13, Rest/binary >> ->
            read_utf8(NxtN, [C|S], Rest);
        << 0:1, C:7, Rest/binary >> ->
            read_utf8(NxtN, [C|S], Rest)
    end;
read_utf8(0, S, Data) -> {lists:reverse(S), Data}.
%% This version reads N UTF-8 chars from FD if possible. Some copying is
%% done when Data and MoreData are concatenated.
partial_utf8(N, File) ->
    %% binary mode, so that file:read/2 returns binaries rather than lists
    {ok, FD} = file:open(File, [read, binary]),
    Res = (catch partial_utf8(N, [], <<>>, FD)),
    file:close(FD),
    Res.
-define(blocksize, 4096).

partial_utf8(N, S, Data, FD) when N > 0 ->
    NxtN = N - 1,
    case Data of
        << 2#1110:4, C:20, Rest/binary >> ->
            partial_utf8(NxtN, [C|S], Rest, FD);
        << 2#110:3, C:13, Rest/binary >> ->
            partial_utf8(NxtN, [C|S], Rest, FD);
        << 0:1, C:7, Rest/binary >> ->
            partial_utf8(NxtN, [C|S], Rest, FD);
        _ ->
            case file:read(FD, ?blocksize) of
                {ok, MoreData} ->
                    %% retry call with more data (uses N, not NxtN)
                    partial_utf8(N, S, <<Data/binary, MoreData/binary>>, FD);
                _Err ->
                    exit(malformed_utf8)
            end
    end;
partial_utf8(0, S, Data, _FD) -> {lists:reverse(S), Data}.
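
The alternative mentioned in the comment on read_utf8/3 (append to the accumulator instead of prepending, then flatten once at the end) could look roughly like the sketch below. It is an illustration rather than part of the module above: the module name and take_bytes/2 are hypothetical, and it copies plain bytes instead of decoding UTF-8 to keep the point visible. An iolist may not have an integer as its tail, so the accumulator is built as [Acc, B] rather than [Acc|B].

```erlang
-module(iolist_acc).
-export([take_bytes/2]).

%% Copy up to N bytes from the front of Bin into a new binary.
%% Returns {Taken, Rest}.
take_bytes(N, Bin) -> take_bytes(N, Bin, []).

take_bytes(N, <<B, Rest/binary>>, Acc) when N > 0 ->
    %% Wrapping the old accumulator in a new two-element list is a
    %% constant-time append; nothing is copied here.
    take_bytes(N - 1, Rest, [Acc, B]);
take_bytes(_N, Rest, Acc) ->
    %% N reached 0 or the input ran out: flatten the nested list once.
    {iolist_to_binary(Acc), Rest}.
```

The nested list keeps the bytes in input order, so no lists:reverse is needed before the final conversion.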