[erlang-questions] matching binary against a file
Jeff Rogers
dvrsn@REDACTED
Sat Jun 23 00:04:33 CEST 2007
Thomas Lindgren wrote:
> As far as I know, you're stuck with either reading the
> whole thing into memory, or reading blocks of data and
> writing a more complex conversion function that
> handles partial UTF characters.
Not the best possible answer, but I can live with it. Also, just
talking it over gave me some better ideas about how to proceed.
BTW: no one pointed it out, but while my original code will correctly
read the requested number of utf-8 characters, it will corrupt anything
above U+007F. The correct reading patterns would be
<< 0:1, C:7, Rest/binary >> -> C
<< 2#110:3, C1:5, 2#10:2, C2:6, Rest/binary >> -> << 0:5, C1:5, C2:6 >>
and so on for the longer sequences.
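For reference, the patterns above extended to the three- and four-byte forms might look like the following. This is only a sketch: the module and function names are made up for illustration, it returns the code point as an integer rather than a binary, and it does no validation of overlong encodings or surrogates.

```erlang
-module(utf8_sketch).
-export([decode_char/1]).

%% Decode one UTF-8 sequence from the front of a binary.
%% Returns {CodePoint, Rest}. Hypothetical helper, no validation.
decode_char(<<0:1, C:7, Rest/binary>>) ->
    {C, Rest};
decode_char(<<2#110:3, C1:5, 2#10:2, C2:6, Rest/binary>>) ->
    {(C1 bsl 6) bor C2, Rest};
decode_char(<<2#1110:4, C1:4, 2#10:2, C2:6, 2#10:2, C3:6, Rest/binary>>) ->
    {(C1 bsl 12) bor (C2 bsl 6) bor C3, Rest};
decode_char(<<2#11110:5, C1:3, 2#10:2, C2:6,
              2#10:2, C3:6, 2#10:2, C4:6, Rest/binary>>) ->
    {(C1 bsl 18) bor (C2 bsl 12) bor (C3 bsl 6) bor C4, Rest}.
```

Each clause strips the length marker from the lead byte and the 2#10 marker from every continuation byte, so only the payload bits end up in the result.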
> However, note that building ever-larger binaries in a
> loop may be very expensive. An operation like
> <<S/binary,C>> may mean copying S and C into a new
> binary in each iteration (cost = size(S)+size(C)),
> which turns the algorithm quadratic or worse as the
> accumulated S grows with each iteration. A better
> approach is to accumulate the data in a list and
> convert the final result into a binary once.
The efficiency guide mentions this for appending to lists but doesn't
specifically mention binaries. But it makes sense that it would apply
there too.
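A small sketch of the accumulate-then-convert approach, as I understand the suggestion. The module and function names are invented for the example, and the ASCII upper-casing is just a stand-in for whatever per-byte conversion is actually needed.

```erlang
-module(acc_sketch).
-export([upcase_ascii/1]).

%% Upper-case the ASCII letters in a binary. Bytes are collected in a
%% reversed list and the result binary is built once at the end,
%% instead of appending to a growing binary on every iteration.
upcase_ascii(Bin) when is_binary(Bin) ->
    upcase_ascii(Bin, []).

upcase_ascii(<<C, Rest/binary>>, Acc) when C >= $a, C =< $z ->
    upcase_ascii(Rest, [C - 32 | Acc]);
upcase_ascii(<<C, Rest/binary>>, Acc) ->
    upcase_ascii(Rest, [C | Acc]);
upcase_ascii(<<>>, Acc) ->
    list_to_binary(lists:reverse(Acc)).
```

Consing onto the head of the list is constant time per byte, so the only full-size copies are the final reverse and the list_to_binary call.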
> read_utf8(N, S, Data) when N > 0 ->
> NxtN = N-1,
> case Data of
> << 2#1110:4, C:20, Rest/binary >> ->
> read_utf8(NxtN, [C|S], Rest);
> << 2#110:3, C:13, Rest/binary >> ->
> read_utf8(NxtN, [C|S], Rest);
> << 0:1, C:7, Rest/binary >> ->
> read_utf8(NxtN, [C|S], Rest)
> end;
> read_utf8(0,S,Data) -> {lists:reverse(S),Data}.
This does raise a style issue I've wondered about: is it better to use
guards or match patterns? Where you have
read_utf8(N, S, Data) when N > 0 -> .... ;
read_utf8(0, S, Data) -> ... .
I would be inclined to use
read_utf8(0, S, Data) -> ... ;
read_utf8(N, S, Data) -> ... .
without using guards. Is one generally preferred, or is it just a style
choice? I assume that the compiler optimizes them to pretty much the
same thing.
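To make the two forms concrete, here is a toy pair of functions written both ways. The names are hypothetical and the list-taking logic is only there to have something to recurse on.

```erlang
-module(clause_sketch).
-export([take_guard/2, take_plain/2]).

%% Guarded form: clause order does not matter, since N > 0 and 0
%% cannot both hold.
take_guard(N, [H | T]) when N > 0 -> [H | take_guard(N - 1, T)];
take_guard(0, _) -> [].

%% Pattern-only form: the 0 clause has to come first, because the
%% second clause would also match N = 0.
take_plain(0, _) -> [];
take_plain(N, [H | T]) -> [H | take_plain(N - 1, T)].
```

For non-negative N the two give the same results. One difference I can see is with a negative N: the guarded version fails with function_clause on the first call, while the pattern-only version keeps recursing and only fails once the list runs out.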
Thanks
-J