[erlang-questions] matching binary against a file

Sat Jun 23 00:04:33 CEST 2007

Thomas Lindgren wrote:

> As far as I know, you're stuck with either reading the
> whole thing into memory, or reading blocks of data and
> writing a more complex conversion function that
> handles partial UTF characters.

Not the best possible answer, but I can live with it.  Also, just 
talking it over gave me some better ideas about how to proceed.

BTW: no one pointed it out, but while my original code will correctly 
read the requested number of utf-8 characters, it will corrupt anything 
above U+007F.  The correct reading patterns would be

<< 0:1, C:7, Rest/binary >> ->  C
<< 2#110:3, C1:5, 2#10:2, C2:6, Rest/binary >> -> << 0:5, C1:5, C2:6 >>
and so on for the longer sequences.
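
A minimal sketch of how "and so on" might be spelled out, assuming the 
standard 1- to 4-byte UTF-8 layouts.  The module name utf8_sketch and the 
function decode_char/1 are hypothetical, and the result is returned as an 
integer code point instead of a binary.  It does not reject overlong 
encodings or surrogates, and malformed or truncated input simply fails 
with function_clause.

```erlang
-module(utf8_sketch).
-export([decode_char/1, read_utf8/2]).

%% Decode one UTF-8 sequence from the front of a binary.
%% Returns {CodePoint, Rest}. No validation beyond the bit layout.
decode_char(<<0:1, C:7, Rest/binary>>) ->
    {C, Rest};
decode_char(<<2#110:3, C1:5, 2#10:2, C2:6, Rest/binary>>) ->
    {(C1 bsl 6) bor C2, Rest};
decode_char(<<2#1110:4, C1:4, 2#10:2, C2:6, 2#10:2, C3:6, Rest/binary>>) ->
    {(C1 bsl 12) bor (C2 bsl 6) bor C3, Rest};
decode_char(<<2#11110:5, C1:3, 2#10:2, C2:6, 2#10:2, C3:6,
              2#10:2, C4:6, Rest/binary>>) ->
    {(C1 bsl 18) bor (C2 bsl 12) bor (C3 bsl 6) bor C4, Rest}.

%% Read N characters, accumulating code points in a list.
%% Returns {CodePoints, RemainingBinary}.
read_utf8(N, Data) -> read_utf8(N, [], Data).

read_utf8(0, Acc, Data) ->
    {lists:reverse(Acc), Data};
read_utf8(N, Acc, Data) when N > 0 ->
    {C, Rest} = decode_char(Data),
    read_utf8(N - 1, [C | Acc], Rest).
```

For example, utf8_sketch:read_utf8(2, <<"a", 16#C3, 16#A9, "z">>) should 
give {[97, 233], <<"z">>}, since the bytes C3 A9 encode U+00E9.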

> However, note that building ever-larger binaries in a
> loop may be very expensive. An operation like
> <<S/binary,C>> may mean copying S and C into a new
> binary in each iteration (cost = size(S)+size(C)),
> which turns the algorithm quadratic or worse as the
> accumulated S grows with each iteration. A better
> approach is to accumulate the data in a list and
> convert the final result into a binary once.

The efficiency guide mentions this cost for appending to lists but says 
nothing specific about binaries.  It makes sense that it would apply 
there too.
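
The suggestion of accumulating in a list and converting once might look 
like the sketch below.  The module acc_sketch and the function 
ascii_only/1 are hypothetical; the filter (keep bytes below 128) is only 
there to give the loop something to do.

```erlang
-module(acc_sketch).
-export([ascii_only/1]).

%% Keep only the bytes below 128, building the result in one step at the end.
ascii_only(Bin) -> ascii_only(Bin, []).

%% Prepend each kept byte to the accumulator (constant cost per step).
ascii_only(<<C, Rest/binary>>, Acc) when C < 128 ->
    ascii_only(Rest, [C | Acc]);
ascii_only(<<_, Rest/binary>>, Acc) ->
    ascii_only(Rest, Acc);
%% Input exhausted: reverse once and convert to a binary once.
ascii_only(<<>>, Acc) ->
    list_to_binary(lists:reverse(Acc)).
```

Each step conses one element onto the list, so the only full copy of the 
data happens in the final list_to_binary/1 call.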

> read_utf8(N, S, Data) when N > 0 ->
>     NxtN = N-1,
>     case Data of
> 	<< 2#1110:4, C:20, Rest/binary >> ->
> 	    read_utf8(NxtN, [C|S], Rest);
> 	<< 2#110:3, C:13, Rest/binary >> ->
> 	    read_utf8(NxtN, [C|S], Rest);
> 	<< 0:1, C:7, Rest/binary >> ->
> 	    read_utf8(NxtN, [C|S], Rest)
>     end;
> read_utf8(0,S,Data) -> {lists:reverse(S),Data}.

This does raise a style issue I've wondered about: is it better to use 
guards or match patterns?  Where you have

read_utf8(N, S, Data) when N > 0 -> .... ;
read_utf8(0, S, Data) -> ... .

I would be inclined to use

read_utf8(0, S, Data) -> ... ;
read_utf8(N, S, Data) -> ... .

without using guards.  Is one generally preferred, or is it just a style 
choice?  I assume that the compiler optimizes them to pretty much the 
same thing.
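
One difference that is independent of what the compiler does: the two 
forms are not equivalent for arguments outside the expected range.  The 
sketch below uses hypothetical functions in a hypothetical module 
clause_sketch to show it.

```erlang
-module(clause_sketch).
-export([count_guarded/1, count_ordered/1]).

%% Guarded form: clause order does not matter, and a negative
%% argument matches no clause, so the call fails with function_clause.
count_guarded(N) when N > 0 -> 1 + count_guarded(N - 1);
count_guarded(0) -> 0.

%% Ordered form: relies on the 0 clause coming first. A negative
%% argument matches the second clause and never reaches 0.
count_ordered(0) -> 0;
count_ordered(N) -> 1 + count_ordered(N - 1).
```

For non-negative integers both return the same value.  With -1, 
count_guarded/1 fails immediately, while count_ordered/1 keeps recursing 
away from 0 and does not terminate normally.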

Thanks
-J