[erlang-questions] list vs binary performarnce, destructuring and consing

Tue Oct 23 19:14:44 CEST 2012

Okay, I think I see what you are getting at. Less copying, more slicing.
I've taken this approach, and designed another mini-test, this time
emulating the parsing task more closely. The first two are the equivalent
of the previous binary and list string copying functions. The third takes
the approach of counting the binary length before finally returning a slice
of it. My approach is a bit simpler than yours Dmitry. I realize that yours
would keep on counting from the beginning of the binary and not return a
Rest of the binary, but since I tested with a simple string "field:value" I
don't think there would be a difference.

Anyway, for this test the counting method was actually slower. Maybe a real
world test where the parser is dealing with longer strings and multiple
tokens

test_bin_parse(Bin, StopperChar) ->
    test_bin_parse(Bin, StopperChar, <<>>).
test_bin_parse(<<Char, Rest/binary>>, Char, Accum) ->
    {Accum, Rest};
test_bin_parse(<<Char, Rest/binary>>, Stopper, Accum) ->
    test_bin_parse(Rest, Stopper, <<Accum/binary, Char>>).

test_string_parse(String, StopperChar) ->
    test_string_parse(String, StopperChar, []).
test_string_parse([Char|_Rest], Char, Accum) ->
    {lists:reverse(Accum), Rest};
test_string_parse([Char|Rest], Stopper, Accum) ->
    test_string_parse(Rest, Stopper, [Char|Accum]).

test_bin_parse2(Bin, StopperChar) ->
    test_bin_parse2(Bin, StopperChar, 0).
test_bin_parse2(Bin, Stopper, Len) ->
    case Bin of
<<_Field:Len/binary>> ->
    undefined;
<<Field:Len/binary, Stopper, Rest/binary>> ->
    {Field, Rest};
_Else ->
    test_bin_parse2(Bin, Stopper, Len+1)
    end.

BTW I couldn't get the binary length to pattern match in a function head,
so had to do it in a case instead.

Erik.

On Tue, Oct 23, 2012 at 8:34 AM, Dmitry Kolesnikov
<dmkolesnikov@REDACTED>wrote:

> Hello,
>
> I believe you should trade-off between memory utilization and performance.
> Yes, lists are faster then binary but requires more memory.
>
> I have figure out that binary parsing might be efficient if you scan it by
> changing index and capture tokens instead of single byte.
> , e.g.:
>
> parse(In, Pos, Len, Line, Fun, Acc0) ->
>
> case In of
>    % end of field
>
>    <<_:Pos/binary, Tkn:Len/binary, ?FIELD_BY,  _/binary>> ->
>          parse(In, Pos + Len + 1, 0, [Tkn | Line], Fun, Acc0);
>    ...
>
>    _ ->
>       % no match increase token
>       parse(In, Pos, Len + 1, Line, Fun, Acc0)
> end;
>
> - Dmitry
>
> On Oct 23, 2012, at 6:11 PM, Erik Pearson wrote:
>
> I used a simple thing like:
>
> test_iter(Mod, Fun, Args, Iters) ->
>     test_iter(Mod, Fun, Args, now(), Iters, Iters).
>
> test_iter(_Mod, _Fun, _Args, Start, Iters, 0) ->
>     Iters/(timer:now_diff(now(), Start)/1000000);
>
> test_iter(Mod, Fun, Args, Start, Iters, CountDown) ->
>     erlang:apply(Mod, Fun, Args),
>     test_iter(Mod, Fun, Args, Start, Iters, CountDown-1).
>
> And was just looking at total iterations per sec. I would just repeat this
> several times until I found a relatively stable reading. Sure, there was
> variation, but I'm looking to simulate the pattern I use in this library,
> which is iterating through and copying many small bits of text (json keys
> and values.)
>
> Since there was such a large difference in overall performance (string
> being more than twice as fast), I didn't feel the need to be more precise
> before posing the question.
>
> E.g.
>
> 20> json:test_iter(json, test_bin_copy, [<<"hi, my name is erik">>], 1000).
> 232018.5614849188
> 21> json:test_iter(json, test_bin_copy, [<<"hi, my name is erik">>], 1000).
> 224870.69934787496
> 22> json:test_iter(json, test_bin_copy, [<<"hi, my name is erik">>], 1000).
> 226193.16896629723
> 23>
> 23> json:test_iter(json, test_string_copy, ["hi, my name is erik"],
> 100000).
> 650732.3993154295
> 24> json:test_iter(json, test_string_copy, ["hi, my name is erik"],
> 100000).
> 608076.4716970806
> 25> json:test_iter(json, test_string_copy, ["hi, my name is erik"],
> 100000).
> 567359.7912115968
>
> Many of the follow up observations and questions have been stimulating, so
> I'm now interested as well in a more detailed analysis.
>
> However, in the end what I'm looking at is the differences in performance
> between list and binary string processing under what I believe is idiomatic
> Erlang for such problems.
>
> Thanks,
>
> Erik.
>
> On Tue, Oct 23, 2012 at 1:07 AM, Martynas Pumputis <martynasp@REDACTED>wrote:
>
>> Hi,
>>
>> Could you show the exact steps of your simulation? Binary version should
>> be faster, because some extra memory allocation is avoided per each
>> iteration and large binaries aren't being copied.
>>
>> Take a look at: http://www.erlang.org/doc/**efficiency_guide/**
>> binaryhandling.html<http://www.erlang.org/doc/efficiency_guide/binaryhandling.html>
>>
>> Martynas
>>
>>
>> On 10/23/2012 12:18 AM, Erik Pearson wrote:
>>
>>> Hi,
>>> I've read from advice given many years ago that processing binaries byte
>>> by byte (e.g. a recursive parser), performance is better using a list to
>>> accumulate the bytes, rather than using binary concatenation. So
>>> [B|Accum] rather than <<Accum/binary, B>>. There seems to be
>>> a consensus  however, on the efficiency of Binaries compared to List
>>> strings.
>>>
>>> My own quick test, which was just to copy a list or binary element by
>>> element, showed much better performance for the list version. The test
>>> was basically to pass an arbitrary string or binary, and copy it some
>>> number of thousands of times, and output the complete copies per second.
>>>
>>> I tried list based accumulation for a binary, using binary destructuring
>>> in the function head, and that sped things up, but it was still slower
>>> than the equivalent list string copy.
>>>
>>> Are there any tips for binaries? Of is this not a good use case for
>>> binaries.
>>>
>>> test_bin_copy(Bin) ->
>>>      test_bin_copy(Bin, <<>>).
>>> test_bin_copy(<<>>, Accum) ->
>>>      Accum;
>>> test_bin_copy(<<Char, Rest/binary>>, Accum) ->
>>>      test_bin_copy(Rest, <<Accum/binary, Char>>).
>>>
>>> test_string_copy(Bin) ->
>>>      test_string_copy(Bin, []).
>>> test_string_copy([], Accum) ->
>>>      lists:reverse(Accum);
>>> test_string_copy([Char|Rest], Accum) ->
>>>      test_string_copy(Rest, [Char|Accum]).
>>>
>>> For what its worth this is part of a json module. The current practice
>>> in json libraries seems to  favor binaries, so I assumed there were
>>> inherent performance advantages. I can imagine, e.g., that an empty
>>> binary would be stored as a modest sized buffer that would be appended
>>> in place until there was a need to expand it or copy (e.g. if an older
>>> version of it was being appended), and that operations on it would be
>>> fast compared to arbitrary consing (which is however highly optimized.)
>>>
>>> I think some of the favoritism for binaries in json libs is because it
>>> makes it easy to differentiate json strings (as erlang binaries) from
>>> json arrays (as erlang lists), but my implementation is using tagged
>>> tuples to contain each json value, so this is not a concern. Of course
>>> there are the memory concerns, but in my tests any memory concerns with
>>> list char size vs binary bytes is erased by the performance gains.
>>>
>>> I'm sure I've put my foot in my mouth at least once, but, anyway, advice
>>> appreciated.
>>>
>>> Thanks,
>>> Erik.
>>>
>>>
>>> ______________________________**_________________
>>> erlang-questions mailing list
>>> erlang-questions@REDACTED
>>> http://erlang.org/mailman/**listinfo/erlang-questions<http://erlang.org/mailman/listinfo/erlang-questions>
>>>
>>>
>>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20121023/1d9fd65f/attachment.htm>