[erlang-questions] Trying to understand the performance impact of binary:split/3

Darach Ennis darach@REDACTED
Wed May 20 13:07:30 CEST 2015


Hi Sergej,

Have you any rough benchmark numbers? Perhaps this is worth
contributing as a replacement to binary:split?

Cheers,

Darach.

On Wed, May 20, 2015 at 11:56 AM, Sergej Jurečko <sergej.jurecko@REDACTED> wrote:

> binary:split is not fast and unfortunately many people do not realize
> that.
> If you want speed, here is an implementation that is made for speed:
> https://github.com/biokoda/bkdcore/blob/master/src/butil.erl#L2359
> https://github.com/biokoda/bkdcore/blob/master/src/butil.erl#L2373
>
> Sergej
>
> On Wed, May 20, 2015 at 12:35 PM, José Valim <jose.valim@REDACTED> wrote:
>
>> Hello folks,
>>
>> At the beginning of the month, someone wrote a blog post comparing data
>> processing between different platforms and languages, one of them being
>> Erlang VM/Elixir:
>>
>> http://blog.dimroc.com/2015/05/07/etl-language-showdown-pt2/
>>
>> After running the experiments, I thought we could do much better. To my
>> surprise, our biggest performance hit was when calling binary:split/3. I
>> have rewritten the code to use only Erlang function calls (to make it
>> clearer for this discussion):
>>
>>
>> https://github.com/josevalim/etl-language-comparison/blob/jv-erl/elixir/lib/map_actor.ex
>>
>> The performance of the Erlang and Elixir variants is the same (a rewrite
>> in plain Erlang gives the same result too). This line is the bottleneck:
>>
>>
>> https://github.com/josevalim/etl-language-comparison/blob/jv-erl/elixir/lib/map_actor.ex#L11
>>
>> In fact, if we move the regular expression check to before the
>> binary:split/3 call, we get the same performance as Go on my machine,
>> meaning that binary:split/3 is making the code at least twice as slow.
>>
>> The binary:split/3 implementation works in two steps: first we find
>> all matches via binary:matches/3 and then we traverse the matches,
>> converting them to binaries with binary:part/3. The binary:part/3 call is
>> the slow piece here.
>>
>> *My question is:* is this expected? Why are binary:split/3 (and
>> binary:part/3) affecting performance so drastically? How can I
>> investigate/understand this further?
>>
>> ## Other bottlenecks
>>
>> The other two immediate bottlenecks are the use of regular expressions
>> and the use of file:read_line/1 instead of loading the whole file into
>> memory. Those were given as hard requirements by the author. Nonetheless,
>> someone wrote an Erlang implementation that removes those bottlenecks too
>> (along with binary:split/3) and the performance is outstanding:
>>
>> https://github.com/dimroc/etl-language-comparison/pull/10/files
>>
>> I have since rewritten the Elixir one and got a similar result.
>> However, I am still puzzled, because using binary:split/3 would have been
>> my first try (instead of relying on match+part) as it leads to cleaner
>> code (imo).
>>
>> Thanks.
>>
>> *José Valim*
>> www.plataformatec.com.br
>> Skype: jv.ptec
>> Founder and Lead Developer
>>
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-questions
>>
>
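
For anyone following the thread, here is a minimal sketch of the two-step
"find matches, then cut parts" approach José describes above, plus a rough
timing helper of the kind Darach asks about. It is not the OTP source of
binary:split/3 and not Sergej's butil code. The module name split_sketch and
the functions split_all/2, compare/3 and repeat/2 are made up for
illustration, and it uses binary:matches/2 rather than the /3 variant.

```erlang
-module(split_sketch).
-export([split_all/2, compare/3]).

%% Split Bin on every occurrence of the non-empty binary Sep.
%% Step 1: binary:matches/2 returns all {Pos, Len} occurrences of Sep.
%% Step 2: binary:part/3 cuts out the piece between consecutive matches.
split_all(Bin, Sep) when is_binary(Bin), is_binary(Sep) ->
    Matches = binary:matches(Bin, Sep),
    cut(Bin, 0, Matches).

%% No matches left: the remainder of the binary is the last piece.
cut(Bin, Start, []) ->
    [binary:part(Bin, Start, byte_size(Bin) - Start)];
%% Emit the piece before this match, then continue after the separator.
cut(Bin, Start, [{Pos, Len} | Rest]) ->
    [binary:part(Bin, Start, Pos - Start) | cut(Bin, Pos + Len, Rest)].

%% Rough wall-clock comparison: run each variant N times and return
%% {MicrosBinarySplit, MicrosSplitAll}. Only a hint, not a real benchmark.
compare(Bin, Sep, N) when is_integer(N), N > 0 ->
    {T1, ok} = timer:tc(fun() ->
        repeat(N, fun() -> binary:split(Bin, Sep, [global]) end)
    end),
    {T2, ok} = timer:tc(fun() ->
        repeat(N, fun() -> split_all(Bin, Sep) end)
    end),
    {T1, T2}.

repeat(0, _Fun) -> ok;
repeat(N, Fun) -> Fun(), repeat(N - 1, Fun).
```

In the shell, split_sketch:split_all(<<"a,b,,c">>, <<",">>) returns
[<<"a">>,<<"b">>,<<>>,<<"c">>], the same list as
binary:split(<<"a,b,,c">>, <<",">>, [global]). The numbers from compare/3
depend on the machine, the input and the OTP release, so treat them as a
starting point for investigation only.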