[erlang-questions] Trying to understand the performance impact of binary:split/3

Sergej Jurečko
Wed May 20 13:29:07 CEST 2015


Well, I'll be damned. I thought I had read somewhere that the binary module was
implemented in Erlang (not as BIFs), which was why it was slow. I guess I never
checked. I take my statements back :)
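
For anyone following along, the two-step shape José describes below (find the
separators with binary:matches, then cut out each piece with binary:part/3) can
be sketched roughly as follows. This is a simplified illustration and not the
OTP source; the module name split_sketch and the functions split_all/2 and
cut/3 are made up for the example, and it uses binary:matches/2 rather than the
/3 variant.

```erlang
-module(split_sketch).
-export([split_all/2]).

%% Hypothetical helper, not the OTP implementation: split Subject on every
%% occurrence of Pattern by first collecting all match positions and then
%% cutting the pieces out with binary:part/3.
split_all(Subject, Pattern) ->
    %% binary:matches/2 returns a list of {Start, Length} tuples.
    Matches = binary:matches(Subject, Pattern),
    cut(Subject, 0, Matches).

%% Pos is the offset where the next piece begins.
cut(Subject, Pos, [{Start, Len} | Rest]) ->
    Piece = binary:part(Subject, Pos, Start - Pos),
    [Piece | cut(Subject, Start + Len, Rest)];
cut(Subject, Pos, []) ->
    %% Whatever follows the last separator (possibly an empty binary).
    [binary:part(Subject, Pos, byte_size(Subject) - Pos)].
```

With this, split_sketch:split_all(<<"a,b,c">>, <<",">>) should return the same
list as binary:split(<<"a,b,c">>, <<",">>, [global]), which makes it easy to
time the two against each other with timer:tc/3.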

Sergej

On Wed, May 20, 2015 at 1:14 PM, José Valim wrote:

> Thank you Sergej.
>
> I have created a branch that uses the split version you mentioned and it
> is 4 times slower than using binary:split/3. Here is the commit that added
> the new implementation:
>
>
> https://github.com/josevalim/etl-language-comparison/commit/e6cf0a35700cef751b1052083ccec5a3c0394648
>
> Thoughts?
>
>
>
> *José Valim*
> www.plataformatec.com.br
> Skype: jv.ptec
> Founder and Lead Developer
>
> On Wed, May 20, 2015 at 12:56 PM, Sergej Jurečko wrote:
>
>> binary:split is not fast and unfortunately many people do not realize
>> that.
>> If you want speed, here is an implementation that is made for speed:
>> https://github.com/biokoda/bkdcore/blob/master/src/butil.erl#L2359
>> https://github.com/biokoda/bkdcore/blob/master/src/butil.erl#L2373
>>
>> Sergej
>>
>> On Wed, May 20, 2015 at 12:35 PM, José Valim wrote:
>>
>>> Hello folks,
>>>
>>> At the beginning of the month, someone wrote a blog post comparing data
>>> processing between different platforms and languages, one of them being
>>> Erlang VM/Elixir:
>>>
>>> http://blog.dimroc.com/2015/05/07/etl-language-showdown-pt2/
>>>
>>> After running the experiments, I thought we could do much better. To my
>>> surprise, our biggest performance hit was when calling binary:split/3. I
>>> have rewritten the code to use only Erlang function calls (to make it
>>> clearer for this discussion):
>>>
>>>
>>> https://github.com/josevalim/etl-language-comparison/blob/jv-erl/elixir/lib/map_actor.ex
>>>
>>> The performance of the Erlang and Elixir variants is the same (rewriting
>>> it in Erlang gives the same result). This line is the bottleneck:
>>>
>>>
>>> https://github.com/josevalim/etl-language-comparison/blob/jv-erl/elixir/lib/map_actor.ex#L11
>>>
>>> In fact, if we move the regular expression check to before the
>>> binary:split/3 call, we get the same performance as Go on my machine,
>>> meaning that binary:split/3 is making the code at least twice as slow.
>>>
>>> The binary:split/3 implementation works in two steps: first we find
>>> all matches via binary:matches/3 and then we traverse the matches,
>>> converting them to binaries with binary:part/3. The binary:part/3 call is
>>> the slow piece here.
>>>
>>> *My question is:* is this expected? Why are binary:split/3 (and
>>> binary:part/3) affecting performance so drastically? How can I
>>> investigate/understand this further?
>>>
>>> ## Other bottlenecks
>>>
>>> The other two immediate bottlenecks are the use of regular expressions
>>> and the use of file:read_line/1 instead of loading the whole file into
>>> memory. Those were given as hard requirements by the author. Nonetheless,
>>> someone wrote an Erlang implementation that removes those bottlenecks too
>>> (along with binary:split/3) and the performance is outstanding:
>>>
>>> https://github.com/dimroc/etl-language-comparison/pull/10/files
>>>
>>> I have since rewritten the Elixir one and got a similar result.
>>> However, I am still puzzled because using binary:split/3 would have been my
>>> first try (instead of relying on match+part) as it leads to cleaner code
>>> (imo).
>>>
>>> Thanks.
>>>
>>> *José Valim*
>>> www.plataformatec.com.br
>>> Skype: jv.ptec
>>> Founder and Lead Developer
>>>
>>> _______________________________________________
>>> erlang-questions mailing list
>>> http://erlang.org/mailman/listinfo/erlang-questions
>>>
>>>
>>
>