[erlang-questions] Trying to understand the performance impact of binary:split/3
José Valim
jose.valim@REDACTED
Wed May 20 13:31:44 CEST 2015
Most of the binary module is implemented as BIFs with the exception of
binary:split/3 and binary:replace/4. So binary:split/3 is indeed written in
pure Erlang although the matches are found with binary:matches/3 (which is
a BIF).
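
To illustrate that division of work (a BIF finds the matches, plain Erlang
cuts out the pieces), here is a minimal sketch. It is not the OTP source:
the module name split_sketch and the functions split_all/2 and cut/3 are
made up for this example, and it calls binary:matches/2 rather than
binary:matches/3 to keep things short.

```erlang
-module(split_sketch).
-export([split_all/2]).

%% Hypothetical helper, not the OTP implementation: split Subject on every
%% occurrence of Pattern, roughly like binary:split(Subject, Pattern, [global]).
split_all(Subject, Pattern) ->
    %% Step 1 (BIF): collect all non-overlapping {Start, Length} matches.
    Matches = binary:matches(Subject, Pattern),
    cut(Subject, 0, Matches).

%% Step 2 (plain Erlang): walk the match list and slice out the parts that
%% lie between consecutive matches with binary:part/3.
cut(Subject, Pos, []) ->
    %% No matches left: the remainder of the binary is the last piece.
    [binary:part(Subject, Pos, byte_size(Subject) - Pos)];
cut(Subject, Pos, [{Start, Len} | Rest]) ->
    [binary:part(Subject, Pos, Start - Pos) | cut(Subject, Start + Len, Rest)].
```

For example, split_sketch:split_all(<<"a,b,,c">>, <<",">>) should return the
same list as binary:split(<<"a,b,,c">>, <<",">>, [global]).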
*José Valim*
www.plataformatec.com.br
Skype: jv.ptec
Founder and Lead Developer
On Wed, May 20, 2015 at 1:29 PM, Sergej Jurečko <sergej.jurecko@REDACTED>
wrote:
> Well I'll be damned. I thought I read somewhere that the binary module was
> implemented in Erlang (not BIFs), which was why it was slow. I guess I never
> checked. I take my statements back :)
>
> Sergej
>
> On Wed, May 20, 2015 at 1:14 PM, José Valim <
> jose.valim@REDACTED> wrote:
>
>> Thank you Sergej.
>>
>> I have created a branch that uses the split version you mentioned and it
>> is 4x slower than using binary:split/3. Here is the commit that added
>> the new implementation:
>>
>>
>> https://github.com/josevalim/etl-language-comparison/commit/e6cf0a35700cef751b1052083ccec5a3c0394648
>>
>> Thoughts?
>>
>> On Wed, May 20, 2015 at 12:56 PM, Sergej Jurečko <
>> sergej.jurecko@REDACTED> wrote:
>>
>>> binary:split is not fast and unfortunately many people do not realize
>>> that.
>>> If you want speed, here is an implementation that is made for speed:
>>> https://github.com/biokoda/bkdcore/blob/master/src/butil.erl#L2359
>>> https://github.com/biokoda/bkdcore/blob/master/src/butil.erl#L2373
>>>
>>> Sergej
>>>
>>> On Wed, May 20, 2015 at 12:35 PM, José Valim <
>>> jose.valim@REDACTED> wrote:
>>>
>>>> Hello folks,
>>>>
>>>> At the beginning of the month, someone wrote a blog post comparing data
>>>> processing between different platforms and languages, one of them being
>>>> Erlang VM/Elixir:
>>>>
>>>> http://blog.dimroc.com/2015/05/07/etl-language-showdown-pt2/
>>>>
>>>> After running the experiments, I thought we could do much better. To my
>>>> surprise, our biggest performance hit was when calling binary:split/3. I
>>>> have rewritten the code to use only Erlang function calls (to make it
>>>> clearer for this discussion):
>>>>
>>>>
>>>> https://github.com/josevalim/etl-language-comparison/blob/jv-erl/elixir/lib/map_actor.ex
>>>>
>>>> The performance of the Erlang and Elixir variants is the same
>>>> (rewriting it in Erlang also gives the same result). This line is the bottleneck:
>>>>
>>>>
>>>> https://github.com/josevalim/etl-language-comparison/blob/jv-erl/elixir/lib/map_actor.ex#L11
>>>>
>>>> In fact, if we move the regular expression check to before the
>>>> binary:split/3 call, we get the same performance as Go on my machine,
>>>> meaning that binary:split/3 is making the code at least twice as slow.
>>>>
>>>> The binary:split/3 implementation is broken in two pieces: first we
>>>> find all matches via binary:matches/3 and then we traverse the matches
>>>> converting them to binaries with binary:part/3. The binary:part/3 call is
>>>> the slow piece here.
>>>>
>>>> *My question is:* is this expected? Why are binary:split/3 (and
>>>> binary:part/3) affecting performance so drastically? How can I
>>>> investigate/understand this further?
>>>>
>>>> ## Other bottlenecks
>>>>
>>>> The other two immediate bottlenecks are the use of regular expressions
>>>> and the use of file:read_line/1 instead of loading the whole file into
>>>> memory. Those were given as hard requirements by the author. Nonetheless,
>>>> someone wrote an Erlang implementation that removes those bottlenecks too
>>>> (along with binary:split/3) and the performance is outstanding:
>>>>
>>>> https://github.com/dimroc/etl-language-comparison/pull/10/files
>>>>
>>>> I have since rewritten the Elixir one and got a similar result.
>>>> However, I am still puzzled because using binary:split/3 would have been my
>>>> first try (instead of relying on match+part) as it leads to cleaner code
>>>> (imo).
>>>>
>>>> Thanks.
>>>>
>>>> _______________________________________________
>>>> erlang-questions mailing list
>>>> erlang-questions@REDACTED
>>>> http://erlang.org/mailman/listinfo/erlang-questions