[erlang-questions] Trying to understand the performance impact of binary:split/3

Wed May 20 12:56:02 CEST 2015

binary:split is not fast, and unfortunately many people do not realize that.
If you want speed, here is an implementation built for it:
https://github.com/biokoda/bkdcore/blob/master/src/butil.erl#L2359
https://github.com/biokoda/bkdcore/blob/master/src/butil.erl#L2373

Sergej

On Wed, May 20, 2015 at 12:35 PM, José Valim <jose.valim@REDACTED> wrote:

> Hello folks,
>
> At the beginning of the month, someone wrote a blog post comparing data
> processing between different platforms and languages, one of them being
> Erlang VM/Elixir:
>
> http://blog.dimroc.com/2015/05/07/etl-language-showdown-pt2/
>
> After running the experiments, I thought we could do much better. To my
> surprise, our biggest performance hit was when calling binary:split/3. I
> have rewritten the code to use only Erlang function calls (to make it
> clearer for this discussion):
>
>
> https://github.com/josevalim/etl-language-comparison/blob/jv-erl/elixir/lib/map_actor.ex
>
> The performance of the Erlang and Elixir variants is the same (a rewrite in
> Erlang also gives the same result). This line is the bottleneck:
>
>
> https://github.com/josevalim/etl-language-comparison/blob/jv-erl/elixir/lib/map_actor.ex#L11
>
> In fact, if we move the regular expression check to before the
> binary:split/3 call, we get the same performance as Go on my machine,
> meaning that binary:split/3 is making the code at least twice as slow.
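
A minimal sketch of the reordering described above: run the regular
expression check first and call binary:split/3 only on lines that match.
The module name, function name, tab separator and the shape of the return
value are assumptions made for illustration; they are not taken from the
linked code.

```erlang
-module(filter_sketch).
-export([maybe_split/2]).

%% Check the line against an already compiled regex (see re:compile/1)
%% and split it only when it matches. Non-matching lines never reach
%% binary:split/3. The tab separator is an assumed field delimiter.
maybe_split(Line, CompiledRe) ->
    case re:run(Line, CompiledRe, [{capture, none}]) of
        match   -> {ok, binary:split(Line, <<"\t">>, [global])};
        nomatch -> skip
    end.
```

With {capture, none}, re:run/3 returns only match or nomatch, so no
sub-binaries are built for lines that are going to be discarded anyway.
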
>
> The binary:split/3 implementation works in two steps: first we find
> all matches via binary:matches/3 and then we traverse the matches,
> converting them to binaries with binary:part/3. The binary:part/3 call is
> the slow piece here.
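
The two steps described above can be sketched roughly as follows. This is
a simplified illustration, not the actual OTP source: it uses
binary:matches/2 rather than binary:matches/3, ignores the options that
binary:split/3 accepts, and always splits on every occurrence. The module
and function names are made up for the example.

```erlang
-module(split_sketch).
-export([split_all/2]).

%% Step 1: collect every {Start, Length} match of Pattern in Subject.
%% Step 2: walk the matches and cut out the pieces between them.
split_all(Subject, Pattern) ->
    Matches = binary:matches(Subject, Pattern),
    cut(Subject, 0, Matches).

%% Pos is the offset just after the previous match.
cut(Subject, Pos, [{Start, Len} | Rest]) ->
    Piece = binary:part(Subject, Pos, Start - Pos),
    [Piece | cut(Subject, Start + Len, Rest)];
cut(Subject, Pos, []) ->
    %% Whatever is left after the last match is the final piece.
    [binary:part(Subject, Pos, byte_size(Subject) - Pos)].
```

Each piece costs one binary:part/3 call, so a line with many separators
pays for many calls even if only one field is needed afterwards.
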
>
> *My question is:* is this expected? Why are binary:split/3 (and
> binary:part/3) affecting performance so drastically? How can I
> investigate/understand this further?
>
> ## Other bottlenecks
>
> The other two immediate bottlenecks are the use of regular expressions and
> the use of file:read_line/3 instead of loading the whole file into memory.
> Those were given as hard requirements by the author. Nonetheless, someone
> wrote an Erlang implementation that removes those bottlenecks too (along
> with binary:split/3) and the performance is outstanding:
>
> https://github.com/dimroc/etl-language-comparison/pull/10/files
>
> I have since rewritten the Elixir one and got a similar result.
> However, I am still puzzled because using binary:split/3 would have been my
> first try (instead of relying on match+part) as it leads to cleaner code
> (imo).
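
For comparison, a match+part approach might look like the sketch below,
reading "match+part" as binary:match/2 followed by binary:part/3. It pulls
out only the first field of a line instead of splitting the whole line.
The module name, function name and the choice of "first field" are
assumptions for illustration; the linked implementations may be organized
differently.

```erlang
-module(field_sketch).
-export([first_field/2]).

%% Return the part of Line before the first occurrence of Sep,
%% or the whole line when Sep does not occur at all.
first_field(Line, Sep) ->
    case binary:match(Line, Sep) of
        {Pos, _Len} -> binary:part(Line, 0, Pos);
        nomatch     -> Line
    end.
```

This makes a single binary:part/3 call per line, at the cost of code that
is less direct than one binary:split/3 call.
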
>
> Thanks.
>
> *José Valim*
> www.plataformatec.com.br
> Skype: jv.ptec
> Founder and Lead Developer