[erlang-questions] Trying to understand the performance impact of binary:split/3
Wed May 20 12:35:18 CEST 2015
At the beginning of the month, someone wrote a blog post comparing data
processing between different platforms and languages, one of them being
Elixir.
After running the experiments, I thought we could do much better. To my
surprise, our biggest performance hit was when calling binary:split/3. I
have rewritten the code to use only Erlang function calls (to make it
clearer for this discussion):
The performance of the Erlang and Elixir variants is the same (the
pure-Erlang rewrite gives the same result). This line is the bottleneck:
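The exact line was also lost with the attachment, but it is presumably a call of this shape (the input and the separator here are assumptions):

```erlang
%% Presumed shape of the bottleneck call; the sample input and the
%% comma separator are assumptions for illustration.
Line = <<"foo,bar,baz">>,
[<<"foo">>, <<"bar">>, <<"baz">>] = binary:split(Line, <<",">>, [global]).
```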
In fact, if we move the regular expression check to before the
binary:split/3 call, we get the same performance as Go on my machine,
meaning that binary:split/3 alone makes the code at least twice as slow.
The binary:split/3 implementation works in two steps: first it finds
all matches via binary:matches/3, then it traverses the matches,
converting them to binaries with binary:part/3. The binary:part/3 call is
the slow part here.
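A simplified version of that two-step strategy (a sketch of the idea, not the actual OTP source) looks like this:

```erlang
%% Simplified sketch of binary:split/3's strategy (not the actual OTP
%% source): find all match positions, then slice the subject into
%% sub-binaries with binary:part/3.
-module(my_split).
-export([split/2]).

split(Subject, Pattern) ->
    Matches = binary:matches(Subject, Pattern),
    parts(Subject, 0, Matches).

parts(Subject, Offset, []) ->
    %% Final segment: everything after the last match.
    [binary:part(Subject, Offset, byte_size(Subject) - Offset)];
parts(Subject, Offset, [{Pos, Len} | Rest]) ->
    %% Segment between the previous match (Offset) and this one (Pos).
    [binary:part(Subject, Offset, Pos - Offset)
     | parts(Subject, Pos + Len, Rest)].
```

On a small input this agrees with binary:split/3 with the [global] option.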
*My question is:* is this expected? Why does binary:split/3 (and
binary:part/3) affect performance so drastically? How can I
investigate/understand this further?
## Other bottlenecks
The other two immediate bottlenecks are the use of regular expressions and
the use of file:read_line/1 instead of loading the whole file into memory.
Those were given as hard requirements by the author. Nonetheless, someone
wrote an Erlang implementation that removes those bottlenecks too (along
with binary:split/3) and the performance is outstanding.
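For reference, a sketch of that bottleneck-free approach: read the whole file into memory at once and scan it with binary pattern matching directly, avoiding both regular expressions and binary:split/3 (the module and function names are my own, and counting newlines stands in for whatever per-line work the benchmark did):

```erlang
%% Sketch of the split-free variant: load the whole file and walk it
%% byte by byte with binary pattern matching. Names are assumptions;
%% counting newlines is a placeholder for the real per-line work.
-module(fast_lines).
-export([count_lines/1]).

count_lines(Path) ->
    {ok, Bin} = file:read_file(Path),
    count(Bin, 0).

count(<<$\n, Rest/binary>>, N) -> count(Rest, N + 1);
count(<<_, Rest/binary>>, N)   -> count(Rest, N);
count(<<>>, N)                 -> N.
```

Matching `<<$\n, Rest/binary>>` reuses the match context, so the scan allocates no intermediate sub-binaries at all, which is where the win over matches+part comes from.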
I have since rewritten the Elixir one and got a similar result.
However, I am still puzzled, because binary:split/3 would have been my
first choice (instead of relying on matches+part), as it leads to cleaner
code.
Founder and Lead Developer