<div dir="ltr">Thank you, Sergej.<div><br></div><div>I have created a branch that uses the split version you mentioned, and it is 4x slower than using binary:split/3. Here is the commit that added the new implementation:</div><div><br></div><div><a href="https://github.com/josevalim/etl-language-comparison/commit/e6cf0a35700cef751b1052083ccec5a3c0394648">https://github.com/josevalim/etl-language-comparison/commit/e6cf0a35700cef751b1052083ccec5a3c0394648</a><br></div><div><br></div><div>Thoughts?</div></div><div class="gmail_extra"><br clear="all"><div><div class="gmail_signature"><div><br></div><div><br></div><div><span style="font-size:13px"><div><span style="font-family:arial,sans-serif;font-size:13px;border-collapse:collapse"><b>José Valim</b></span></div><div><span style="font-family:arial,sans-serif;font-size:13px;border-collapse:collapse"><div><span style="font-family:verdana,sans-serif;font-size:x-small"><a href="http://www.plataformatec.com.br/" style="color:rgb(42,93,176)" target="_blank">www.plataformatec.com.br</a></span></div><div><span style="font-family:verdana,sans-serif;font-size:x-small">Skype: jv.ptec</span></div><div><span style="font-family:verdana,sans-serif;font-size:x-small">Founder and Lead Developer</span></div></span></div></span></div></div></div>
<br><div class="gmail_quote">On Wed, May 20, 2015 at 12:56 PM, Sergej Jurečko <span dir="ltr"><<a href="mailto:sergej.jurecko@gmail.com" target="_blank">sergej.jurecko@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>binary:split is not fast and unfortunately many people do not realize that. <br>If you want speed, here is an implementation that is made for speed:<br><a href="https://github.com/biokoda/bkdcore/blob/master/src/butil.erl#L2359" target="_blank">https://github.com/biokoda/bkdcore/blob/master/src/butil.erl#L2359</a><br><a href="https://github.com/biokoda/bkdcore/blob/master/src/butil.erl#L2373" target="_blank">https://github.com/biokoda/bkdcore/blob/master/src/butil.erl#L2373</a><br><br></div>Sergej<br></div><div class="gmail_extra"><br><div class="gmail_quote"><div><div class="h5">On Wed, May 20, 2015 at 12:35 PM, José Valim <span dir="ltr"><<a href="mailto:jose.valim@plataformatec.com.br" target="_blank">jose.valim@plataformatec.com.br</a>></span> wrote:<br></div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5"><div dir="ltr">Hello folks,<div><br></div><div>At the beginning of the month, someone wrote a blog post comparing data processing between different platforms and languages, one of them being Erlang VM/Elixir:</div><div><br></div><div><a href="http://blog.dimroc.com/2015/05/07/etl-language-showdown-pt2/" target="_blank">http://blog.dimroc.com/2015/05/07/etl-language-showdown-pt2/</a></div><div><br></div><div>After running the experiments, I thought we could do much better. To my surprise, our biggest performance hit was when calling binary:split/3. 
I have rewritten the code to use only Erlang function calls (to make it clearer for this discussion):</div><div><br></div><div><a href="https://github.com/josevalim/etl-language-comparison/blob/jv-erl/elixir/lib/map_actor.ex" target="_blank">https://github.com/josevalim/etl-language-comparison/blob/jv-erl/elixir/lib/map_actor.ex</a><br></div><div><br></div><div>Performance is the same whether the code uses Erlang or Elixir function calls (a full rewrite in Erlang gives the same result too). This line is the bottleneck:</div><div><br></div><div><a href="https://github.com/josevalim/etl-language-comparison/blob/jv-erl/elixir/lib/map_actor.ex#L11" target="_blank">https://github.com/josevalim/etl-language-comparison/blob/jv-erl/elixir/lib/map_actor.ex#L11</a><br></div><div><br></div><div>In fact, if we move the regular expression check before the binary:split/3 call, we get the same performance as Go on my machine, which means binary:split/3 is making the code at least twice as slow.</div><div><br></div><div>The binary:split/3 implementation works in two steps: first it finds all matches via binary:matches/3, then it traverses the matches, converting them to binaries with binary:part/3. The binary:part/3 call is the slow piece here.</div><div><br></div><div><b>My question is:</b> is this expected? Why is binary:split/3 (and binary:part/3) affecting performance so drastically? How can I investigate/understand this further?</div><div><br></div><div>## Other bottlenecks</div><div><br></div><div>The other two immediate bottlenecks are the use of regular expressions and the use of file:read_line/1 instead of loading the whole file into memory. Those were given as hard requirements by the author. 
Nonetheless, someone wrote an Erlang implementation that removes those bottlenecks too (along with binary:split/3), and the performance is outstanding:</div><div><br></div><div><a href="https://github.com/dimroc/etl-language-comparison/pull/10/files" target="_blank">https://github.com/dimroc/etl-language-comparison/pull/10/files</a><br></div><div><br></div><div>I have since rewritten the Elixir one and got a similar result. However, I am still puzzled, because using binary:split/3 would have been my first try (instead of relying on match+part), as it leads to cleaner code (imo).</div><div><div><div><div><br></div><div>Thanks.</div><span><font color="#888888"><div><br></div><div><span style="font-size:13px"><div><span style="font-family:arial,sans-serif;font-size:13px;border-collapse:collapse"><b>José Valim</b></span></div><div><span style="font-family:arial,sans-serif;font-size:13px;border-collapse:collapse"><div><span style="font-family:verdana,sans-serif;font-size:x-small"><a href="http://www.plataformatec.com.br/" style="color:rgb(42,93,176)" target="_blank">www.plataformatec.com.br</a></span></div><div><span style="font-family:verdana,sans-serif;font-size:x-small">Skype: jv.ptec</span></div><div><span style="font-family:verdana,sans-serif;font-size:x-small">Founder and Lead Developer</span></div></span></div></span></div></font></span></div></div>
</div></div>
<br></div></div>_______________________________________________<br>
erlang-questions mailing list<br>
<a href="mailto:erlang-questions@erlang.org" target="_blank">erlang-questions@erlang.org</a><br>
<a href="http://erlang.org/mailman/listinfo/erlang-questions" target="_blank">http://erlang.org/mailman/listinfo/erlang-questions</a><br>
<br></blockquote></div><br></div>
</blockquote></div><br></div>