[erlang-questions] Trying to understand the performance impact of binary:split/3

Wed May 20 14:28:52 CEST 2015

On Wed, May 20, 2015 at 1:14 PM, José Valim <jose.valim@REDACTED
> wrote:

> Thoughts?

Matching:

The structure of the Needle and the Haystack matter:

1) In a search for Supercalifragilisticexpialidocious, search algorithms
such as Boyer-Moore wins. GNU Grep even unrolls the inner BM loop to make
it even faster.
2) If the needle is a $, separator in a CSV file, then naive string seach
can do surprisingly well.
3) If you are searching for DNA sequence matches, the low alphabet of GACT
means you have lots of partial matches. In that case the "standard"
algorithm is Knuth-Morris-Pratt.

Part'ing:

running part is O(1), so constant factors will play an ever increasing
role. Importantly in a call binary:part(Bin, Pos), the part/2 BIF can't
know a priori that Pos is valid for the binary. So it has to run a bounds
check, even though binary:matches/3 will always provide valid positions.
Chances are that a low-level naive approach written in a statically typed
language is able to eliminate those bounds checks because it knows stuff.

Fusion:

Erlang can't fuse the construction of the MatchList with the reduction on
the part call. So you build up a list of matches and then you consume that
list right after. Compilers with a fuse-pass can eliminate these
intermediate lists.

-- 
J.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20150520/56a03c99/attachment.htm>