[eeps] optimized search in binaries

Sun Jan 27 20:20:07 CET 2008

Hello,
I would like to propose to create two optimized BIF for search in binaries.
These functions can be used to increase effectiveness of text parsing. These
functions can be used for search of the new line, to accellerate and to
parse XML documents. Something like these:

find_1_byte( Char, Bytes ) -> Pos | not_found
Pos = Char = integer()
Bytes = bytes()

Find 1-byte character in binary string. This function can be implemented to
access the memory with 32-bit words. I can remember I seen functions
implemented like this in Intel libraries.

find_4_bytes( FourChars, Bytes ) -> Pos | not_found
FourChars = ??? << Symbols:32 >> |  [ A, B, C, D ]
A = B = C = D = integer()

The same idea, but here you have the possibility to compare 32-bit values at
once. This function can be used to accelerate search for substring.

It would probably be preferable to implement general search for substring in
BIF. Interface when we taking the list of 4 integers looks not good.

Both functions might be generalized to work on iolist types.

I've done some tests with reading files and splitting it for new lines with
erlang - quite similar to line_server.erl module. This task is CPU bound,
and most of the time is spent in find_1_byte. The same problems
will occur when we try to parse text-based protocol, such as exchange quote
data or XML. Erlang can do much better on these tasks.

What do you thing?

Thanks,
Vlad
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/eeps/attachments/20080127/09107f73/attachment.htm>