[erlang-questions] Problem with pattern matching in large binaries
Edwin Fine
erlang-questions_efine@REDACTED
Mon May 5 23:08:46 CEST 2008
I am really baffled by this, and it smells like a bug to me, but I
am quite new to Erlang and don't want to make any assumptions, so I am
throwing this open to the community.
I am using Erlang R12B-2, compiled on Ubuntu Linux Feisty (64-bit,
x86_64), Core 2 Quad E6600, 8 GB RAM.
I have a text file that is 1,037,563,663 bytes in length. In the
shell, I read it all into memory as follows:
> {ok,B} = file:read_file("/tmp/data").
...
> byte_size(B).
1037563663
So far so good. Then I decide I want to look at the last 100 bytes of
the binary.
> Offset = byte_size(B) - 100.
1037563563
Makes sense. Now to skip the first Offset bytes:
> <<_Skip:Offset/binary,Last100/binary>> = B.
> byte_size(Last100).
100
> byte_size(_Skip).
500692651
WTF??? Checking Last100 showed that it was indeed the data from offset
500692651, not the last 100 bytes. Where did the other 500MB-odd go?
Looking on Google revealed that if there is any binary matching limit,
it is 2 ^ 27 bytes, which this is smaller than. And this limit is
supposedly on 32 bits, and I am using 64 bits.
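A quick arithmetic check on the numbers above, offered as an observation and not a diagnosis: the requested skip (1037563563) and the observed size of _Skip (500692651) differ by exactly 2^29 bytes, which is 2^32 bits. The sketch below also models a purely hypothetical explanation, namely that the size is converted to bits and truncated to 32 bits somewhere along the way. The module and function names are made up for illustration; nothing here is taken from the emulator source.

```erlang
-module(size_wrap).
-export([missing/0, wrapped_bytes/1]).

%% Bytes that "went missing": requested skip minus observed skip.
missing() ->
    Requested = 1037563563,   % byte_size(B) - 100, from the session above
    Observed  = 500692651,    % byte_size(_Skip) as reported by the shell
    Requested - Observed.

%% ASSUMPTION (hypothetical model, not confirmed): the byte count is
%% multiplied by 8 and the resulting bit count is truncated to 32 bits.
wrapped_bytes(Bytes) ->
    ((Bytes * 8) band 16#FFFFFFFF) div 8.
```

Under that assumed model, size_wrap:missing() equals 1 bsl 29, and size_wrap:wrapped_bytes(1037563563) gives back the 500692651 seen in the shell.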
Then I tried this:
> {B1,B2} = split_binary(B, byte_size(B) div 2).
> byte_size(B1).
518781831
> byte_size(B2).
518781832
> <<_Yipes2:518781700/binary,Last132/binary>> = B2.
Checking Last132 showed that it was actually the last 132 bytes of the
file. So it's not file:read_file misbehaving - it did read the whole
file into B. It seems to be the size component of a pattern match that
has some limitation.
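If the size component of the pattern is the culprit, one way to sidestep it is to take the tail with split_binary/2 instead of a sized pattern. This is only a sketch: the module and function names are hypothetical, and the session above only shows split_binary/2 behaving at the half-way point, so whether it is unaffected at larger offsets on R12B-2 is an assumption.

```erlang
-module(bin_tail).
-export([last_bytes/2]).

%% Return the last N bytes of Bin, or all of Bin if it is shorter than N.
%% Uses split_binary/2 so no size appears inside a binary pattern.
last_bytes(Bin, N) when is_binary(Bin), is_integer(N), N >= 0 ->
    Skip = erlang:max(byte_size(Bin) - N, 0),
    {_Head, Tail} = split_binary(Bin, Skip),
    Tail.
```

In the shell this would be used as Last100 = bin_tail:last_bytes(B, 100).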
I have looked in the Advanced section (9) of the Erlang Efficiency
Guide, and the limit of a binary match on a 64 bit system is supposed
to be 2305843009213693951 bytes.
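For reference, the quoted limit is 2^61 - 1, which is the largest byte count whose size in bits (bytes times 8) still fits in 64 bits; that reading of where the number comes from is my interpretation, not something the guide is quoted as saying here. The small check below, with hypothetical module and function names, confirms the arithmetic and that the offset used above is far below the limit.

```erlang
-module(match_limit).
-export([documented_limit/0, offset_fits/0]).

%% The limit quoted from the Efficiency Guide for 64-bit systems.
documented_limit() ->
    (1 bsl 61) - 1.

%% The skip size used in the failing match is well under that limit.
offset_fits() ->
    1037563563 < documented_limit().
```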
Is this an undocumented bug?