[erlang-questions] Problem with pattern matching in large binaries

Edwin Fine erlang-questions_efine@REDACTED
Mon May 5 23:08:46 CEST 2008


 I am really baffled by this, and it smells like a bug to me, but I
 am quite new to Erlang and don't want to make any assumptions, so I am
 throwing this open to the community.

 I am using Erlang R12B-2, compiled on Ubuntu Linux Feisty (64-bit,
 x86_64), Core 2 Quad E6600, 8 GB RAM.

 I have a text file that is 1,037,563,663 bytes in length. In the
 shell, I read it all into memory as follows:

 > {ok,B} = file:read_file("/tmp/data").
 ...
 > byte_size(B).
 1037563663

 So far so good. Then I decide I want to look at the last 100 bytes of
 the binary.

 > Offset = byte_size(B) - 100.
 1037563563

 Makes sense. Now to skip the first Offset bytes:

 > <<_Skip:Offset/binary,Last100/binary>> = B.
 > byte_size(Last100).
 100
 > byte_size(_Skip).
 500692651

 WTF??? Checking Last100 showed that it was indeed the data from offset
 500692651, not the last 100 bytes. Where did the other 500MB-odd go?
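 A quick check of the numbers above: the reported byte_size(_Skip) is
 exactly what you get if the segment size, expressed in bits, is
 truncated to 32 bits. The sketch below only does that arithmetic. The
 module and function names are hypothetical, and this is an observation
 about the figures, not a confirmed diagnosis of the runtime.

```erlang
-module(trunc_check).
-export([wrapped_bytes/1]).

%% Hypothetical helper: the byte count that results if a size of
%% Bytes * 8 bits is truncated to its low 32 bits.
%% This is an assumption about what might be happening, nothing more.
wrapped_bytes(Bytes) when is_integer(Bytes), Bytes >= 0 ->
    Bits = Bytes * 8,
    (Bits band 16#FFFFFFFF) div 8.
```

 Calling trunc_check:wrapped_bytes(1037563563) gives 500692651, the
 size reported for _Skip. Any offset below 2^29 (536870912) bytes comes
 back unchanged.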

 Searching on Google suggested that if there is any binary matching
 limit, it is 2^27 bytes, which is smaller than my offset. But that
 limit supposedly applies to 32-bit systems, and I am running 64-bit.

 Then I tried this:

 > {B1,B2} = split_binary(B, byte_size(B) div 2).
 > byte_size(B1).
 518781831
 > byte_size(B2).
 518781832
 > <<_Yipes2:518781700/binary,Last132/binary>> = B2.

 Checking Last132 showed that it was actually the last 132 bytes of the
 file. So it's not file:read_file misbehaving - it did read the whole
 file into B. It seems to be the size component of a pattern match that
 has some limitation.
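 Since split_binary/2 coped with this binary at its midpoint, one way
 to take the tail without a sized segment in a pattern is sketched
 below. The names bin_tail and tail_bytes/2 are hypothetical, and
 whether split_binary/2 behaves correctly for positions beyond 2^29
 bytes on this release is an assumption not tested here.

```erlang
-module(bin_tail).
-export([tail_bytes/2]).

%% Hypothetical helper: return the last N bytes of Bin using
%% split_binary/2 rather than a <<_:Offset/binary, Rest/binary>> match.
tail_bytes(Bin, N) when is_binary(Bin), is_integer(N),
                        N >= 0, N =< byte_size(Bin) ->
    {_Head, Tail} = split_binary(Bin, byte_size(Bin) - N),
    Tail.
```

 For the case above this would be called as bin_tail:tail_bytes(B, 100).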

 I have looked in the Advanced section (9) of the Erlang Efficiency
 Guide, and the limit of a binary match on a 64-bit system is supposed
 to be 2305843009213693951 bytes.
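 For reference, that documented figure is 2^61 - 1, many orders of
 magnitude above the offset used here. A small sketch of the
 comparison, with hypothetical module and function names:

```erlang
-module(limit_check).
-export([offset_within_limit/0]).

%% Compares the offset from this post with the documented 64-bit limit.
offset_within_limit() ->
    Limit = (1 bsl 61) - 1,          % the documented limit quoted above
    Offset = 1037563563,             % byte_size(B) - 100 from this post
    Limit =:= 2305843009213693951 andalso Offset < Limit.
```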

 Is this an undocumented bug?


