[erlang-questions] matching binary against a file

Jeff Rogers dvrsn@REDACTED
Thu Jun 21 21:57:55 CEST 2007


I'm just learning erlang and I'm writing a module to read lucene 
indexes.  One problem I keep running into is that the lucene file format 
appears to be explicitly designed to only be readable one byte at a 
time, which makes binary match operations painful.  One example of this 
is reading UTF-8 encoded data (I'm not worrying about modified utf-8 
just yet).

Here is what I'm currently using to read a specified count of utf-8 
encoded characters from an in-memory binary:

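% Read Cnt utf-8 characters from a binary; returns [Chars, Rest].
read_utf8(C,D) -> read_utf8(C,<<>>,D).
read_utf8(0,S,Data) -> [S,Data];
% 3-byte sequence (lead bits 1110): copy all 24 bits through unchanged.
read_utf8(Cnt,S,<< 2#1110:4, C:20, Data/binary >>) ->
     read_utf8(Cnt-1,<<S/binary,2#1110:4,C:20>>,Data);
% 2-byte sequence (lead bits 110): copy all 16 bits through unchanged.
read_utf8(Cnt,S,<< 2#110:3, C:13, Data/binary >>) ->
     read_utf8(Cnt-1,<<S/binary,2#110:3,C:13>>,Data);
% 1-byte (ASCII) character: C already fits in a single byte.
read_utf8(Cnt,S,<< 0:1, C:7, Data/binary >>) ->
     read_utf8(Cnt-1,<<S/binary,C>>,Data).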

Now this is all fine as long as I first read the entire string into 
memory, but because you can't know beforehand how many bytes will be 
read you have to overestimate and then handle whatever is left over.
Reading the entire multi-megabyte file into a single binary object 
strikes me as inefficient (and may not even be possible in all cases).
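
For reference, a minimal sketch of the overestimate-and-handle-leftover 
approach described above, reading only a bounded chunk instead of the 
whole file.  The module name utf8_chunk and the functions read_chars/3 
and take_chars/2 are hypothetical names made up for this illustration; 
file:pread/3 is the standard positional read, and the file is assumed to 
have been opened with [read, binary].  Like the code above, it only 
handles 1- to 3-byte sequences and does not validate continuation bytes.

```erlang
-module(utf8_chunk).
-export([read_chars/3, take_chars/2]).

%% Hypothetical helper: read Count utf-8 characters starting at byte
%% Offset of an open file.  Returns {ok, Bytes, NextOffset} so the caller
%% can continue from the first unread byte, or {error, Reason}.
read_chars(_Fd, Offset, 0) ->
    {ok, <<>>, Offset};
read_chars(Fd, Offset, Count) ->
    %% Overestimate: with sequences of at most 3 bytes, Count characters
    %% occupy at most 3 * Count bytes.
    case file:pread(Fd, Offset, 3 * Count) of
        {ok, Chunk} ->
            case take_chars(Count, Chunk) of
                {ok, Chars, _Leftover} ->
                    %% The leftover is dropped here; the next read starts
                    %% right after the bytes that were actually consumed.
                    {ok, Chars, Offset + byte_size(Chars)};
                {error, _} = Error ->
                    Error
            end;
        eof ->
            {error, eof};
        {error, _} = Error ->
            Error
    end.

%% Split Bin into the first Count utf-8 characters and the remainder.
take_chars(Count, Bin) ->
    case byte_len(Count, Bin, 0) of
        {ok, Len} ->
            <<Chars:Len/binary, Rest/binary>> = Bin,
            {ok, Chars, Rest};
        error ->
            {error, truncated_or_invalid}
    end.

%% Count how many bytes the first N characters occupy.
byte_len(0, _Bin, Len) -> {ok, Len};
byte_len(N, <<0:1, _:7, Rest/binary>>, Len) -> byte_len(N - 1, Rest, Len + 1);
byte_len(N, <<2#110:3, _:13, Rest/binary>>, Len) -> byte_len(N - 1, Rest, Len + 2);
byte_len(N, <<2#1110:4, _:20, Rest/binary>>, Len) -> byte_len(N - 1, Rest, Len + 3);
byte_len(_N, _Bin, _Len) -> error.
```

Something like {ok, Fd} = file:open(Path, [read, binary]) followed by 
utf8_chunk:read_chars(Fd, 0, 10) would then fetch at most 30 bytes per 
call, at the cost of re-reading the leftover bytes on the next call.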

So, is there any way to do a binary match against file data directly 
(like you could do with mmap)?  Or am I just taking the wrong approach here?

Thanks
-J


