[erlang-questions] matching binary against a file
Jeff Rogers
dvrsn@REDACTED
Thu Jun 21 21:57:55 CEST 2007
I'm just learning Erlang and I'm writing a module to read Lucene
indexes. One problem I keep running into is that the Lucene file format
appears to be explicitly designed to be readable only one byte at a
time, which makes binary match operations painful. One example of this
is reading UTF-8 encoded data (I'm not worrying about modified UTF-8
just yet).
Here is what I'm currently using to read a specified count of UTF-8
encoded characters from an in-memory binary:
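% Read Cnt UTF-8 characters from Data; returns [Bytes, Rest].
read_utf8(C,D) -> read_utf8(C,<<>>,D).
read_utf8(0,S,Data) -> [S,Data];
% 3-byte sequence: copy all 24 bits so the encoded character stays intact
% (appending C alone would truncate it to its low 8 bits).
read_utf8(Cnt,S,<< 2#1110:4, C:20, Data/binary >>) ->
    read_utf8(Cnt-1,<<S/binary,2#1110:4,C:20>>,Data);
% 2-byte sequence.
read_utf8(Cnt,S,<< 2#110:3, C:13, Data/binary >>) ->
    read_utf8(Cnt-1,<<S/binary,2#110:3,C:13>>,Data);
% 1-byte (ASCII) character.
read_utf8(Cnt,S,<< 0:1, C:7, Data/binary >>) ->
    read_utf8(Cnt-1,<<S/binary,C>>,Data).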
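For comparison, here is a minimal sketch of the same count-limited read written with the utf8 segment type of the bit syntax. It assumes an Erlang/OTP release recent enough to support `/utf8` segments, and it returns decoded code points rather than raw bytes. The module name `utf8_sketch` and the function `take_chars/2` are hypothetical names chosen for illustration.

```erlang
-module(utf8_sketch).
-export([take_chars/2]).

%% take_chars(Count, Bin) -> {Chars, Rest} | {error, {incomplete, Chars, Rest}}
%% Chars is a list of Unicode code points; Rest is the undecoded remainder.
take_chars(Count, Bin) ->
    take_chars(Count, [], Bin).

%% Requested number of characters has been decoded.
take_chars(0, Acc, Rest) ->
    {lists:reverse(Acc), Rest};
%% The utf8 segment matches one well-formed character of any encoded length.
take_chars(Count, Acc, <<Char/utf8, Rest/binary>>) ->
    take_chars(Count - 1, [Char | Acc], Rest);
%% Input ran out, or the next bytes are not valid standard UTF-8.
take_chars(_Count, Acc, Rest) ->
    {error, {incomplete, lists:reverse(Acc), Rest}}.
```

Note that this only handles standard UTF-8; Lucene's modified UTF-8 would presumably need explicit clauses like the ones above.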
Now this is all fine as long as I first read the entire string into
memory, but because you can't know beforehand how many bytes will be
read you have to overestimate and then handle whatever is left over.
Reading the entire multi-megabyte file into a single binary object
strikes me as inefficient (and may not even be possible in all cases).
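To make the "overestimate and handle the leftover" idea concrete, here is a rough sketch that reads a bounded chunk at a known offset instead of the whole file. It assumes the file was opened in binary mode, that the data is standard UTF-8 (at most 4 bytes per character), and an Erlang/OTP release with `/utf8` segments. The module `chunk_sketch` and the function `read_chars/3` are hypothetical names, not part of any library.

```erlang
-module(chunk_sketch).
-export([read_chars/3]).

%% read_chars(Fd, Offset, Count) ->
%%     {ok, Chars, NextOffset} | {error, Reason}
%% Reads Count UTF-8 characters starting at byte Offset of an open file.
read_chars(_Fd, Offset, 0) ->
    {ok, [], Offset};
read_chars(Fd, Offset, Count) ->
    %% Overestimate: Count characters never need more than Count * 4 bytes.
    case file:pread(Fd, Offset, Count * 4) of
        {ok, Chunk} ->
            case take(Count, [], Chunk) of
                {ok, Chars, Rest} ->
                    %% The leftover bytes are simply not counted as consumed.
                    Used = byte_size(Chunk) - byte_size(Rest),
                    {ok, Chars, Offset + Used};
                error ->
                    {error, bad_utf8}
            end;
        eof ->
            {error, eof};
        {error, Reason} ->
            {error, Reason}
    end.

%% Decode N characters from the front of a binary.
take(0, Acc, Rest) -> {ok, lists:reverse(Acc), Rest};
take(N, Acc, <<C/utf8, Rest/binary>>) -> take(N - 1, [C | Acc], Rest);
take(_, _, _) -> error.
```

A caller would open the file once, for example with `file:open(Path, [read, binary, raw])`, and pass the returned `NextOffset` to the next read, so the leftover bytes get re-read rather than buffered. That trades some redundant I/O for not holding the whole file in memory.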
So, is there any way to do a binary match against file data directly
(like you could do with mmap)? Or am I just taking the wrong approach here?
Thanks
-J