[erlang-questions] matching binary against a file

Thu Jun 21 23:34:04 CEST 2007

Mike McNally wrote:
>> But what is the best way for implementing this other process (which I 
>> imagine ends up looking something like 'file' but with different 
>> functions for reading different data types)?  Catch the match and read 
>> the next block on a badmatch?  Is this kind of "if it fails, do 
>> something and try it again" pattern properly erlish?
> 
> (I hope I understand what you're asking correctly ...)

As I imagine most newbies do, I struggle a bit with getting away from 
the imperative/iterative mindset.  What I'm thinking of as a simple 
while loop, something like

while ((result = scan_buffer()) == error) {
    fill_buffer()
}
return result

doesn't really seem to map directly onto pattern-matching clauses and
recursion.  So I'm trying to figure out what it is I really want to do.
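
For what it's worth, here is a rough sketch of how that while loop might
look as a recursive function.  Everything in it is made up for
illustration: the module name, the ScanFun/FillFun arguments, and the
convention that the scanner returns 'more' when the buffer is too short
(instead of failing with a badmatch that has to be caught).

```erlang
-module(scan_loop).
-export([read_value/3]).

%% Hypothetical conventions, not taken from any library:
%%   ScanFun(Buffer) -> {ok, Value, Rest} | more
%%   FillFun()       -> {ok, Data} | eof
%% Equivalent of: while (scan fails) { fill buffer }; return result
read_value(ScanFun, FillFun, Buffer) ->
    case ScanFun(Buffer) of
        {ok, Value, Rest} ->
            %% The scan succeeded: return the value and the unread tail.
            {ok, Value, Rest};
        more ->
            %% Not enough data: fetch more and try again (tail call).
            case FillFun() of
                {ok, Data} ->
                    read_value(ScanFun, FillFun,
                               <<Buffer/binary, Data/binary>>);
                eof ->
                    {error, eof}
            end
    end.
```

The loop condition becomes the case expression and the loop body becomes
the recursive call, so nothing needs to be caught.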

> It looks something like:
> 
> if at eof, send the "eof" message;
> if the current block is empty, recurse with the next block;
> if the current block is not empty,
>   examine one or more bytes to find the UTF-8 value
>   send the value to the client process
>   recurse with the new block position
> 
> There's a simple and efficient technique for iterating over a binary
> without continually having to make a new one.  The above would look like
> 
> utf(Client, _C, eof) -> Client ! eof;
> utf(Client, C, <<_:C>>) -> utf(Client, C, read_next_block());
> utf(Client, C, Block = <<_:C, CurChar, Remainder/binary>>) ->
>   %% utf interpretation here
>   utf(Client, NewC, Block).

I'm a bit fuzzy on the omitted details :)  Won't this end up eventually
reading the entire file into memory as Block keeps getting appended to?
C here works as an offset into the block, correct?
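
To check my own understanding, here is a small sketch of the offset idea
in which the block is replaced rather than appended to, so only one
block is held at a time.  It assumes the offset counts bytes, and the
list of blocks is just a stand-in for successive file reads; the module
and function names are invented.

```erlang
-module(block_iter).
-export([count_bytes/1]).

%% Counts the bytes in a list of blocks while holding one block at a time.
%% Blocks is a hypothetical stand-in for successive reads from a file.
count_bytes(Blocks) ->
    walk(0, <<>>, Blocks, 0).

%% Offset has reached the end of the block: drop it, take the next one.
walk(Offset, Block, [Next | Rest], Count)
  when Offset =:= byte_size(Block) ->
    walk(0, Next, Rest, Count);
%% Offset is at the end of the last block: done.
walk(Offset, Block, [], Count)
  when Offset =:= byte_size(Block) ->
    Count;
%% Otherwise skip Offset bytes, look at one byte, and move the offset on.
walk(Offset, Block, Blocks, Count) ->
    <<_:Offset/binary, _Byte, _/binary>> = Block,
    walk(Offset + 1, Block, Blocks, Count + 1).
```

A real reader would also have to deal with a multi-byte value that
straddles two blocks, which this sketch ignores.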

> (I left out some details; for example the utf function probably has to
> cart around the file handle.)
> 
> 
> As to "reading different data types", I'm not sure what you mean.  The
> file is full of UTF-8 characters, no?  If not then I'm confused.

The file I'm reading is an Apache Lucene index file, which has "VInt"
encoded integers (1-5 bytes), bytes, signed and unsigned ints, unsigned
64-bit ints, and strings, which are a VInt length in chars followed by
that number of (modified) UTF-8 encoded chars.  So I have read_vint,
read_utf8, read_string, read_u64, read_i32, read_s32, read_b and a
helper read_types to take a list of types and read those from the
stream.  The file format dictates what type you expect to see at any
given point.  So it's more a matter of decoding a weirdly packed binary
file than just reading characters.
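
To make that concrete, here is a minimal sketch of the kind of decoder I
mean, not my actual code.  It assumes the VInt layout from the Lucene
file format description (seven data bits per byte, low-order group
first, high bit set when more bytes follow) and big-endian 64-bit ints.
The module name, the 'more' return value, and the type atoms are
invented for the example.

```erlang
-module(lucene_read).
-export([read_vint/1, read_types/2]).

%% read_vint(Bin) -> {ok, Value, Rest} | more
read_vint(Bin) ->
    read_vint(Bin, 0, 0).

%% High bit set: keep the 7 data bits and continue with the next byte.
read_vint(<<1:1, Low:7, Rest/binary>>, Shift, Acc) ->
    read_vint(Rest, Shift + 7, Acc bor (Low bsl Shift));
%% High bit clear: this is the last byte of the value.
read_vint(<<0:1, Low:7, Rest/binary>>, _Shift = Shift, Acc) ->
    {ok, Acc bor (Low bsl Shift), Rest};
%% Ran out of bytes in the middle of a value.
read_vint(<<>>, _Shift, _Acc) ->
    more.

%% read_types(Types, Bin) -> {ok, Values, Rest} | more
%% Reads one value per type atom, in order.
read_types(Types, Bin) ->
    read_types(Types, Bin, []).

read_types([], Rest, Acc) ->
    {ok, lists:reverse(Acc), Rest};
read_types([Type | Types], Bin, Acc) ->
    case read_type(Type, Bin) of
        {ok, Value, Rest} -> read_types(Types, Rest, [Value | Acc]);
        more -> more
    end.

%% Only three types are shown; an unknown type atom crashes with
%% function_clause, which is deliberate in this sketch.
read_type(vint, Bin) ->
    read_vint(Bin);
read_type(byte, <<B, Rest/binary>>) ->
    {ok, B, Rest};
read_type(u64, <<N:64/unsigned-big, Rest/binary>>) ->
    {ok, N, Rest};
read_type(Type, _TooShort) when Type =:= byte; Type =:= u64 ->
    more.
```

For example, lucene_read:read_types([vint, byte], Bin) would read a VInt
followed by a single byte and return both values with the unread tail.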

Thanks,
-J