[erlang-questions] matching binary against a file
Jeff Rogers
dvrsn@REDACTED
Thu Jun 21 23:34:04 CEST 2007
Mike McNally wrote:
>> But what is the best way for implementing this other process (which I
>> imagine ends up looking something like 'file' but with different
>> functions for reading different data types)? Catch the match and read
>> the next block on a badmatch? Is this kind of "if it fails, do
>> something and try it again" pattern properly erlish?
>
> (I hope I understand what you're asking correctly ...)
As I imagine most newbies do, I struggle a bit with getting away from
the imperative/iterative mindset. What I'm thinking of as a simple
while loop, something like
    while ((result = scan_buffer()) == error) {
        fill_buffer()
    }
    return result
doesn't really seem to map directly to match declarations and recursion.
So I'm trying to figure out what it is I really want to do.
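
For comparison, the while loop above can be written as a recursive function that tries the scan, and on failure reads more data and calls itself again. This is only a sketch under assumptions: scan/1 is a hypothetical stand-in for scan_buffer() that needs two bytes, the return shapes {ok, Value, Rest} and more are invented for illustration, and Fd is assumed to have been opened with the binary option so that file:read/2 returns binaries.

```erlang
-module(scan_loop).
-export([read_pair/2]).

%% Hypothetical scanner standing in for scan_buffer(): it succeeds once
%% two bytes are available and otherwise asks for more input.
scan(<<A, B, Rest/binary>>) -> {ok, {A, B}, Rest};
scan(_TooShort) -> more.

%% Try to scan the buffer; if it is too short, refill it from the file
%% and try again. Only the unconsumed bytes are carried forward.
read_pair(Fd, Buf) ->
    case scan(Buf) of
        {ok, Value, Rest} ->
            {ok, Value, Rest};
        more ->
            case file:read(Fd, 4096) of
                {ok, Data} -> read_pair(Fd, <<Buf/binary, Data/binary>>);
                eof -> eof;
                {error, _Reason} = Error -> Error
            end
    end.
```

No exception is caught here: the "failure" is an ordinary return value (more) that the caller matches on, and the caller passes Rest back in as the buffer for the next read.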
> It looks something like:
>
> if at eof, send the "eof" message;
> if the current block is empty, recurse with the next block;
> if the current block is not empty,
>     examine one or more bytes to find the UTF-8 value
>     send the value to the client process
>     recurse with the new block position
>
> There's a simple and efficient technique for iterating over a binary
> without continually having to make a new one. The above would look like
>
> utf(Client, _C, eof) -> Client ! eof;
> utf(Client, C, <<_:C>>) -> utf(Client, C, read_next_block());
> utf(Client, C, Block = <<_:C, CurChar, Remainder/binary>>) ->
>     %% utf interpretation here
>     utf(Client, NewC, Block).
I'm a bit fuzzy on the omitted details :) Won't this end up eventually
reading the entire file into memory as Block keeps getting appended to?
C here works as an offset into the block, correct?
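
To make the offset idea concrete, here is a small sketch that walks a sequence of blocks by offset and never appends one block to another. It rests on assumptions: the blocks arrive as a list of binaries rather than from a file, the module and function names are made up, and the offset counts bytes (hence the /binary type on the skipped prefix; a size with no type, as in `<<_:C>>`, counts bits).

```erlang
-module(offset_walk).
-export([sum_bytes/1]).

%% Sum every byte in a list of blocks, visiting each block by offset.
sum_bytes(Blocks) -> walk(0, Blocks, 0).

%% No blocks left: done.
walk(_Off, [], Acc) ->
    Acc;
%% Offset has reached the end of the current block: drop the block
%% and restart at offset 0 in the next one.
walk(Off, [Block | More], Acc) when Off =:= byte_size(Block) ->
    walk(0, More, Acc);
%% Otherwise skip Off bytes, take one byte, and advance the offset.
walk(Off, [Block | _] = Blocks, Acc) ->
    <<_:Off/binary, Byte, _/binary>> = Block,
    walk(Off + 1, Blocks, Acc + Byte).
```

In this version an exhausted block is replaced rather than extended, so only one block is held at a time. A value that straddles two blocks would need the leftover bytes carried over, which this sketch does not handle.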
> (I left out some details; for example the utf function probably has to
> cart around the file handle.)
>
>
> As to "reading different data types", I'm not sure what you mean. The
> file is full of UTF-8 characters, no? If not then I'm confused.
The file I'm reading is an Apache Lucene index file, which has "Vint"
encoded integers (1-5 bytes), bytes, signed and unsigned ints, unsigned
64-bit ints, and strings, which are a Vint length in chars followed by
that number of (modified) UTF-8 encoded chars. So I have read_vint,
read_utf8, read_string, read_u64, read_i32, read_s32, read_b and a
helper read_types to take a list of types and read those from the
stream. The file format dictates what type you expect to see at any
given point. So it's more a matter of decoding a weirdly packed binary
file than just reading characters.
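
As an illustration of that kind of decoder, here is a minimal sketch of a Vint reader and a read_types-style helper working on a binary. The function names come from the list above, but the signatures, the {ok, Value, Rest} and more return shapes, the type atoms, and the big-endian layout for the 64-bit integer are assumptions made for the example. The Vint layout used is 7 data bits per byte, least significant group first, with a set high bit meaning another byte follows.

```erlang
-module(lucene_read).
-export([read_vint/1, read_u64/1, read_types/2]).

%% Decode one Vint from the front of Bin.
read_vint(Bin) -> read_vint(Bin, 0, 0).

%% High bit set: keep the low 7 bits and continue with the next byte.
read_vint(<<1:1, Low:7, Rest/binary>>, Shift, Acc) ->
    read_vint(Rest, Shift + 7, Acc bor (Low bsl Shift));
%% High bit clear: this is the last byte of the value.
read_vint(<<0:1, Low:7, Rest/binary>>, Shift, Acc) ->
    {ok, Acc bor (Low bsl Shift), Rest};
%% Ran out of bytes in the middle of a value.
read_vint(<<>>, _Shift, _Acc) ->
    more.

%% Unsigned 64-bit integer, assumed big-endian.
read_u64(<<V:64/unsigned-big, Rest/binary>>) -> {ok, V, Rest};
read_u64(_TooShort) -> more.

%% Read a list of types in order, returning the values in the same order.
read_types(Types, Bin) -> read_types(Types, Bin, []).

read_types([], Bin, Acc) ->
    {ok, lists:reverse(Acc), Bin};
read_types([Type | Types], Bin, Acc) ->
    case read(Type, Bin) of
        {ok, Value, Rest} -> read_types(Types, Rest, [Value | Acc]);
        more -> more
    end.

read(vint, Bin) -> read_vint(Bin);
read(u64, Bin) -> read_u64(Bin);
read(byte, <<B, Rest/binary>>) -> {ok, B, Rest};
read(byte, <<>>) -> more.
```

For example, lucene_read:read_types([vint, byte], <<16#80, 16#01, 7>>) decodes the two-byte Vint 128 followed by the byte 7. A more result means the binary ended before the requested values were complete, so the caller would read another block and retry.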
Thanks,
-J