Binary matching

Jay Nelson jay@REDACTED
Mon Sep 30 01:19:41 CEST 2002


I am trying to use port communications and do as much
of the parsing / matching as possible on the binary data
to keep the speed up.  The problem is that I either don't
understand it, am missing some syntax, or it just doesn't
work as documented.

My initial testing was to use a web browser to contact the
active port, receive the GET request, split it into separate
lines (breaking on CR / LF) and print them out to see the result.
This seemed like an easy way to see how to receive and parse
binary data.

Of course, the first thing I noticed was that I can only go
from a binary to a list of ints or from a list of ints to a binary.
There is no documented way to print a binary as a string,
which makes the results a bit difficult to review.
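
For what it's worth, a list of ints is how Erlang represents a string, so
one way to view a binary as text is to convert it with binary_to_list/1
and hand the result to io:format/2 with the ~s directive. A minimal
sketch, assuming the binary holds printable characters; the module and
function names (bin_print, to_string/1, print/1) are made up for
illustration:

```erlang
-module(bin_print).
-export([to_string/1, print/1]).

%% Convert a binary to a list of character codes (an Erlang string).
to_string(Bin) ->
    binary_to_list(Bin).

%% Print the binary's contents as text, followed by a newline.
print(Bin) ->
    io:format("~s~n", [binary_to_list(Bin)]).
```

Calling bin_print:print(Bin) on the GET request data should show the
request text rather than a list of byte values.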

Ignoring that, I forged ahead with a little practice matching
in the interpreter:

Eshell V5.1.2  (abort with ^G)
1> Bin = <<"just a test">>.
<<106,117,115,116,32,97,32,116,101,115,116>>
2> <<W1:32/binary, Rest/binary>> = Bin.
** exited: {{badmatch,<<106,117,115,116,32,97,32,116,101,115,116>>},
             [{erl_eval,expr,3}]} **
3> <<W1:32, Rest/binary>> = Bin.
<<106,117,115,116,32,97,32,116,101,115,116>>
4> b().
Bin = <<106,117,115,116,32,97,32,116,101,115,116>>
Rest = <<32,97,32,116,101,115,116>>
W1 = 1786082164
ok

OK, I can stick a string in a binary but I can't split it in two
by specifying a binary length and then getting the rest, but
I can pull off a 32-bit int and get the rest as a binary.
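
A likely explanation for the badmatch at prompt 2, offered as a sketch
rather than a definitive answer: the size of a /binary segment is counted
in bytes (its default unit is 8 bits), while the size of an integer
segment is counted in bits. So W1:32/binary asks for 32 bytes, and Bin
only holds 11. The hypothetical module below (bin_split and first_word/1
are names invented for illustration) asks for 4 bytes instead:

```erlang
-module(bin_split).
-export([first_word/1]).

%% Take the first 4 bytes as a binary and return the remainder.
%% The size 4 is in bytes because the segment type is binary.
first_word(<<W1:4/binary, Rest/binary>>) ->
    {W1, Rest}.
```

With Bin = <<"just a test">>, first_word(Bin) returns the first four
bytes ("just") as a binary and the remaining seven bytes as Rest.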

5> <<Beginning/binary, 32, End/binary>> = Bin.
** exited: {{badmatch,<<106,117,115,116,32,97,32,116,101,115,116>>},
             [{erl_eval,expr,3}]} **
6> <<106, 117,115, 116, 32, End/binary>> = Bin.
<<106,117,115,116,32,97,32,116,101,115,116>>
7> b().
Bin = <<106,117,115,116,32,97,32,116,101,115,116>>
End = <<97,32,116,101,115,116>>
Rest = <<32,97,32,116,101,115,116>>
W1 = 1786082164
10> os:version().
{4,10,67766222}
11> os:type().
{win32,windows}

and so on...   Various tests basically conclude that I can only
match with the last term as a binary, and that the beginning must
match exactly byte for byte or else I get badmatch.  I cannot even
specify an exact binary length for the leading segments, although
I can specify a length and strip it off as a big integer
(e.g., <<Bignum:80, Rest/binary>> = Data).
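
As a point of comparison, a leading binary segment can take its size from
a variable as long as that variable is already bound, and the size is
read as a byte count. The sketch below assumes that behaviour; take/2 and
the module name bin_take are illustrative names only. The split_binary/2
BIF does the same job without any pattern at all:

```erlang
-module(bin_take).
-export([take/2, take_bif/2]).

%% Split Bin after its first N bytes using a sized binary segment.
%% N must be bound before the match is attempted.
take(N, Bin) ->
    <<Head:N/binary, Tail/binary>> = Bin,
    {Head, Tail}.

%% The same split using the split_binary/2 BIF.
take_bif(N, Bin) ->
    split_binary(Bin, N).
```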

Do I have the wrong version or is this the intended functionality?
(R8B2 on both Windows 98 and Red Hat 7.3 compiled from scratch
both running Erlang 5.1.2).  I read the documentation a little more
closely as I was writing this and it didn't say that the sizes had to
be specified, only that if they are specified they must be bound.  It implied
that <<B1:32/binary, B2:32/binary>> = EightBytes was allowed.

I want to write:

breakLines(Binary) -> lines(Binary, []).

lines(<<>>, Acc) -> lists:reverse(Acc);
lines(<<Line/binary, 13, 10, Rest/binary>>, Acc) ->
	lines(Rest, [Line | Acc]).

Seems straightforward and not too difficult for the pattern-matcher,
but then I'm no compiler writer.

Instead I am left with converting to a list of ints and writing a few
helper functions to loop over the list pulling out chars using two
accumulators (one for the current line and one for the list of lines).
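
The list-based fallback described above could look roughly like this.
It is only a sketch: the module name list_lines and the helper lines/3
are invented here, and each line comes back as a string (list of ints)
rather than a binary.

```erlang
-module(list_lines).
-export([break_lines/1]).

%% Convert to a list once, then walk it with two accumulators:
%% Cur holds the current line (reversed), Lines holds finished lines.
break_lines(Bin) ->
    lines(binary_to_list(Bin), [], []).

%% End of input with nothing pending.
lines([], [], Lines) ->
    lists:reverse(Lines);
%% End of input with a final line that had no CR LF terminator.
lines([], Cur, Lines) ->
    lists:reverse([lists:reverse(Cur) | Lines]);
%% CR LF found: close the current line and start a new one.
lines([13, 10 | Rest], Cur, Lines) ->
    lines(Rest, [], [lists:reverse(Cur) | Lines]);
%% Any other character is added to the current line.
lines([C | Rest], Cur, Lines) ->
    lines(Rest, [C | Cur], Lines).
```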

Am I too worried about efficiency?  Should I forget about binaries?
Am I doing something wrong?

Is it more efficient to make a pass
across the binary looking for the location of all <<13, 10>> pairs, returning
a list of the number of bytes between them and then doing binary
matching now that I know how many bytes to specify on the initial
patterns?
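
A variation on that idea, sketched under the assumption that a bound
variable may be used as the byte size of a leading binary segment: rather
than collecting all positions first, try a candidate line length N and
grow it one byte at a time until the pattern with CR LF matches. The
names bin_lines and break_lines/3 are hypothetical.

```erlang
-module(bin_lines).
-export([break_lines/1]).

break_lines(Bin) ->
    break_lines(Bin, 0, []).

%% N is the candidate length (in bytes) of the next line.
break_lines(Bin, N, Acc) ->
    case Bin of
        %% Exactly N bytes followed by CR LF: a complete line.
        <<Line:N/binary, 13, 10, Rest/binary>> ->
            break_lines(Rest, 0, [Line | Acc]);
        %% N covers everything that is left: no more CR LF pairs.
        <<_:N/binary>> ->
            case N of
                0 -> lists:reverse(Acc);
                _ -> lists:reverse([Bin | Acc])
            end;
        %% Otherwise try a longer line.
        _ ->
            break_lines(Bin, N + 1, Acc)
    end.
```

Each line is returned as a binary, and a final line without a trailing
CR LF is kept. Whether this is faster than the list version is something
I would want to measure rather than assume.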

<little coding break for a couple hours>

I tried this and ended up with the same problem:
-------------------------- cut here --------------------------

-module(bin_utils).
-export([breakLines/1,extract/3,scan/6]).

breakLines(Binary) ->
     StartStop = scan(Binary, <<13,10>>, 16, 0, 0, []),
     extract(Binary, StartStop, []).

extract(<<>>, _Locs, Acc) -> lists:reverse(Acc);
extract(Data, [Start, Stop | Rest], Acc) ->
     Len = Stop - Start,
     %%%%%%%%%  Here is the problem %%%%%%%%%%%%
     %% It does little good to get Line as an int %%
     <<Front:Start, Line:Len/binary, Back/binary>> = Data,
     %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     extract(Back, Rest, [Line | Acc]).

scan(<<>>, _Pattern, _Len, 0, 0, Acc) -> lists:reverse(Acc);
scan(<<>>, _Pattern, _Len, Start, End, Acc) ->
     lists:reverse([End | [Start | Acc]]);
scan(<<Pattern/binary, Rest/binary>>, Pattern, Len, Start, End, Acc) ->
     scan(Rest, Pattern, Len, End + Len, End + Len, [End | [Start | Acc]]);
scan(Data, Pattern, Len, Start, End, Acc) ->
     <<Nomatch, Rest/binary>> = Data,
     scan(Rest, Pattern, Len, Start, End + 8, Acc).

---------------------- end cut ------------------------------


Does this avoid copying the binary (except for inside the function
extract)?  Is looping over the binary more efficient than looping over
a list of integers?

NOTE: There is a bug in my scan function because the following
doesn't work:

136> CRLF = <<13, 10>>.
<<13, 10>>

137> Test = <<67, 13, 10>>.
<<67, 13, 10>>

%%%%%%%%%%% This looks good!
138> bin_utils:scan(Test, CRLF, 16, 0, 0, []).
[0, 8, 24, 24]

139> L1 = <<"how are you">>.
<<104,111,119,32,97,114,101,32,121,111,117>>

140> L2 = <<"just fine.">>.
<<106,117,115,116,32,102,105,110,101,46>>

141> L3 = <<"how about you?">>.
<<104,111,119,32,97,98,111,117,116,32,121,111,117,63>>

142> L4 = <<L1/binary, CRLF/binary, L2/binary, CRLF/binary, L3/binary>>.
<<104,111,119,32,97,114,101,32,121,111,117,13,10,106,117,115,116,32,102,105, 
110,101,46,13,10,104,111,119,32,...>>

%%%%%%%%%%  Ooops.
143> bin_utils:scan(L4, CRLF, 16, 0, 0, []).
[0,312]
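
A possible cause, stated as an assumption rather than a diagnosis: the
unsized Pattern/binary segment in the third clause of scan. A binary
segment without a size is only expected at the end of a pattern, so here
it may be taking everything that remains, and the clause would then fire
only when the remaining data equals Pattern exactly (as it does for
<<67, 13, 10>>). The sketch below gives the segment an explicit size and
counts offsets in bytes instead of bits; bin_scan and positions/2 are
names invented for this example.

```erlang
-module(bin_scan).
-export([positions/2]).

%% Return the byte offsets at which Pattern occurs in Bin
%% (non-overlapping, left to right).
positions(Bin, Pattern) ->
    positions(Bin, Pattern, size(Pattern), 0, []).

positions(Bin, Pattern, PLen, Pos, Acc) ->
    case Bin of
        %% Skip Pos bytes, then require exactly the bytes of Pattern.
        <<_:Pos/binary, Pattern:PLen/binary, _/binary>> ->
            positions(Bin, Pattern, PLen, Pos + PLen, [Pos | Acc]);
        %% At least one more byte at Pos: move on by one byte.
        <<_:Pos/binary, _, _/binary>> ->
            positions(Bin, Pattern, PLen, Pos + 1, Acc);
        %% Pos is at or past the end of the binary.
        _ ->
            lists:reverse(Acc)
    end.
```

Because the offsets are byte counts, they can be used directly as sizes
for /binary segments when extracting the lines afterwards.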




I can't even figure out how to determine the length of
the binary without converting it to a list.
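
One candidate worth trying is the size/1 BIF, which accepts binaries as
well as tuples and returns the number of bytes. A minimal sketch, with
bin_len and len/1 as illustrative names; newer releases also offer
byte_size/1, though I have not assumed it is available here.

```erlang
-module(bin_len).
-export([len/1]).

%% Number of bytes in a binary, without converting it to a list.
len(Bin) when is_binary(Bin) ->
    size(Bin).
```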

jay


---------------------------------------------------
DuoMark International, Inc.
6523 Colgate Avenue, Suite 325
Los Angeles, CA  90048-4410 / USA
Voice: +1 323 381-0001
FAX: +1 323 549 0172
Email: jay@REDACTED
WWW: http://www.duomark.com/



