Binary matching

Jay Nelson jay@REDACTED
Mon Sep 30 01:19:41 CEST 2002

I am trying to use port communications and do as much
of the parsing / matching as possible on the binary data
to keep the speed up.  The problem is that I either don't
understand it, have some missing syntax, or it just doesn't
work as documented.

My initial testing was to use a web browser to contact the
active port, receive the GET request and split it into separate
lines (breaking on cr / lf) and printing it out to see the result.
This would be an easy way to see how to receive and parse
binary data.

Of course, the first thing I noticed was that I can only go
from a binary to a list of ints or a list of ints to a binary.
There is no documented way to print a binary as a string,
which makes things a bit difficult to review.
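
One workaround for reviewing output, offered as a sketch rather than
anything the binary documentation spells out: convert the binary with
binary_to_list/1 and print the resulting list with the ~s directive
of io:format/2. The variable names are only for illustration.

```erlang
Bin = <<"just a test">>.
%% A list of character codes is what Erlang treats as a string.
Str = binary_to_list(Bin).
%% ~s prints the list as text instead of as a list of ints.
io:format("~s~n", [Str]).
```

This copies the data into a list, so it is probably best kept for
debugging rather than for the fast path.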

Ignoring that, I forged ahead with a little practice matching
in the interpreter:

Eshell V5.1.2  (abort with ^G)
1> Bin = <<"just a test">>.
2> <<W1:32/binary, Rest/binary>> = Bin.
** exited: {{badmatch,<<106,117,115,116,32,97,32,116,101,115,116>>},
             [{erl_eval,expr,3}]} **
3> <<W1:32, Rest/binary>> = Bin.
4> b().
Bin = <<106,117,115,116,32,97,32,116,101,115,116>>
Rest = <<32,97,32,116,101,115,116>>
W1 = 1786082164

OK, I can stick a string in a binary but I can't split it in two
by specifying a binary length and then getting the rest, but
I can pull off a 32-bit int and get the rest as a binary.
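
A likely explanation, assuming the default unit of a /binary segment
is 8 bits: the size in W1:32/binary is counted in bytes, so it asks
for 32 bytes from an 11-byte binary and the match fails. Asking for
4 bytes should split the binary as intended. W2 and Rest2 are names
picked here only to avoid clashing with the bindings above.

```erlang
Bin = <<"just a test">>.
%% The size of a /binary segment is in bytes: take the first 4 bytes.
<<W2:4/binary, Rest2/binary>> = Bin.
%% W2 is <<"just">> and Rest2 is <<" a test">>.
```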

5> <<Beginning/binary, 32, End/binary>> = Bin.
** exited: {{badmatch,<<106,117,115,116,32,97,32,116,101,115,116>>},
             [{erl_eval,expr,3}]} **
6> <<106, 117,115, 116, 32, End/binary>> = Bin.
7> b().
Bin = <<106,117,115,116,32,97,32,116,101,115,116>>
End = <<97,32,116,101,115,116>>
Rest = <<32,97,32,116,101,115,116>>
W1 = 1786082164
10> os:version().
11> os:type().

and so on...   Various tests basically conclude that I can only
match with the last term as a binary, and that the beginning must
match exactly byte for byte or else I get badmatch.  I cannot even
specify an exact binary length for the leading segments, although
I can specify a length and strip it off as a big integer
(e.g., <<Bignum:80, Rest/binary>> = Data).

Do I have the wrong version or is this the intended functionality?
(R8B2 on both Windows 98 and Red Hat 7.3 compiled from scratch
both running Erlang 5.1.2).  I read the documentation a little more
closely as I was writing this and it didn't say that the sizes had to
be bound, but if they are specified they must be bound.  It implied
that <<B1:32/binary, B2:32/binary>> = EightBytes was allowed.
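
For comparison, a sketch of the documentation's example under the
assumption that /binary sizes are byte counts: an eight-byte value
splits with a size of 4, not 32, and the size may be a variable as
long as it is bound before the match. The contents of EightBytes are
made up for the illustration.

```erlang
EightBytes = <<1,2,3,4,5,6,7,8>>.
%% N is bound before the match, so it may be used as a segment size.
N = 4.
<<B1:N/binary, B2:N/binary>> = EightBytes.
%% B1 is <<1,2,3,4>> and B2 is <<5,6,7,8>>.
%% The same bytes read as two 32-bit integers (sizes here are bits):
<<I1:32, I2:32>> = EightBytes.
```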

I want to write:

breakLines(Binary) -> lines(Binary, []).

lines(<<>>, Acc) -> lists:reverse(Acc).
lines(<<Line/binary, 13, 10, Rest/binary>>, Acc) ->
	lines(Rest, [Line | Acc]).

Seems straightforward and not too difficult for the pattern-matcher,
but then I'm no compiler writer.
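
The clause above puts an unsized binary segment (Line/binary) ahead
of other segments, which the matcher does not seem to allow. A
workaround sketch, assuming only the last segment may omit its size:
carry a byte count N and grow it until CR LF follows the first N
bytes. The module name bin_lines and the function break_lines/1 are
hypothetical.

```erlang
-module(bin_lines).
-export([break_lines/1]).

%% Split a binary into lines on CR LF, returning a list of binaries.
break_lines(Bin) -> break_lines(Bin, 0, []).

%% N is the number of bytes already known not to start a CR LF pair.
%% N is bound before each case, so it can be used as a segment size.
break_lines(Bin, N, Acc) ->
    case Bin of
        <<Line:N/binary, 13, 10, Rest/binary>> ->
            %% CR LF found after N bytes: emit the line, restart on Rest.
            break_lines(Rest, 0, [Line | Acc]);
        <<_:N/binary, _, _/binary>> ->
            %% At least one more byte to look at: advance by one.
            break_lines(Bin, N + 1, Acc);
        <<>> ->
            %% Nothing left over after the last CR LF.
            lists:reverse(Acc);
        _ ->
            %% Trailing data without a final CR LF counts as a line.
            lists:reverse([Bin | Acc])
    end.
```

For example, bin_lines:break_lines(<<"GET /", 13, 10, "Host: x">>)
should return the two lines as binaries without ever building a list
of ints.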

Instead I am left with converting to a list of ints and writing a few
helper functions to loop over the list pulling out chars using two
accumulators (one for the current line and one for the list of lines).

Am I too worried about efficiency?  Should I forget about binaries?
Am I doing something wrong?

Is it more efficient to make a pass across the binary looking for
the location of all <<13, 10>> pairs, returning a list of the number
of bytes between them, and then doing binary matching now that I
know how many bytes to specify on the initial segments?
<little coding break for a couple hours>

I tried this and ended up with the same problem:
-------------------------- cut here --------------------------


breakLines(Binary) ->
     StartStop = scan(Binary, <<13,10>>, 16, 0, 0, []),
     extract(Binary, StartStop, []).

extract(<<>>, _Locs, Acc) -> lists:reverse(Acc);
extract(Data, [Start, Stop | Rest], Acc) ->
     Len = Stop - Start,
     %%%%%%%%%  Here is the problem %%%%%%%%%%%%
     %% It does little good to get Line as an int %%
     <<Front:Start, Line:Len/binary, Back/binary>> = Data,
     extract(Back, Rest, [Line | Acc]).

scan(<<>>, _Pattern, _Len, 0, 0, Acc) -> lists:reverse(Acc);
scan(<<>>, _Pattern, _Len, Start, End, Acc) ->
     lists:reverse([End | [Start | Acc]]);
scan(<<Pattern/binary, Rest/binary>>, Pattern, Len, Start, End, Acc) ->
     scan(Rest, Pattern, Len, End + Len, End + Len, [End | [Start | Acc]]);
scan(Data, Pattern, Len, Start, End, Acc) ->
     <<Nomatch, Rest/binary>> = Data,
     scan(Rest, Pattern, Len, Start, End + 8, Acc).

---------------------- end cut ------------------------------
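
A sketch of the same two-pass idea with offsets kept in bytes rather
than bits, so that they can be used directly as /binary segment
sizes. It hardcodes CR LF instead of taking a pattern argument, and
it slices each line out of the original binary instead of out of the
remainder, so the offsets stay valid. The module name bin_scan and
the functions offsets/1, break_lines/1 and slice/3 are hypothetical,
not a corrected version of the code above.

```erlang
-module(bin_scan).
-export([break_lines/1, offsets/1]).

%% Pass 1: list of {Start, Len} byte positions, one per line.
offsets(Bin) -> offsets(Bin, 0, 0, []).

%% Start is where the current line begins, Pos is the byte being tested.
offsets(Bin, Start, Pos, Acc) ->
    case Bin of
        <<_:Pos/binary, 13, 10, _/binary>> ->
            %% CR LF at Pos: the line spans Start .. Pos - 1.
            offsets(Bin, Pos + 2, Pos + 2, [{Start, Pos - Start} | Acc]);
        <<_:Pos/binary, _, _/binary>> ->
            %% Some other byte at Pos: keep scanning.
            offsets(Bin, Start, Pos + 1, Acc);
        _ when size(Bin) > Start ->
            %% End of data with an unterminated final line.
            lists:reverse([{Start, size(Bin) - Start} | Acc]);
        _ ->
            lists:reverse(Acc)
    end.

%% Pass 2: cut each line out of the original binary.
break_lines(Bin) ->
    [slice(Bin, Start, Len) || {Start, Len} <- offsets(Bin)].

%% Both sizes are byte counts bound before the match.
slice(Bin, Start, Len) ->
    <<_:Start/binary, Line:Len/binary, _/binary>> = Bin,
    Line.
```

With the three-byte test value <<67, 13, 10>>, offsets/1 should give
[{0, 1}], that is, one line of one byte starting at byte 0.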

Does this avoid copying the binary (except for inside the function
extract)?  Is looping over the binary more efficient than looping over
a list of integers?

NOTE: There is a bug in my scan function because the following
doesn't work:

136> CRLF = <<13, 10>>.
<<13, 10>>

137> Test = <<67, 13, 10>>.
<<67, 13, 10>>

%%%%%%%%%%% This looks good!
138> bin_utils:scan(Test, CRLF, 16, 0, 0, []).
[0, 8, 24, 24]

139> L1 = <<"how are you">>.

140> L2 = <<"just fine.">>.

141> L3 = <<"how about you?">>.

142> L4 = <<L1/binary, CRLF/binary, L2/binary, CRLF/binary, L3/binary>>.

%%%%%%%%%%  Ooops.
143> bin_utils:scan(L4, CRLF, 16, 0, 0, []).

I can't even figure out how to determine the length of
the binary without converting it to a list.
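
One thing that may help here: the size/1 BIF, which is normally used
on tuples, also accepts a binary and returns its length in bytes.
A minimal sketch:

```erlang
Bin = <<"just a test">>.
%% size/1 on a binary gives the number of bytes, with no list conversion.
Len = size(Bin).
%% Len is 11.
```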


DuoMark International, Inc.
6523 Colgate Avenue, Suite 325
Los Angeles, CA  90048-4410 / USA
Voice: +1 323 381-0001
FAX: +1 323 549 0172
Email: jay@REDACTED
