Binary matching
Jay Nelson
jay@REDACTED
Mon Sep 30 01:19:41 CEST 2002
I am trying to use port communications and do as much
of the parsing / matching as possible on the binary data
to keep the speed up. The problem is that I either don't
understand it, have some missing syntax, or it just doesn't
work as documented.
My initial test was to use a web browser to contact the
active port, receive the GET request, split it into separate
lines (breaking on CR / LF), and print the result.
This seemed an easy way to see how to receive and parse
binary data.
Of course, the first thing I noticed was that I can only go
from a binary to a list of ints or from a list of ints to a binary.
There is no documented way to print a binary as a string,
which makes the results a bit difficult to review.
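
For reviewing binary data, one workable route is to convert it with
binary_to_list/1 and print the resulting list with the ~s control
sequence of io:format/2. A minimal sketch, assuming the binary holds
plain Latin-1 text and a current Erlang/OTP release; the module name
bin_print and its functions are hypothetical:

```erlang
-module(bin_print).
-export([to_string/1, show/1]).

%% Convert a binary to a string (a list of character codes).
to_string(Bin) when is_binary(Bin) ->
    binary_to_list(Bin).

%% Print the binary as text, followed by a newline.
show(Bin) ->
    io:format("~s~n", [to_string(Bin)]).
```

Calling bin_print:show(<<"just a test">>) prints the text rather than
the list of byte values.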
Ignoring that, I forged ahead with a little practice matching
in the interpreter:
Eshell V5.1.2 (abort with ^G)
1> Bin = <<"just a test">>.
<<106,117,115,116,32,97,32,116,101,115,116>>
2> <<W1:32/binary, Rest/binary>> = Bin.
** exited: {{badmatch,<<106,117,115,116,32,97,32,116,101,115,116>>},
[{erl_eval,expr,3}]} **
3> <<W1:32, Rest/binary>> = Bin.
<<106,117,115,116,32,97,32,116,101,115,116>>
4> b().
Bin = <<106,117,115,116,32,97,32,116,101,115,116>>
Rest = <<32,97,32,116,101,115,116>>
W1 = 1786082164
ok
OK, I can put a string in a binary, but I can't split it in two
by specifying a binary length and then taking the rest. I can,
however, pull off a 32-bit int and get the rest as a binary.
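
The badmatch at prompt 2 is consistent with how segment sizes are
counted. For a segment of type binary the size is given in bytes (the
default unit is 8 bits), so 32/binary asks for 32 bytes and the
11-byte Bin cannot supply them. For an integer segment the default
unit is 1 bit, which is why W1:32 succeeds. A minimal sketch of a
byte-counted split; the module name bin_split and the function
split_at/2 are hypothetical:

```erlang
-module(bin_split).
-export([split_at/2]).

%% Split Bin after its first N bytes.
%% N must already be bound when the match runs; it is counted in
%% bytes because the segment type is binary.
split_at(N, Bin) when is_integer(N), N >= 0, N =< byte_size(Bin) ->
    <<Front:N/binary, Rest/binary>> = Bin,
    {Front, Rest}.
```

With the Bin from the transcript, bin_split:split_at(4, Bin) returns
{<<"just">>, <<" a test">>}.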
5> <<Beginning/binary, 32, End/binary>> = Bin.
** exited: {{badmatch,<<106,117,115,116,32,97,32,116,101,115,116>>},
[{erl_eval,expr,3}]} **
6> <<106, 117,115, 116, 32, End/binary>> = Bin.
<<106,117,115,116,32,97,32,116,101,115,116>>
7> b().
Bin = <<106,117,115,116,32,97,32,116,101,115,116>>
End = <<97,32,116,101,115,116>>
Rest = <<32,97,32,116,101,115,116>>
W1 = 1786082164
10> os:version().
{4,10,67766222}
11> os:type().
{win32,windows}
and so on... Various tests basically conclude that I can only
match with the last term as a binary, and that the beginning must
match exactly byte for byte or else I get badmatch. I cannot even
specify an exact binary length for the leading segments, although
I can specify a length and strip it off as a big integer
(e.g., <<Bignum:80, Rest/binary>> = Data).
Do I have the wrong version or is this the intended functionality?
(R8B2 on both Windows 98 and Red Hat 7.3 compiled from scratch
both running Erlang 5.1.2). I read the documentation a little more
closely as I was writing this and it didn't say that the sizes had to
be bound, but if they are specified they must be bound. It implied
that <<B1:32/binary, B2:32/binary>> = EightBytes was allowed.
I want to write:
breakLines(Binary) -> lines(Binary, []).
lines(<<>>, Acc) -> lists:reverse(Acc);
lines(<<Line/binary, 13, 10, Rest/binary>>, Acc) ->
    lines(Rest, [Line | Acc]).
Seems straightforward and not too difficult for the pattern-matcher,
but then I'm no compiler writer.
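
A binary segment without a size is only accepted as the last segment
of a pattern, so the matcher cannot search for the 13, 10 pair by
itself. A byte offset can do the searching instead. The following is
a sketch, not a tuned implementation. It assumes lines end in CR LF
and that trailing text without a terminator counts as a final line;
the module name bin_lines is hypothetical. The match is done in a
case expression in the function body because a size used in a
pattern must already be bound when the pattern is evaluated.

```erlang
-module(bin_lines).
-export([break_lines/1]).

%% Split a binary into lines on CR LF (13, 10) without converting
%% it to a list.
break_lines(Bin) when is_binary(Bin) ->
    lines(Bin, 0, []).

%% Offset is the number of leading bytes of Bin already known not
%% to start a CR LF pair.
lines(Bin, Offset, Acc) ->
    case Bin of
        <<Line:Offset/binary, 13, 10, Rest/binary>> ->
            %% CR LF found at Offset: keep the line, restart on the rest.
            lines(Rest, 0, [Line | Acc]);
        <<_:Offset/binary, _, _/binary>> ->
            %% At least one more byte to examine: advance by one byte.
            lines(Bin, Offset + 1, Acc);
        <<>> ->
            lists:reverse(Acc);
        _ ->
            %% Trailing text with no CR LF is kept as a last line.
            lists:reverse([Bin | Acc])
    end.
```

Each step re-matches from the start of the current binary with a
larger offset, which keeps the code short; whether that is fast
enough for a given workload is something to measure rather than
assume.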
Instead I am left with converting to a list of ints and writing a few
helper functions to loop over the list pulling out chars using two
accumulators (one for the current line and one for the list of lines).
Am I too worried about efficiency? Should I forget about binaries?
Am I doing something wrong?
Would it be more efficient to make one pass across the binary to find
the location of every <<13, 10>> pair, return a list of the number of
bytes between them, and then do the binary matching once I know how
many bytes to specify in the leading patterns?
<little coding break for a couple hours>
I tried this and ended up with the same problem:
-------------------------- cut here --------------------------
-module(bin_utils).
-export([breakLines/1,extract/3,scan/6]).
breakLines(Binary) ->
    StartStop = scan(Binary, <<13,10>>, 16, 0, 0, []),
    extract(Binary, StartStop, []).

extract(<<>>, _Locs, Acc) -> lists:reverse(Acc);
extract(Data, [Start, Stop | Rest], Acc) ->
    Len = Stop - Start,
    %%%%%%%%% Here is the problem %%%%%%%%%%%%
    %% It does little good to get Line as an int %%
    <<Front:Start, Line:Len/binary, Back/binary>> = Data,
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    extract(Back, Rest, [Line | Acc]).

scan(<<>>, _Pattern, _Len, 0, 0, Acc) -> lists:reverse(Acc);
scan(<<>>, _Pattern, _Len, Start, End, Acc) ->
    lists:reverse([End | [Start | Acc]]);
scan(<<Pattern/binary, Rest/binary>>, Pattern, Len, Start, End, Acc) ->
    scan(Rest, Pattern, Len, End + Len, End + Len, [End | [Start | Acc]]);
scan(Data, Pattern, Len, Start, End, Acc) ->
    <<Nomatch, Rest/binary>> = Data,
    scan(Rest, Pattern, Len, Start, End + 8, Acc).
---------------------- end cut ------------------------------
Does this avoid copying the binary (except for inside the function
extract)? Is looping over the binary more efficient than looping over
a list of integers?
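
On the marked line in extract, Front:Start consumes Start bits as an
integer, while Line:Len/binary consumes Len bytes, so a length
recorded in bits overshoots by a factor of eight. One way around
this, sketched under the assumption that offsets are kept in bytes
throughout, is shown here; the module name bin_slice and the function
slice/3 are hypothetical:

```erlang
-module(bin_slice).
-export([slice/3]).

%% Return the bytes of Bin from byte offset Start up to, but not
%% including, byte offset Stop. Both offsets are counted in bytes.
slice(Bin, Start, Stop)
  when Start >= 0, Stop >= Start, Stop =< byte_size(Bin) ->
    Len = Stop - Start,
    <<_:Start/binary, Line:Len/binary, _/binary>> = Bin,
    Line.
```

For example, bin_slice:slice(<<67, 13, 10>>, 0, 1) returns <<67>>.
A sub-binary matched out this way generally refers to the original
data rather than copying it.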
NOTE: There is a bug in my scan function because the following
doesn't work:
136> CRLF = <<13, 10>>.
<<13, 10>>
137> Test = <<67, 13, 10>>.
<<67, 13, 10>>
%%%%%%%%%%% This looks good!
138> bin_utils:scan(Test, CRLF, 16, 0, 0, []).
[0, 8, 24, 24]
139> L1 = <<"how are you">>.
<<104,111,119,32,97,114,101,32,121,111,117>>
140> L2 = <<"just fine.">>.
<<106,117,115,116,32,102,105,110,101,46>>
141> L3 = <<"how about you?">>.
<<104,111,119,32,97,98,111,117,116,32,121,111,117,63>>
142> L4 = <<L1/binary, CRLF/binary, L2/binary, CRLF/binary, L3/binary>>.
<<104,111,119,32,97,114,101,32,121,111,117,13,10,106,117,115,116,32,102,105,
110,101,46,13,10,104,111,119,32,...>>
%%%%%%%%%% Ooops.
143> bin_utils:scan(L4, CRLF, 16, 0, 0, []).
[0,312]
I can't even figure out how to determine the length of
the binary without converting it to a list.
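
The size/1 BIF returns the number of bytes in a binary directly,
with no list conversion, and current Erlang/OTP releases also offer
byte_size/1 for the same purpose. A minimal sketch using the latter;
the module name bin_len and the function len/1 are hypothetical:

```erlang
-module(bin_len).
-export([len/1]).

%% Number of bytes in a binary, obtained without building a list.
len(Bin) when is_binary(Bin) ->
    byte_size(Bin).
```

For the first test binary above, bin_len:len(<<"just a test">>)
returns 11.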
jay
---------------------------------------------------
DuoMark International, Inc.
6523 Colgate Avenue, Suite 325
Los Angeles, CA 90048-4410 / USA
Voice: +1 323 381-0001
FAX: +1 323 549 0172
Email: jay@REDACTED
WWW: http://www.duomark.com/