Binary matching
Per Bergqvist
per@REDACTED
Mon Sep 30 10:01:25 CEST 2002
Hi Jay,
you should specify the size of a binary segment in units, not in bits.
For a binary the unit size is 8 bits, so the size counts bytes.
I.e. use:
<<W1:4/binary, Rest/binary>> = Bin.
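
A minimal sketch of how the unit rule might be applied to the line-splitting problem quoted below. The module name bin_demo and the functions split_at/2 and break_lines/1 are hypothetical names chosen for illustration. The sketch assumes that only the last segment of a pattern may be a binary without a size, so it counts bytes up to the next CR LF and then matches with a bound size:

```erlang
-module(bin_demo).
-export([split_at/2, break_lines/1]).

%% Split a binary after N bytes. A /binary segment has unit 8,
%% so the size N counts bytes, not bits.
split_at(N, Bin) when N >= 0, N =< size(Bin) ->
    <<Front:N/binary, Rest/binary>> = Bin,
    {Front, Rest}.

%% Break a binary into lines on CR LF (13, 10).
%% N is the number of bytes scanned so far in the current line.
break_lines(Bin) ->
    break_lines(Bin, 0, []).

break_lines(Bin, N, Acc) ->
    case Bin of
        %% CR LF found after N bytes: emit the line, restart on Rest.
        <<Line:N/binary, 13, 10, Rest/binary>> ->
            break_lines(Rest, 0, [Line | Acc]);
        %% At least one more byte beyond offset N: keep scanning.
        <<_:N/binary, _, _/binary>> ->
            break_lines(Bin, N + 1, Acc);
        %% Nothing left after the last CR LF.
        <<>> ->
            lists:reverse(Acc);
        %% Trailing data without a final CR LF: keep it as the last line.
        _ ->
            lists:reverse([Bin | Acc])
    end.
```

With the Bin from the shell session, bin_demo:split_at(4, Bin) should give {<<"just">>, <<" a test">>}. The length of a binary in bytes is available as size(Bin), with no conversion to a list. For printing, io:format("~s~n", [binary_to_list(Bin)]) shows the contents as a string.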
/Per
> I am trying to use port communications and do as much
> of the parsing / matching as possible on the binary data
> to keep the speed up. The problem is that I either don't
> understand it, have some missing syntax, or it just doesn't
> work as documented.
>
> My initial testing was to use a web browser to contact the
> active port, receive the GET request and split it into separate
> lines (breaking on cr / lf) and printing it out to see the result.
> This would be an easy way to see how to receive and parse
> binary data.
>
> Of course, the first thing I noticed was that I can only go
> from a binary to a list of ints or a list of ints to a binary.
> There is no documented way to print a binary as string
> which makes things a bit difficult to review.
>
> Ignoring that, I forged ahead with a little practice matching
> in the interpreter:
>
> Eshell V5.1.2 (abort with ^G)
> 1> Bin = <<"just a test">>.
> <<106,117,115,116,32,97,32,116,101,115,116>>
> 2> <<W1:32/binary, Rest/binary>> = Bin.
> ** exited: {{badmatch,<<106,117,115,116,32,97,32,116,101,115,116>>},
> [{erl_eval,expr,3}]} **
> 3> <<W1:32, Rest/binary>> = Bin.
> <<106,117,115,116,32,97,32,116,101,115,116>>
> 4> b().
> Bin = <<106,117,115,116,32,97,32,116,101,115,116>>
> Rest = <<32,97,32,116,101,115,116>>
> W1 = 1786082164
> ok
>
> OK, I can stick a string in a binary but I can't split it in two
> by specifying a binary length and then getting the rest, but
> I can pull off a 32-bit int and get the rest as a binary.
>
> 5> <<Beginning/binary, 32, End/binary>> = Bin.
> ** exited: {{badmatch,<<106,117,115,116,32,97,32,116,101,115,116>>},
> [{erl_eval,expr,3}]} **
> 6> <<106, 117,115, 116, 32, End/binary>> = Bin.
> <<106,117,115,116,32,97,32,116,101,115,116>>
> 7> b().
> Bin = <<106,117,115,116,32,97,32,116,101,115,116>>
> End = <<97,32,116,101,115,116>>
> Rest = <<32,97,32,116,101,115,116>>
> W1 = 1786082164
> 10> os:version().
> {4,10,67766222}
> 11> os:type().
> {win32,windows}
>
> and so on... Various tests basically conclude that I can only
> match with the last term as a binary, and that the beginning must
> match exactly byte for byte or else I get badmatch. I cannot even
> specify an exact binary length for the leading segments, although
> I can specify a length and strip it off as a big integer
> (e.g., <<Bignum:80, Rest/binary>> = Data).
>
> Do I have the wrong version or is this the intended functionality?
> (R8B2 on both Windows 98 and Red Hat 7.3 compiled from scratch
> both running Erlang 5.1.2). I read the documentation a little more
> closely as I was writing this and it didn't say that the sizes had to
> be bound, but if they are specified they must be bound. It implied
> that <<B1:32/binary, B2:32/binary>> = EightBytes was allowed.
>
> I want to write:
>
> breakLines(Binary) -> lines(Binary, []).
>
> lines(<<>>, Acc) -> lists:reverse(Acc).
> lines(<<Line/binary, 13, 10, Rest/binary>>, Acc) ->
> lines(Rest, [Line | Acc]).
>
> Seems straightforward and not too difficult for the pattern-matcher,
> but then I'm no compiler writer.
>
> Instead I am left with converting to a list of ints and writing a few
> helper functions to loop over the list pulling out chars using two
> accumulators (one for the current line and one for the list of lines).
>
> Am I too worried about efficiency? Should I forget about binaries?
> Am I doing something wrong?
>
> Is it more efficient to make a pass
> across the binary looking for the location of all <<13, 10>> pairs, returning
> a list of the number of bytes between them and then doing binary
> matching now that I know how many bytes to specify on the initial
> patterns?
>
> <little coding break for a couple hours>
>
> I tried this and ended up with the same problem:
> -------------------------- cut here --------------------------
>
> -module(bin_utils).
> -export([breakLines/1,extract/3,scan/6]).
>
> breakLines(Binary) ->
> StartStop = scan(Binary, <<13,10>>, 16, 0, 0, []),
> extract(Binary, StartStop, []).
>
> extract(<<>>, _Locs, Acc) -> lists:reverse(Acc);
> extract(Data, [Start, Stop | Rest], Acc) ->
> Len = Stop - Start,
> %%%%%%%%% Here is the problem %%%%%%%%%%%%
> %% It does little good to get Line as an int %%
> <<Front:Start, Line:Len/binary, Back/binary>> = Data,
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> extract(Back, Rest, [Line | Acc]).
>
> scan(<<>>, _Pattern, _Len, 0, 0, Acc) -> lists:reverse(Acc);
> scan(<<>>, _Pattern, _Len, Start, End, Acc) ->
> lists:reverse([End | [Start | Acc]]);
> scan(<<Pattern/binary, Rest/binary>>, Pattern, Len, Start, End, Acc) ->
> scan(Rest, Pattern, Len, End + Len, End + Len, [End | [Start | Acc]]);
> scan(Data, Pattern, Len, Start, End, Acc) ->
> <<Nomatch, Rest/binary>> = Data,
> scan(Rest, Pattern, Len, Start, End + 8, Acc).
>
> ---------------------- end cut ------------------------------
>
>
> Does this avoid copying the binary (except for inside the function
> extract)? Is looping over the binary more efficient than looping over
> a list of integers?
>
> NOTE: There is a bug in my scan function because the following
> doesn't work:
>
> 136> CRLF = <<13, 10>>.
> <<13, 10>>
>
> 137> Test = <<67, 13, 10>>.
> <<67, 13, 10>>
>
> %%%%%%%%%%% This looks good!
> 138> bin_utils:scan(Test, CRLF, 16, 0, 0, []).
> [0, 8, 24, 24]
>
> 139> L1 = <<"how are you">>.
> <<104,111,119,32,97,114,101,32,121,111,117>>
>
> 140> L2 = <<"just fine.">>.
> <<106,117,115,116,32,102,105,110,101,46>>
>
> 141> L3 = <<"how about you?">>.
> <<104,111,119,32,97,98,111,117,116,32,121,111,117,63>>
>
> 142> L4 = <<L1/binary, CRLF/binary, L2/binary, CRLF/binary, L3/binary>>.
> <<104,111,119,32,97,114,101,32,121,111,117,13,10,106,117,115,116,32,102,105,
> 110,101,46,13,10,104,111,119,32,...>>
>
> %%%%%%%%%% Ooops.
> 143> bin_utils:scan(L4, CRLF, 16, 0, 0, []).
> [0,312]
>
>
>
>
> I can't even figure out how to determine the length of
> the binary without converting it to a list.
>
> jay
>
>
> ---------------------------------------------------
> DuoMark International, Inc.
> 6523 Colgate Avenue, Suite 325
> Los Angeles, CA 90048-4410 / USA
> Voice: +1 323 381-0001
> FAX: +1 323 549 0172
> Email: jay@REDACTED
> WWW: http://www.duomark.com/
>
=========================================================
Per Bergqvist
Synapse Systems AB
Phone: +46 709 686 685
Email: per@REDACTED