Optimization help required: find 0,0,1 code in binary

Wed Feb 9 08:53:19 CET 2011

Hi! I would like to remove the NIF code from erlyvideo at some point,
so I want to try to fix some bottlenecks.

The first one is NAL start code lookup. H.264 video frames are packaged
in a byte stream where frames are separated either by 0,0,0,1 or by
0,0,1. It is not very clear when the 4-byte code should be used and
when the 3-byte one, so everybody looks for the 3-byte code and then
checks whether it is actually a 4-byte code.
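To illustrate the "look for the 3-byte code, then check whether it is a 4-byte code" idea, here is a minimal sketch in C. The function name find_start_code and its signature are hypothetical and do not come from erlyvideo; it only assumes a plain byte buffer.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from erlyvideo.
 * Scans buf[from..len) for the 3-byte pattern 0,0,1. On success it returns
 * the index where the start code begins and stores the code length (3 or 4)
 * in *code_len. Returns -1 if no start code is found. */
long find_start_code(const unsigned char *buf, size_t len,
                     size_t from, int *code_len) {
  for (size_t i = from; i + 2 < len; i++) {
    if (buf[i] == 0 && buf[i + 1] == 0 && buf[i + 2] == 1) {
      /* A zero byte right before the match turns 0,0,1 into 0,0,0,1. */
      if (i > from && buf[i - 1] == 0) {
        *code_len = 4;
        return (long)(i - 1);
      }
      *code_len = 3;
      return (long)i;
    }
  }
  return -1;
}
```

The scan only ever compares three bytes per position; the fourth byte is inspected once, after a match, which is why most parsers search for the short code first.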

Here is my Erlang implementation:

It looks for the first start code beginning at the required Offset,
then looks for the second border and extracts the binary between them.

https://github.com/erlyvideo/erlyvideo/blob/master/deps/mpegts/src/mpegts_reader.erl#L554

extract_nal_erl(Data) ->
  extract_nal_erl(Data, 0).

extract_nal_erl(Data, Offset) ->
  case Data of
    <<_:Offset/binary>> ->
      undefined;
    <<_:Offset/binary, 1:32, _Rest/binary>> ->
      extract_nal_erl(Data, Offset + 1, 0);
    <<_:Offset/binary, 1:24, _Rest/binary>> ->
      extract_nal_erl(Data, Offset, 0);
    _ ->
      extract_nal_erl(Data, Offset+1)
  end.

extract_nal_erl(Data, Offset, Length) ->
  case Data of
    <<_:Offset/binary, 1:24, NAL:Length/binary, 1:32, _/binary>> ->
      <<_:Offset/binary, 1:24, NAL:Length/binary, Rest/binary>> = Data,
      {ok, NAL, Rest};
    <<_:Offset/binary, 1:24, NAL:Length/binary, 1:24, _/binary>> ->
      <<_:Offset/binary, 1:24, NAL:Length/binary, Rest/binary>> = Data,
      {ok, NAL, Rest};
    <<_:Offset/binary, 1:24, NAL:Length/binary, 0>> ->
      {ok, NAL, <<>>};
    <<_:Offset/binary, 1:24, NAL:Length/binary>> ->
      {ok, NAL, <<>>};
    <<_:Offset/binary, 1:24, _/binary>> ->
      extract_nal_erl(Data, Offset, Length+1)
  end.

Here is my NIF implementation:

https://github.com/erlyvideo/erlyvideo/blob/master/deps/mpegts/src/mpegts_reader.c#L55

static int
find_nal(ErlNifBinary data, int i, int border) {
  if (i + 3 >= data.size) {
    return -1;
  }
  for(; i + 2 < data.size; i++) {
    if (i + 3 < data.size && data.data[i] == 0 && data.data[i+1] == 0
        && data.data[i+2] == 0 && data.data[i+3] == 1) {
      return border == START ? i + 4 : i;
    }
    if (data.data[i] == 0 && data.data[i+1] == 0 && data.data[i+2] == 1) {
      return border == START ? i + 3 : i;
    }
  }
  return -1;
}
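For readers who want to see how the START and END borders fit together, here is a hedged sketch of extracting one NAL with two lookups, mirroring what the Erlang version does. It is not the code from mpegts_reader.c: find_border is a plain-buffer variant of find_nal (so it builds without erl_nif.h, and without the early size check), and extract_nal is a hypothetical helper. Unlike the Erlang version, it does not strip a trailing zero byte at the end of the data.

```c
#include <stddef.h>

enum { START, END };  /* border modes, named as in find_nal */

/* Plain-buffer variant of find_nal (assumption: no ErlNifBinary here). */
int find_border(const unsigned char *data, size_t size, size_t i, int border) {
  for (; i + 2 < size; i++) {
    if (i + 3 < size && data[i] == 0 && data[i + 1] == 0 &&
        data[i + 2] == 0 && data[i + 3] == 1) {
      return (int)(border == START ? i + 4 : i);
    }
    if (data[i] == 0 && data[i + 1] == 0 && data[i + 2] == 1) {
      return (int)(border == START ? i + 3 : i);
    }
  }
  return -1;
}

/* Hypothetical helper: locates the first NAL in data.
 * Returns 0 and fills *nal_start and *nal_len, or -1 if there is no
 * start code. If no second border exists, the NAL runs to the end. */
int extract_nal(const unsigned char *data, size_t size,
                size_t *nal_start, size_t *nal_len) {
  int start = find_border(data, size, 0, START);
  if (start < 0) return -1;
  int end = find_border(data, size, (size_t)start, END);
  *nal_start = (size_t)start;
  *nal_len = (end < 0 ? size : (size_t)end) - (size_t)start;
  return 0;
}
```

The first call returns the offset just past the start code, the second returns the offset where the next start code begins, so the NAL is simply the bytes in between.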

It is much faster:

(ems@REDACTED)7> mpegts_reader:benchmark().
debug 2011-02-09 10:36:32 [mpegts] mpegts_reader:655 {"Timer erl",100000,2456324}
debug 2011-02-09 10:36:32 [mpegts] mpegts_reader:664 {"Timer native",100000,65452}
ok

The C implementation is roughly 40 times faster than the Erlang one
(2456324 vs 65452, about 37x). That looks a bit strange to me, because
it is exactly the same ratio as between these two CRC32
implementations:

https://github.com/erlyvideo/erlyvideo/blob/master/deps/mpegts/src/mpeg2_crc32.c
https://github.com/erlyvideo/erlyvideo/blob/master/deps/mpegts/src/mpeg2_crc32.erl

I have also tested HiPE (with the on_load function commented out, so
the NIF is not loaded):

(ems@REDACTED)10> c("deps/mpegts/src/mpegts_reader.erl", [{ebin, "deps/mpegts/ebin"},native]).
deps/mpegts/src/mpegts_reader.erl:90: Warning: function load_nif/0 is unused
deps/mpegts/src/mpegts_reader.erl:93: Warning: function load_nif/1 is unused
{ok,mpegts_reader}
(ems@REDACTED)11> mpegts_reader:benchmark().
debug 2011-02-09 10:50:40 [mpegts] mpegts_reader:655 {"Timer erl",100000,882046}
debug 2011-02-09 10:50:41 [mpegts] mpegts_reader:664 {"Timer native",100000,878927}
ok

So HiPE gives roughly a 3x speedup over Erlang bytecode (2456324 vs
882046), but it is still more than 10 times slower than C (882046 vs
65452, about 13x).

So my questions are:

1) Where should I look, and how should I profile Erlang code, to speed
up my Erlang implementation of byte pattern lookup?
2) Are there any tricks, such as adding all possible guards, that help
HiPE compile more efficient code?