[erlang-questions] NIF vs Erlang Binary

Fri Jul 22 13:03:57 CEST 2011

On Fri, Jul 22, 2011 at 11:45, Andy W. Song <wsongcn@REDACTED> wrote:

> I ran some unit tests on my code and found it slow (it can process
> about 24 MB/s) on a virtual machine. HiPE can double the performance, but
> that is still not quite enough. So I wrote a NIF to handle this. It is about
> 10-15x faster. Not only that, I feel that the C code is easier to write.

Blindly unrolling the Key a bit gives a factor of 3 speedup:

mask(Key, Data, Accu) ->
    K = binary:copy(<<Key:32>>, 512 div 32),
    <<LongKey:512>> = K,
    mask(Key, LongKey, Data, Accu).

mask(Key, LongKey, Data, Accu) ->
    case Data of
        <<A:512, Rest/binary>> ->
            C = binary:encode_unsigned(A bxor LongKey),
            mask(Key, LongKey, Rest, <<Accu/binary, C/binary>>);
        <<A:32, Rest/binary>> ->
            C = binary:encode_unsigned(A bxor Key),
            mask(Key, LongKey, Rest, <<Accu/binary, C/binary>>);
        <<A:24>> ->
            <<B:24, _:8>> = binary:encode_unsigned(Key),
            C = binary:encode_unsigned(A bxor B),
            <<Accu/binary,C/binary>>;
        <<A:16>> ->
            <<B:16, _:16>> = binary:encode_unsigned(Key),
            C = binary:encode_unsigned(A bxor B),
            <<Accu/binary,C/binary>>;
        <<A:8>> ->
            <<B:8, _:24>> = binary:encode_unsigned(Key),
            C = binary:encode_unsigned(A bxor B),
            <<Accu/binary,C/binary>>;
        <<>> ->
            Accu
    end.

Why the call to binary:encode_unsigned? Let's alter that pattern so the
integer is written straight into the result with an explicit size:

    case Data of
        <<A:512, Rest/binary>> ->
            C = A bxor LongKey,
            mask(Key, LongKey, Rest, <<Accu/binary, C:512>>);
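
For reference, here is a minimal sketch of the whole function with the
same change applied to every clause. The module name mask_sketch and the
mask/2 entry point are made up for illustration and are not from the
original code. One thing to be aware of: binary:encode_unsigned returns
the shortest big-endian encoding, so a result with leading zero bytes
comes out shorter than the input. Writing the integer with an explicit
size (C:512, C:32, ...) keeps the output the same length as the input.

```erlang
-module(mask_sketch).
-export([mask/2]).

%% Mask Data with a 32-bit Key, XORing 512 bits per step where possible.
mask(Key, Data) ->
    %% Repeat the 32-bit key 16 times to form one 512-bit integer.
    <<LongKey:512>> = binary:copy(<<Key:32>>, 512 div 32),
    mask(Key, LongKey, Data, <<>>).

mask(Key, LongKey, Data, Accu) ->
    case Data of
        <<A:512, Rest/binary>> ->
            C = A bxor LongKey,
            mask(Key, LongKey, Rest, <<Accu/binary, C:512>>);
        <<A:32, Rest/binary>> ->
            C = A bxor Key,
            mask(Key, LongKey, Rest, <<Accu/binary, C:32>>);
        %% 1-3 trailing bytes: XOR with the leading bytes of the key.
        <<A:24>> ->
            <<B:24, _:8>> = <<Key:32>>,
            <<Accu/binary, (A bxor B):24>>;
        <<A:16>> ->
            <<B:16, _:16>> = <<Key:32>>,
            <<Accu/binary, (A bxor B):16>>;
        <<A:8>> ->
            <<B:8, _:24>> = <<Key:32>>,
            <<Accu/binary, (A bxor B):8>>;
        <<>> ->
            Accu
    end.
```

Since XOR undoes itself, calling mask_sketch:mask(Key, Masked) on the
masked data should give back the original binary, which makes for a
cheap sanity check.
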

Now it is 5 times faster than the original, with the same result. The
NIF advantage is now a factor of 2-3, which is in the ballpark I would
expect. You are doing many more reallocations with the above solution
than with the C NIF version. What happens if we tune it some more?
Let's do runs of 8192 bits at a time...
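
To make that kind of tuning easier to try, the run width can be turned
into a parameter. The following is only a sketch under my own
assumptions: the module name mask_chunked, the mask/3 signature and the
wide/narrow helper names are hypothetical, and Bits is assumed to be a
positive multiple of 32 (512, 8192, ...).

```erlang
-module(mask_chunked).
-export([mask/3]).

%% Bits is the run width in bits; it must be a positive multiple of 32.
mask(Key, Data, Bits) when Bits > 0, Bits rem 32 =:= 0 ->
    %% Repeat the 32-bit key until it fills one run of Bits bits.
    <<LongKey:Bits>> = binary:copy(<<Key:32>>, Bits div 32),
    wide(Key, LongKey, Bits, Data, <<>>).

%% Consume full runs of Bits bits for as long as they are available.
wide(Key, LongKey, Bits, Data, Accu) ->
    case Data of
        <<A:Bits, Rest/binary>> ->
            C = A bxor LongKey,
            wide(Key, LongKey, Bits, Rest, <<Accu/binary, C:Bits>>);
        _ ->
            narrow(Key, Data, Accu)
    end.

%% Finish with 32-bit words, then any 1-3 trailing bytes.
narrow(Key, Data, Accu) ->
    case Data of
        <<A:32, Rest/binary>> ->
            narrow(Key, Rest, <<Accu/binary, (A bxor Key):32>>);
        <<>> ->
            Accu;
        _ ->
            %% XOR the tail with the leading bytes of the key.
            N = bit_size(Data),
            <<A:N>> = Data,
            <<B:N, _/bitstring>> = <<Key:32>>,
            <<Accu/binary, (A bxor B):N>>
    end.
```

With this, mask_chunked:mask(Key, Data, 512) and
mask_chunked:mask(Key, Data, 8192) should produce identical output, so
only the timing differs between widths.
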

9 times faster compared to the original here! I expect our speed will
converge to that of C if we turn it up even more and get the amount of
allocation/realloc/concatenation down.
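
If you want to compare variants yourself, a rough throughput measurement
can be built on timer:tc. This is just a sketch: the module name
mask_bench and the throughput/3 function are invented here, MaskFun
stands for whichever masking function you want to time, and a single
wall-clock measurement like this is noisy, so treat the numbers as
indicative only.

```erlang
-module(mask_bench).
-export([throughput/3]).

%% MaskFun is any fun that takes a binary and returns the masked binary.
%% Returns bytes per second as a float, measured over Runs calls.
throughput(MaskFun, Data, Runs) when is_function(MaskFun, 1), Runs > 0 ->
    {Micros, ok} = timer:tc(fun() -> repeat(MaskFun, Data, Runs) end),
    Bytes = byte_size(Data) * Runs,
    %% max/2 guards against a zero reading on very fast runs.
    Bytes * 1000000 / max(Micros, 1).

repeat(_MaskFun, _Data, 0) ->
    ok;
repeat(MaskFun, Data, N) ->
    _ = MaskFun(Data),
    repeat(MaskFun, Data, N - 1).
```

Wrap the function under test in a fun that closes over the key, for
example fun(Bin) -> my_mask(Key, Bin) end (my_mask being a placeholder
for your own function), and pass the same Data to each variant.
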

-- 
J.