[erlang-questions] Keeping massive concurrency when interfacing with C

Peer Stritzinger peerst@REDACTED
Thu Oct 6 13:28:25 CEST 2011


On Wed, Oct 5, 2011 at 5:55 PM, Alceste Scalas <alceste@REDACTED> wrote:
>
> Hi, I'm one of the authors of the "HPTC with Erlang" work.
> You're right, nothing was publicly released except for the FFI
> implementation described in EEP 7 --- and since the project ended
> last year, I believe that nothing else will be released in the
> future.

It's a pity we have to start over with this.  I fear the number of
people interested in numerical (real-time) capabilities is limited, so
we will keep reinventing parts of the wheel.

>> The price you have to pay for the slapped on heavyweight
>> library is that these usually don't scale up to the number of
>> processes Erlang can handle.
>
> IMHO it mostly depends on:
>
>    1. the size of the operands you're working on;
>
>    2. the complexity of the foreign functions you're going to
>       call.

I agree.  That's one thing that makes it hard to find a generic
solution to numerical needs in Erlang.

> Our project was primarily focused on real-time numerical
> computing, and thus we needed a method for quickly calling
> "simple" numerical foreign functions (such as multiplications of
> relatively small (portions of) matrices).  Those functions, taken
> alone, would usually return almost immediately: in other words,
> their execution time was similar to that of regular BIFs.  We
> used BLAS because its optimized implementations are usually
> "fast enough", but (if necessary) we could have developed
> our own optimized C code.

I was looking at the more heavyweight BLAS implementations, which do
their own internal thread management to use all cores.  I should have
looked at the simpler BLAS implementations that are thread-safe and
single-threaded instead.
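
For reference, the Erlang side of such a wrapper stays tiny.  A
minimal sketch, with made-up module and function names, assuming a
matching C library compiled as blas_nif.so:

    -module(blas_nif).
    -export([dgemm/2]).
    -on_load(init/0).

    init() ->
        %% Load the native library; the real BLAS call lives in C.
        erlang:load_nif("./blas_nif", 0).

    %% Stub, replaced by the native implementation when the NIF loads.
    %% A and B are binaries holding row-major doubles.
    dgemm(_A, _B) ->
        erlang:nif_error(nif_not_loaded).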

> When more complicated formulas are assembled with repeated FFI
> calls to those simple functions, then the Erlang scheduler can
> kick in several times before the final result is obtained, thus
> guaranteeing VM responsiveness (albeit reducing the general
> numerical throughput).

The unavoidable trade-off between optimizing for real-time
responsiveness and optimizing for throughput.
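
In code the composition amounts to nothing more than ordinary
sequential calls; a sketch, reusing the hypothetical blas_nif wrapper
from above plus an equally hypothetical add/2:

    %% Y = A*B + C, assembled from two short native calls.  The
    %% scheduler can preempt the process between the calls, so a big
    %% formula never occupies a scheduler thread for long.
    affine(A, B, C) ->
        AB = blas_nif:dgemm(A, B),
        blas_nif:add(AB, C).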


> If the native calls performed by those 20k Erlang processes are
> not "heavy" enough, then introducing work queues may actually
> increase the Erlang VM load and internal lock contention, thus
> decreasing responsiveness (wrt plain NIF calls).  I suspect that
> some comparative benchmarking could be useful.

I'm currently experimenting with an n-dimensional array module in
Erlang that uses the metadata + binary buffer approach.  I'm building
all the operations I need in pure Erlang first and will then find the
places worth optimizing as NIFs.  I probably won't use an external
library, since my numerical needs are pretty specialized (e.g. lots of
multiplying bit vectors with float matrices).
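
To make the "metadata + binary buffer" idea concrete, this is roughly
the shape of it; a sketch only, the record and field names are
provisional:

    %% n-dim array: shape and element width as metadata, the data in
    %% one flat binary.
    -record(ndarray, {shape :: [pos_integer()],   % e.g. [Rows, Cols]
                      esize :: pos_integer(),     % element size in bits
                      data  :: binary()}).

    %% Pure-Erlang element access for a 2-dim array of 64-bit floats.
    at(#ndarray{shape = [_Rows, Cols], esize = 64, data = Bin}, R, C) ->
        Skip = (R * Cols + C) * 64,
        <<_:Skip, X:64/float, _/bitstring>> = Bin,
        X.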

> Maybe a next-generation, general-purpose numerical computing
> module for Erlang could adopt different strategies depending on
> the size of the operands passed to its functions:
>
>  1. if the vectors/matrices are "small enough", then the native
>     code could be called directly using NIFs;

This automation would probably be machine-dependent.  I can imagine
that the basic matrix operations could be handled like this, probably
auto-split into sub-matrix operations.  It would need some learning
phase to find the characteristics of the machine: basically, the
ratio of NIF call overhead to BLAS speed has to be measured.
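
The measurement itself can stay simple; a sketch of the kind of
calibration I mean, again assuming the blas_nif wrapper above and a
hypothetical rand_matrix/1 helper:

    %% Average microseconds per call for a given matrix size; the
    %% break-even size where the fixed call overhead stops dominating
    %% gives the threshold between strategies.
    calibrate(Size, N) ->
        A = rand_matrix(Size),
        {Micros, _} = timer:tc(fun() ->
            [blas_nif:dgemm(A, A) || _ <- lists:seq(1, N)]
        end),
        Micros / N.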

>  2. otherwise, the operands could be passed to a separate worker
>     thread, which will later send back its result to the waiting
>     Erlang process (using enif_send()).
>
> In the second case, the future NIF extensions planned by OTP
> folks may be very useful --- see Rickard Green's talk at the SF
> Bay Area Erlang Factory 2011: http://bit.ly/eH61tX

This would be useful for routines with intermediate run times.
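
On the calling side that second case could look like the following
sketch, assuming a hypothetical big_dgemm/2 NIF that queues the work
on a worker thread and later replies via enif_send():

    %% The NIF returns a reference immediately; the worker thread sends
    %% {Ref, Result} back to the calling process when it is done.
    big_multiply(A, B) ->
        Ref = blas_nif:big_dgemm(A, B),
        receive
            {Ref, Result} -> Result
        after 5000 ->
            {error, timeout}
        end.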

>> For really heavy numerical stuff I think the best way is to do
>> this in systems that are built for it and interface them
>> somehow to Erlang with ports or sockets.
>
> Sure, but the problem with this approach is that you may need to
> constantly (de)serialize and transfer large numerical arrays
> among the Erlang VM and the external number crunching systems,
> thus wasting processor cycles, and memory/network bandwidth.

For run times in the minutes-to-hours range and very complicated code
this is probably still the way to go.  There is always the question of
Erlang VM stability when running heavy numerical code in-process;
ports are very nice from the dependability standpoint, which is
probably an issue for the trading-system example that started this
thread.
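
For completeness, a sketch of the port variant; the external program
name and its protocol are invented here, and it assumes the
number-crunching side speaks length-prefixed external term format:

    start_cruncher() ->
        open_port({spawn_executable, "./cruncher"},
                  [{packet, 4}, binary, exit_status]).

    %% The (de)serialization cost mentioned above is exactly this
    %% term_to_binary/binary_to_term round trip; the payoff is that a
    %% crash of the cruncher cannot take the VM down with it.
    call_cruncher(Port, Request) ->
        port_command(Port, term_to_binary(Request)),
        receive
            {Port, {data, Bin}}         -> binary_to_term(Bin);
            {Port, {exit_status, Code}} -> {error, {port_exited, Code}}
        end.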

For my application I'll start from the Erlang side, trying to define a
nice API for n-dimensional, fixed-element-size matrices (sub-byte
sizes allowed) with some basic operations defined for them.  Then I'll
look at the minimum amount of NIF support needed to make this run at
least at modest speed.  I'll publish my code early, hoping others
might join in.
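
For the sub-byte element sizes, plain bit syntax already goes a long
way in the pure-Erlang version; a sketch (function names not final):

    %% Read element I from a flat buffer of Esize-bit unsigned
    %% elements; Esize can be e.g. 1, 2 or 4, no byte alignment needed.
    get_elem(Bin, I, Esize) ->
        Skip = I * Esize,
        <<_:Skip, V:Esize, _/bitstring>> = Bin,
        V.

    %% Bit vector times one float row, pure Erlang: keep the floats
    %% whose corresponding bit is 1 and sum them.
    bitvec_dot(BitVec, Floats) ->
        Bits = [B || <<B:1>> <= BitVec],
        lists:sum([F || {1, F} <- lists:zip(Bits, Floats)]).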

Regards
Peer Stritzinger

>
> Regards,
> --
> Alceste Scalas <alceste@REDACTED>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
>


