[erlang-questions] Concurrent processes on multi-core platforms with lots of chatter

Tue Dec 1 04:54:58 CET 2009

Off the top of my head, I would expect this to be a process_flag.

Something like:  process_flag(scheduler_affinity, term()).  Possibly with a generic group specified by an atom like undefined.  This feels more functional than the proposed paf module, and has the benefit of being data-centric.

The reason I would use a term (and then group by the hash of the term) is that it gives an elegant way to group processes by an arbitrary (possibly application-specific) key.  Imagine if, for example, Mnesia grouped processes by a transaction ID, or if CouchDB grouped them by socket connection, etc.  By not restricting the key to an atom or an integer, it lets you just use whatever is appropriate for the application.
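
To make the hashing idea concrete, here is a minimal sketch, assuming the group is simply the hash of the key taken modulo the number of schedulers.  The module name affinity_sketch and the function slot are made up for illustration; only erlang:phash2/2 and erlang:system_info(schedulers) are real calls, and nothing here actually binds a process to a scheduler.

```erlang
-module(affinity_sketch).
-export([slot/1, slot/2]).

%% Hypothetical helper: map an arbitrary affinity key to a slot in 1..N.
%% By default N is the number of schedulers in the running VM.
slot(Key) ->
    slot(Key, erlang:system_info(schedulers)).

%% erlang:phash2/2 accepts any term and returns a value in 0..N-1,
%% so equal keys always land in the same slot.
slot(Key, N) when is_integer(N), N > 0 ->
    erlang:phash2(Key, N) + 1.
```

Two processes that both compute affinity_sketch:slot({txn, TxnId}) get the same slot without any coordination, which is the property the term-based flag relies on.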

I'm not too keen on reusing process groups primarily because group leaders are used for some really common stuff like IO, which shouldn't affect affinity at all.

If we want to be really crazy, we could provide the ability to specify something like a MatchSpec to map a process group to a processor.  Call it a SchedSpec.  This has the added bonus that you could have multiple handlers that would match in order without the full-blown overhead of a gen_event or an arbitrary fun.  This might also provide the beginnings of more powerful prioritization than the existing process_flag(priority, Level) we have now.
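
As a rough illustration of "match in order", a SchedSpec could be written as an ordinary ETS match specification whose body names a scheduler pool.  This is only a sketch under that assumption: the module schedspec_sketch, the {Kind, Id} key shape, and the pool names system, io and compute are all invented here.  ets:test_ms/2 is the real call used to evaluate the spec against a key.

```erlang
-module(schedspec_sketch).
-export([example_spec/0, classify/2]).

%% A hypothetical SchedSpec expressed as an ETS match specification.
%% Clauses are tried in order; the first head that matches wins.
example_spec() ->
    [{{system, '_'}, [], [system]},
     {{io, '_'},     [], [io]},
     {{'_', '_'},    [], [compute]}].

%% Classify a tuple affinity key (ets:test_ms/2 requires a tuple).
%% Returns the pool name, or undefined if no clause matches.
classify(Key, Spec) when is_tuple(Key) ->
    case ets:test_ms(Key, Spec) of
        {ok, false}        -> undefined;
        {ok, Pool}         -> Pool;
        {error, _} = Error -> Error
    end.
```

Because the spec is plain data, it could be swapped at runtime or extended with guards, which is what makes it lighter than a callback module.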

Currently, the use case that people seem to be concerned with is ensuring locality of execution.  However, some people might also want to use it to dedicate cores to things like system processing.  I have no idea how this would fit with things like the AIO threads, but I'm pretty sure that HPC could benefit from, for example, dedicating 1 scheduler to system management tasks, 1 core to IO, and 6 cores to computation.  This is a higher bar, but it's important nonetheless.
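
The 1/1/6 split could be sketched as a lookup table from pool name to scheduler ids, with keys spread across a pool by hash.  Everything named here (pool_sketch, layout, scheduler_for, the pool atoms, and the assumption of exactly 8 schedulers) is hypothetical, and the function only computes an id; it does not pin anything.

```erlang
-module(pool_sketch).
-export([layout/0, scheduler_for/3]).

%% Hypothetical layout for an 8-scheduler node: one scheduler for
%% system management, one for IO, six for computation.
layout() ->
    [{system, [1]}, {io, [2]}, {compute, lists:seq(3, 8)}].

%% Pick a scheduler id from the named pool. Keys are spread over the
%% pool by hash, so the same key always maps to the same scheduler.
scheduler_for(Pool, Key, Layout) ->
    {Pool, Ids} = lists:keyfind(Pool, 1, Layout),
    lists:nth(erlang:phash2(Key, length(Ids)) + 1, Ids).
```

A deployment with a different core count would only need a different layout list; the lookup itself stays the same.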

Of course, this would have the user thinking about the underlying CPU topology (which I agree is bad).  However, this is simply unavoidable in HPC, so it's best that we accept it.  Let me state this emphatically: if we try to make Erlang "smart" about scheduling, what is going to happen is that HPC people will dig down, figure out what it's doing wrong, then come back with complaints.  We will never be able to make it work right for everyone without exposing these same tunables (but likely with a crappier interface).  It's better to give them powerful hooks to customize the scheduler, with smart default behavior for everyone else.

The reason I like the process_flag(scheduler_affinity) / SchedSpec option is that it can easily start out with just the process_flag, and add something like SchedSpecs later, without having to change the API (or particularly the default behavior).  Basically, you get three groups of users:

* Normal People: They don't use affinity, although pieces of the system might. (effectively implemented already)
* Locality Users: They use affinity for locality using the convenient process_flag interface. (easily done with additional process_flag)
* HPC: They use affinity, and plug in SchedSpecs that are custom to their deployment. (can be provided when demanded without breaking first two groups)

On Nov 30, 2009, at 6:49 PM, Robert Virding wrote:

> Another solution would be to use the existing process groups as these are
> not really used very much today. A process group is defined as all the
> processes which have the same group leader. It is possible to change group
> leader. Maybe the VM could try to migrate processes to the same core as
> their group leader.
> 
> One problem today is that afaik the VM does not keep track of groups as
> such, it would have to do this to be able to load balance efficiently.
> 
> Robert
> 
> 2009/11/30 Evans, Matthew <mevans@REDACTED>
> 
>> Hi,
>> 
>> I've been running messaging tests on R13B02, using both 8 core Intel and 8
>> core CAVIUM processors. The tests involve two or more processes that do
>> nothing more than sit in a loop exchanging messages as fast as they can.
>> These tests are, of course, not realistic (as in real applications do more
>> than sit in a tight loop sending messages), so my findings will likely not
>> apply to a real deployment.
>> 
>> First the good news: When running tests that do more than just message
>> passing the SMP features of R13B02 are leaps and bounds over R12B05 that I
>> was running previously. What I have however noticed is that in a pure
>> messaging test (lots of messages, in a tight loop) we appear to run into
>> caching issues where messages are sent between processes that happen to be
>> scheduled on different cores. This got me into thinking about a future
>> enhancement to the Erlang VM: Process affinity.
>> 
>> In this mode two or more processes that have a lot of IPC chatter would be
>> associated into a group and executed on the same core. If the scheduler
>> needed to move one process to another core - they would all be relocated.
>> 
>> Although this grouping of processes could be done automatically by the VM I
>> believe the decision making overhead would be too great, and it would likely
>> make some poor choices as to what processes should be grouped together.
>> Rather I would leave it to the developer to make these decisions, perhaps
>> with a library similar to pg2.
>> 
>> For example, library process affinity (paf) could have the functions:
>> 
>> paf:create(Name,[Opts]) -> ok | {error, Reason}
>> paf:join(Name,Pid,[Opts]) -> ok | {error, Reason}
>> paf:leave(Name,Pid) -> ok
>> paf:members(Name) -> MemberList
>> 
>> An affinity group would be created with options for specifying the maximum
>> size of the group (to ensure we don't have all processes on one core), a
>> default membership time within a group (to ensure we don't unnecessarily
>> keep a process in the group when there is no longer a need) and maybe an
>> option to allow the group to be split over different cores if the group size
>> reaches a certain threshold.
>> 
>> A process would join the group with paf:join/3, and would be a member for
>> the default duration (with options here to override the settings specified
>> in paf:create). If the group is full the request is rejected (or maybe
>> queued). After a period of time the process is removed from the group and a
>> message {paf_leave, Pid} is sent to the process that issued the paf:join
>> command. If needed the process could be re-joined at that time with another
>> paf:join call.
>> 
>> Any takers? R14B01 perhaps ;-)
>> 
>> Thanks
>> 
>> Matt
>> 

-- 
Jayson Vantuyl
kagato@REDACTED