[erlang-questions] : lock-free hash table

Wed Aug 8 12:59:19 CEST 2007

Raimo Niskanen wrote:
> Since I am not really into these problems my comments may be
> rumour'ish, but anyway...
> 
> We are trying to use Atomic operations, in combination with
> appropriate algorithms, when possible. Using Atomic-based
> (lock-free) research paper data structures is however not
> something we have done yet.
> 
> There may not be that many data structures in the Emulator
> that fit well to e.g FIFO queues, but it sure is worth
> keeping in mind.

As far as I know there are implementations of most common data 
structures available (trees, hash tables, lists, stacks, heap, dynamic 
memory allocators, ets) built around lock-free ideas.

See Tony Finch's email in this thread and also:
http://www.research.ibm.com/people/m/michael/podc-2002.pdf
http://www.research.ibm.com/people/m/michael/spaa-2002.pdf
http://www.cs.chalmers.se/%7Etsigas/pubs.html

> Up to just recently the SMP work has been about correctness.
> If it is not as stable as the non-SMP version, it is
> mostly useless. We will optimize while it is stable enough.

Indeed.

> For the record: Atomic operations (necessary for lock-free
> data structures) are not that innocent. "single instructions
> such as compare-and-swap" sure sound harmless, but for
> such an instruction to work in a multiprocessor machine
> it will have to invalidate caches on all other processors
> as well as write through to real (slow) memory, so it is
> a heavy operation. Lighter then mutexes, but the gain
> is much smaller than one might expect.

In both Uniform Memory Access (UMA) and Cache-Coherent Non-Uniform 
Memory Access (ccNUMA) shared memory architectures, if they support 
compare-and-swap (or load-linked/store-conditional) hardware primitive 
(which most of the modern systems do), it's implemented in a way that 
since CPUs share the same bus to accessing memory, the atomicity is 
accomplished by retaining a lock of the bus during compare and swap 
operations.  If the memory location is cached, the content of memory is 
modified internally in the cache, and the cache coherency hardware 
mechanism propagates this to caches of other processors and slow memory (*).

(*) See Section 7.1.4:
http://www.intel.com/design/processor/manuals/253667.pdf

When I last tested FIFO circular queues in producer/consumer model on a 
dual-core CPU (**) I observed 3 times performance gain compared to 
mutex-locking approach for threads in the same process and about 9 times 
performance increase for the same application between two OS processes 
using shared memory.  From references in other papers one could see as 
much at 10x performance boost as the number of threads & CPUs increases.

In any event, I am glad to hear that you are considering some atomic 
synchronization improvements in the SMP emulator and look forward to 
seeing the outcome!

(**) It was based on this algorithm:
http://www.cs.chalmers.se/%7Etsigas/papers/latest-spaa01.pdf

Regards,

Serge

> On Tue, Aug 07, 2007 at 10:37:18PM -0500, Serge Aleynikov wrote:
>> Another useful application of SMP-safe lock-free FIFO queues could be 
>> inside the emulator.  Correct me if I am wrong but in the emulator with 
>> SMP support different schedulers use mutex synchronization in order to 
>> get the next process descriptor off of a shared linked list of all 
>> processes waiting to be dispatched.  Also when copying a message sent by 
>> some Erlang process to a mailbox of another Erlang process the emulator 
>> with SMP support uses POSIX locking to ensure thread safety (if I 
>> remember it correctly).
>>
>> In both of these cases using lock-free FIFO queues can provide a much 
>> better scalability when used on multi-core CPUs compared to traditional 
>> mutex-based locking primitives.
>>
>> Serge
>>
>> Serge Aleynikov wrote:
>>> There have been quite a number of publications on the subject of 
>>> lock-free data structures since around mid 90's, most notably by Maged 
>>> Michael (IBM) and Philippas Tsigas / Yi Zhang (Chambers University). 
>>> Here are some selected links:
>>>
>>> http://erdani.org/publications/cuj-2004-12.pdf
>>> http://citeseer.ist.psu.edu/sundell04scalable.html
>>> http://www.noble-library.com/
>>> http://research.sun.com/techrep/2002/abstract-110.html
>>>
>>> The lock-free algorithms indeed perform exceptionally well in shared 
>>> memory concurrent systems with a large number of CPUs since they don't 
>>> require OS synchronization primitives (relying on a single instruction 
>>> such as compare-and-swap).  I also think that taking advantage of these
>>> algorithms in Erlang's emulator would make it even more scalable on 
>>> large number of cores, especially if interfaces to ports (via shared 
>>> memory) and drivers (via lock-free FIFO circular queues) would offer 
>>> lock-free synchronization with emulator threads.
>>>
>>> Serge
>>>
>>> Ulf Wiger wrote:
>>>> I just caught Dr. Cliff Click's Google TechTalk on his
>>>> lock-free hash table.
>>>>
>>>> http://video.google.com/videoplay?docid=2139967204534450862
>>>>
>>>> The (Java) source is also available
>>>> at sourceforge: http://sourceforge.net/projects/high-scale-lib
>>>>
>>>> According to his talk, the algorithm "scales well up to 768 cpus".
>>>>
>>>> I don't know if there is anything in this that would not fit Erlang
>>>> well, but perhaps it's something to look into?
>>>>
>>>> Anyway, it's always interesting to listen to these performance geeks. ;-)
>>>>
>>>> BR,
>>>> Ulf W
>>>> _______________________________________________
>>>> erlang-questions mailing list
>>>> erlang-questions@REDACTED
>>>> http://www.erlang.org/mailman/listinfo/erlang-questions
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://www.erlang.org/mailman/listinfo/erlang-questions
>