pipe messages sent between different nodes to boost speed

Tue Jun 9 14:13:46 CEST 2009

dear all,

two months ago i started an experiment to increase the speed of
messages sent between different erlang nodes. i seem to have found a
way to increase this speed considerably, up to 3 times the native
erlang speed, and would love to hear your feedback on this.

the idea is quite simple: queue all messages sent from one node to
another, and send them in groups. the whole concept is therefore to
have a gen_server, called 'qr', running on every node where message
passing takes place. when a process on node A sends a message to a
process on node B, the message is relayed by the two 'qr' servers:

process on node A => 'qr' on node A => 'qr' on node B => process on  
node B.
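
to make the relay concrete, here is a minimal sketch of what such a 'qr' server could look like. it is not the actual implementation: the api (send/3), the use of a dict as the per-node queue and the timer-based flush every 20 ms are all assumptions made for illustration.

```erlang
-module(qr).
-behaviour(gen_server).

%% hypothetical api, for illustration only
-export([start_link/0, send/3]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% assumed flush interval in milliseconds
-define(FLUSH_MS, 20).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% queue Msg for the process registered as Name on Node
send(Node, Name, Msg) ->
    gen_server:cast(?MODULE, {enqueue, Node, Name, Msg}).

init([]) ->
    erlang:send_after(?FLUSH_MS, self(), flush),
    {ok, dict:new()}.

handle_call(_Request, _From, Queues) ->
    {reply, {error, unsupported}, Queues}.

%% local side: append the message to the queue of its target node
handle_cast({enqueue, Node, Name, Msg}, Queues) ->
    {noreply, dict:append(Node, {Name, Msg}, Queues)};
%% remote side: unpack a group and deliver each message in order
handle_cast({batch, Batch}, Queues) ->
    lists:foreach(fun({Name, Msg}) -> catch Name ! Msg end, Batch),
    {noreply, Queues}.

%% on every tick, send one group per target node to the 'qr' there
handle_info(flush, Queues) ->
    dict:fold(fun(Node, Batch, ok) ->
                      gen_server:cast({?MODULE, Node}, {batch, Batch})
              end, ok, Queues),
    erlang:send_after(?FLUSH_MS, self(), flush),
    {noreply, dict:new()}.
```

with this sketch a sender calls qr:send(Node, Name, Msg) instead of {Name, Node} ! Msg. the price is up to one flush interval of extra latency per message, in exchange for far fewer (but larger) inter-node sends.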

this is something that is generally taken care of at a lower level
(tcp), with algorithms such as nagle's, but i've decided to try a
pipe/queuing mechanism at the erlang application level too, to see if
i could get any improvements.

a detailed explanation of the tests and benchmarks that i've performed
is available here: http://www.ostinelli.net/boost-message-passing-between-erlang-nodes
and updated code is available here: http://www.ostinelli.net/wp-content/uploads/2009/06/erlang_mq_boost_2.zip
for you to try out on your machine.

thanks to a note ulf wiger left on the post above, i've also tried
out the undocumented dist_nodelay kernel option, which did provide
improvements, but they are still far from the ones i'm seeing with the
'qr' pipe mechanism.
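
for anyone who wants to try the same option, a config fragment along these lines should do it. the value shown is an assumption: as far as i know the distribution sockets set tcp_nodelay by default, so false is the setting that lets nagle batch small packets.

```erlang
%% sys.config (hypothetical file name), loaded with: erl -sname a -config sys
%% assumption: false turns tcp_nodelay off on the distribution sockets
[{kernel, [{dist_nodelay, false}]}].
```

the same can be passed on the command line as: erl -kernel dist_nodelay false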

please note that i'm in no way claiming to have found something
great and new. i'm posting this here just because reactions to my
linked post have mainly told me to perform additional tcp
optimization, but i've personally been unable to find a way to
reproduce the results of the 'qr' pipe mechanism by mere tcp
optimization. also, the benchmarking test that i've used is quite
specific, since it first sends 200,000 messages in parallel, which are
then processed sequentially on the recipient node. this is because
the test reflects a real need of an application i'm developing, where
loads of client threads have to go through the bottleneck of a single
registered process.
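
the shape of that test can be sketched roughly as follows. this is not the benchmark from the linked post, just an illustration of the pattern: the module name, the registered name bench_sink and the ping message are made up, and the module is assumed to be loaded on both nodes.

```erlang
-module(bench).
-export([run/2, sink/2]).

%% run(Node, N): spawn N parallel senders that each send one message to
%% a single registered process on Node; returns elapsed milliseconds
run(Node, N) ->
    Sink = spawn(Node, ?MODULE, sink, [N, self()]),
    receive {Sink, ready} -> ok end,
    T0 = erlang:monotonic_time(millisecond),
    [spawn(fun() -> {bench_sink, Node} ! ping end) || _ <- lists:seq(1, N)],
    receive {Sink, done} -> ok end,
    erlang:monotonic_time(millisecond) - T0.

%% the bottleneck: one registered process handling messages one by one
sink(N, Parent) ->
    register(bench_sink, self()),
    Parent ! {self(), ready},
    loop(N),
    Parent ! {self(), done}.

loop(0) -> ok;
loop(N) -> receive ping -> loop(N - 1) end.
```

calling bench:run('b@host', 200000) from node a (the node name is a placeholder) measures the native path. replacing the send inside the fun with a call to the relay gives the number to compare against.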

therefore, any opinions on this are warmly welcome, so as to have a
better understanding of what is going on and hopefully produce better
software.

thank you in advance to those of you who took the time to read this
far, and even more to those who will [hopefully] give me some feedback.

cheers,

r.