[erlang-questions] node to node message passing

Morten Krogh mk@REDACTED
Mon Sep 13 21:40:30 CEST 2010

  Hi Jan

Thanks for your answer. I guess we really disagree on this:)

I was stunned when I saw the nodes disconnecting in the middle of a 
large message passing,
and I think this must be improved at low level in the VM. I see it as 
similar to context switching.

Let me explain why I do that. But first, I will comment on the claim, 
that you and others make, that Erlang is not suited for large amounts of 
Why not? Erlang is implemented in C. Binaries can be stored as 
efficiently as in any other language. A binary can be sent to a socket 
using C functions.
Where is the fundamental problem?
Is this claim of Erlang being unsuited for large messages not just 
because people represent data with a space inefficient data structure, 
e.g., using a list of 4 byte integers instead of
a binary.

Back to message passing. A cluster of Erlang nodes, need to solve many 
tasks simultaneously. A task could, in  Erlang style, be solved by many 
cooperating processes, which use
message passing to communicate. There are many "tasks" being solved 
simultaneously, and they can have vastly different time profiles, and 
priorities. One task could be a fast
response to  a web application, and it might be implemented as a 
json/html process that communicates with several data base processes, 
maybe using a security process as well.
They will communicate with mostly small message of size <100 bytes say. 
At the same time, there could be huge file tansfers or backups of the 
data base processes going on.
Any example suffices. But the point is that the small task should finish 
rather fast independently of when it is started, and indepedently of 
what the large task is doing.
This is basically a rationale for multi tasking and context switching. 
And the Erlang vm does this for processes, and everybody agrees it is a 
good thing.
Now muylti tasking only works if all aspects of the task (computation) 
can go ahead without being blocked. You can context switch as much as 
you want, but it doesn't help if
the fast task gets stuck waiting for a message pass. so it is essential 
that the large messages can be preempted, and that the bandwidth can be 
assigned to the fast task.
It is probably obvious what I am saying. All parts of the computation 
must switch between tasks including the message passing between nodes on 
distinct computers.
Otherwise you get a bottleneck.

It would be like context switching in the registers and CPU, but the 
memory bus saying, "sorry CPU, you cannot get the data you are 
requesting now, because I am still transferring data for the previous 
process, the one you just preempted".



On 9/12/10 7:51 PM, Jan Huwald wrote:
> Am Sunday 12 September 2010 12:48:38 schrieb Morten Krogh:
>>    Hi Erlangers.
>> During some test with node to node communication, I sent a large binary
>> from a process on node A
>> to a process on another node, node B. I also sent some smaller messages
>> from other processes on node A to other
>> processes on node B. It turned out that the large message blocked the
>> later messages. Furthermore, it even blocked
>> the net tick communication, so node A and B disconnected from each other
>> even though the large message was being transferred!
>> After looking a bit around, I have come to the understanding that Erlang
>> uses one tcp connection between two nodes, and messages are sent
>> sequentially from the sending node A to the receiving node.
>> If that is correct, I think some improvements are needed.
> IMO the programmer has to take precautions on its own, if he expetcs to handle
> large messages. This is analogous to large memory requirements of single
> process. Erlang is not well suited for one of both, natively - which is good,
> because it keeps things simple.
> In case of large messages (or large process heaps) tradeoffs have to be made.
> Your proposed solution is one example for (a) the fact that these tradeoffs
> can be excluded from the language core (b) involving tradeoffs not everybody
> is agreeing on.
>> The problem to solve is basically that small messages, including the net
>> tick, should get through more or less independently of
>> the presence of large messages.
>> The simplest would be to have several connections, but that doesn't
>> fully solve the problem. A large message will still take up
>> a lot of the hardware bandwidth even on another tcp connection.
>> My suggestion is something like the following.
>> For communication between node A and node B, there is a process (send
>> process) on each node, that coordinates all messages. The send process
>> keeps queues of different priorities around, e.g., a high priority,
>> medium priority and low priority. Messages are split up into fragments of
>> a maximum size. The receiver(node B) send process assembles the
>> fragments into the original message and delivers it locally to the
>> right process. The fragments ensure that no single transfer will occupy
>> the connection for very long.
>> There will be a function send_priority where the user can specify a
>> priority. The usual send will default to medium, say.
>> Net tick will use high priority, of course. Small messages that are
>> needed to produce a web application response can have high priority.
>> File transfers
>> for backup purposes can have low priority.
>> The send process then switches between the queues in some way, that
>> could be very similar to context switching priorities.
>> More advanced, the send processes could occasionally probe the
>> connection with packets to estimate latency and bandwidth. Those figures
>> could then be used
>> to calculate fragment sizes. High bandwidth, high latency would require
>> large fragments. Low bandwidth, low latency small fragments for instance.
>> There could even be a function send_estimated_transfer_time that sends a
>> message and has a return value of estimated transfer time, which could
>> be used in
>> a timeout in a receive loop.
> Actively probing seems to me like a recipe for far more failures than it
> offers benefit. Passively probing might be ok, if the run-time overhead is
> small. But an actual implementation, which does not get biased by
> nonstationary activity of the application, seems quite complex to me.
>> I have actually implemented my own small module for splitting messages
>> into fragments, and it solves the issues; net tick goes through, and small
>> messages can overtake large ones.
>> There is of course an issue when the sending and receiving process is
>> the same for several messages. Either the guaranteed message order
>> should be given up, or the
>> coordinators should keep track of that as well. Personally, I think
>> guaranteed message order should be given up. Erlang should model the
>> real world as
>> much as possible, and learn from it. In the real world, two letters
>> going from person A to person B, can definitely arrive in the opposite
>> order
>> of the one in which they were sent. And as node to node communication
>> will be over larger and larger distances, it is totally unnatural to
>> require
>> a certain order.
>> I am relatively new to Erlang and I really enjoy it. Kudos to all involved!
>> Cheers,
>> Morten Krogh.
> Regards,
> Jan
> ________________________________________________________________
> erlang-questions (at) erlang.org mailing list.
> See http://www.erlang.org/faq.html
> To unsubscribe; mailto:erlang-questions-unsubscribe@REDACTED

More information about the erlang-questions mailing list