Performance of selective receive

Mon Nov 14 03:15:19 CET 2005

On 14 Nov 2005, at 01:33, Pascal Brisset wrote:

> Sean Hinde writes:
>
>
>> No. This is only true if the server actually blocks. By the use of
>> gen_server:reply/2 you can make every caller believe that it has made
>> exclusive use of the server, but in fact the server can simply pass
>> the messages straight through (maybe updating some local internal
>> state on the way through).
>
> Agreed.  But this boils down to passing requests to the backend
> as fast as possible, hot-potato style, and hoping that it can
> cope with lots of pending requests itself.

Coping with genuine overload (i.e. you have exceeded the total CPU
capacity of the system) is a different requirement to avoiding
creating your own overload. If the CPU can't keep up then so be it.

>
>
>> Yes, that's fine, that is most likely required in such scenarios.
>> This is a different case to waiting for many seconds for some other
>> system to respond. The erlang scheduler is fair given half a chance.
>> In this case that means replacing the cast to your internal server
>> with a call.
>
> So here is another demo with synchronous calls instead of casts:
> 1000 servers, each sending 10 synchronous requests to the server
> every second.  On my PC it starts at 10 % CPU load and 10000 msg/s.
> Then the backend is paused for one second.  Afterward, the program
> stabilizes at 100 % CPU and 600 msg/s, with a huge message queue.
> Maybe that's a problem with my system.  Someone please confirm.

No, it is still a problem with the design. There is a difference
between an internal "backend" simply blocking itself for 1 second,  
and an internal server which is sending requests to an external  
system which pauses for 1 second. In the second case you can design  
the internal server to not block itself. If it exceeds the maximum  
number of requests allowed to be outstanding towards the "external"  
backend, then it can just return an immediate error.
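
A minimal sketch of that idea, not taken from any real system: a
gen_server that forwards each call to the backend, returns noreply so it
never blocks, and answers the caller later with gen_server:reply/2. The
module name capped_proxy, the {request, Pid, Ref, Req} and
{response, Ref, Result} message shapes, and the {error, overloaded}
return value are all assumptions made for illustration.

```erlang
-module(capped_proxy).
-behaviour(gen_server).

-export([start_link/2, request/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% Backend: pid of the (hypothetical) process fronting the external system.
%% Max: maximum number of requests allowed to be outstanding towards it.
start_link(Backend, Max) ->
    gen_server:start_link(?MODULE, {Backend, Max}, []).

%% Looks like an ordinary blocking call from the caller's point of view.
request(Server, Req) ->
    gen_server:call(Server, {request, Req}).

init({Backend, Max}) ->
    {ok, #{backend => Backend, max => Max, pending => #{}}}.

%% Too many requests outstanding: refuse immediately instead of queueing.
handle_call({request, _Req}, _From, #{max := Max, pending := P} = S)
  when map_size(P) >= Max ->
    {reply, {error, overloaded}, S};
%% Otherwise pass the request straight through and remember who asked.
handle_call({request, Req}, From, #{backend := B, pending := P} = S) ->
    Ref = make_ref(),
    B ! {request, self(), Ref, Req},      % assumed backend protocol
    {noreply, S#{pending := P#{Ref => From}}}.

handle_cast(_Msg, S) ->
    {noreply, S}.

%% The backend answered: reply to the original caller and free the slot.
handle_info({response, Ref, Result}, #{pending := P} = S) ->
    case maps:take(Ref, P) of
        {From, P1} ->
            gen_server:reply(From, {ok, Result}),
            {noreply, S#{pending := P1}};
        error ->
            {noreply, S}
    end;
handle_info(_Other, S) ->
    {noreply, S}.
```

The server itself uses no selective receive, so a slow backend shows up
as {error, overloaded} replies rather than as a growing message queue in
the middle of the system.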

Your new example looks mostly like a test designed to expose the
"message queue backlog problem" of Erlang, which no one denies is
present.

>
>
>> It is pretty unlikely that all 1000 servers sent their
>> requests at exactly the same moment, and even if they did, the system
>> would recover quickly, not spiral into meltdown as it would in the
>> async case.
>
> I claim that if the server loop has several receive statements,
> one of which is a selective receive, then as the message queue grows,
> each loop becomes more expensive.  If this extra burden exceeds
> whatever spare CPU capacity the system had initially, it may fail
> to recover.

I see that. I guess you just have to be selective about where you  
apply this mechanism ;-) Maybe, just maybe, using selective receive  
right in the centre of the system in a critical server is not the  
right place :)
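
For anyone who wants to see the effect Pascal describes on their own
machine, here is a rough sketch. The module name selective_cost and the
{request, I} and {urgent, done} messages are made up for the example.
It queues N messages that the receive does not match, followed by one
that it does, and times how long the selective receive takes to reach
it. No numbers are quoted here because they depend on the machine.

```erlang
-module(selective_cost).
-export([time_selective/1]).

%% Queue N non-matching messages followed by one matching message in
%% the calling process's mailbox, then time the selective receive.
%% Returns the elapsed time in microseconds.
time_selective(N) ->
    Self = self(),
    [Self ! {request, I} || I <- lists:seq(1, N)],
    Self ! {urgent, done},
    %% The receive only matches {urgent, _}, so every {request, _}
    %% message ahead of it has to be examined and skipped first.
    {Micros, done} = timer:tc(fun() -> receive {urgent, X} -> X end end),
    flush(),
    Micros.

%% Empty the mailbox so repeated runs start from the same state.
flush() ->
    receive _ -> flush() after 0 -> ok end.
```

Calling selective_cost:time_selective/1 with increasing N (say 1000,
then 100000) should show the cost of a single receive growing with the
backlog. A server loop that does this once per request pays that cost
on every pass, which is the spiral described above.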

>
>
>> No. If that happens the problem is that the Erlang program is not
>> correctly designed.
>
> I agree.  I am only saying that this design error is not trivial.
> Once you are aware of it, a lot of strange fluctuations in CPU load
> begin to make sense, and you can fix things.  It would be even better
> if we could optimize selective receive so as to remove the mere
> possibility of running into trouble.

If this is possible without losing messages or the elegant simplicity
of Erlang then I agree. Otherwise AXD-301 stands as a stunning
example that such a system can be designed correctly.

One thing I have used before is the overload module of OTP. If you  
can shed input load before the CPU gets 100% busy then the response  
to excessive demand can be made very clean and well behaved.
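
To illustrate the general idea of shedding load at the edge, here is a
hand-rolled sketch. It does not use the overload module. It swaps in a
much cruder check, the length of the target server's message queue,
and the module name shed, the MaxQueue threshold and the error atoms
are assumptions for the example. It also assumes Server is a local pid,
since erlang:process_info/2 only works on local processes.

```erlang
-module(shed).
-export([call_or_shed/3]).

%% Refuse the request up front if the server's mailbox is already
%% MaxQueue messages deep. MaxQueue is a hypothetical, operator-chosen
%% threshold. Server must be a local pid.
call_or_shed(Server, Req, MaxQueue) when is_pid(Server) ->
    case erlang:process_info(Server, message_queue_len) of
        {message_queue_len, Len} when Len >= MaxQueue ->
            %% Shed the request before it adds to the backlog.
            {error, overloaded};
        {message_queue_len, _Len} ->
            {ok, gen_server:call(Server, Req)};
        undefined ->
            %% process_info/2 returns undefined for a dead process.
            {error, noproc}
    end.
```

The check costs the caller a little on every request, but a rejected
request never reaches the busy server's mailbox, so the server is not
asked to do extra work exactly when it has the least capacity to spare.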

Sean