[erlang-questions] How much load can supervisors handle?

Fri Oct 26 07:44:07 CEST 2012

> Sounds like you should make your own custom supervisors and not use the
> "standard" supervisor. The above suggests that you need a process
> management layer.
> 
> I'm very reluctant to pre-guess anything about performance - best is
> to write the clearest possible code - run and measure - things like
> more memory or an SSD have enormous impact on efficiency. For some
> applications the difference between 4G and 8G of memory makes a large
> difference - for others no difference at all. It all depends.
> 
> When you say "thousands of processes" I have no idea if this means
> "thousands of tiny processes with 1K stacks and heaps" or "thousands of
> processes with stacks and heaps of tens of MBytes" - the difference
> (and the architectures) is huge.
> 
> This is why there is no alternative to "code and measure".
> 
> Unfortunately logic cannot be applied
> 
> if P takes time A and Q takes time B
> how long does P+Q take?
> 
> This is not a science. P+Q should take A+B on a sequential computer,
> and max(A,B) on a parallel computer. But this is not the case.
> 
> Performance estimation is a black art - the only thing I know is the old truth
> "parsing inputs" is slow.
> 
> Cheers
> 
> /Joe

Joe,
You are, of course, absolutely right about the need for testing, and I've set aside a lot of time for it so that I can tune the system as needed. I also know better than to ask without providing detailed information, and I really should have included the system parameters. Still, you and others have given me the information I needed, and I certainly have some places to start, since I was primarily concerned with the higher-level concepts. I just don't know enough about the inner workings of the language to anticipate some of the performance characteristics, and I wanted to gauge whether I was thinking about the problem correctly.
Chris.
ps. (Yes, this means you can stop reading now :) )
If anyone is at all interested, here's what I know about the system:
It is not at all unreasonable to expect 20,000 long-lived workers. Long-lived means several seconds to infinity, not counting code upgrades, outages, etc. The assumption is that roughly half would be permanent, another quarter or so would last from a few hours to almost a day at the extreme end, and the rest would fall in the seconds-to-minutes range. These workers will have stacks/heaps of no more than 5K, with probably 90% having 2K or less.
It's also not unreasonable to expect short-lived (microseconds) workers numbering from several hundred, bursting up to 20,000. In the vast majority of cases peak load would be sustained for up to several hours, with extreme instances lasting for weeks on end. In these last cases, however, there would still be some variation during the day, so while the system would not technically be 'peaked' the whole time, its load at the low points would be much higher than is expected for the majority of the time. These workers will have stacks/heaps that vary by quite a bit. Since trying to estimate user input patterns without specific data from an already built system is pointless (I don't even have the infrastructure for it yet, obviously), I just don't have anything solid to go on. However, my gut tells me that the majority will remain under 1k, while a still significant minority will fall in the 2k-5k range, in increasingly smaller numbers the higher you go, though probably not on a linear scale.
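As a rough way to put numbers on the estimates above, the following is a minimal sketch of the "code and measure" approach: it spawns N idle processes, times the spawning, and averages what process_info/2 reports as memory per process. The module name spawn_probe and the function run/1 are hypothetical, and idle processes say nothing about real workers carrying real state, so the output is only a baseline.

```erlang
-module(spawn_probe).
-export([run/1]).

%% Hypothetical probe: spawn N idle processes, then report the wall-clock
%% spawn time and the average memory per process as seen by the VM.
run(N) when is_integer(N), N > 0 ->
    %% timer:tc/1 returns {Microseconds, ResultOfFun}.
    {Micros, Pids} = timer:tc(fun() ->
        [spawn(fun idle/0) || _ <- lists:seq(1, N)]
    end),
    %% process_info(Pid, memory) gives {memory, Bytes}; the size covers the
    %% stack, heap and internal structures. Dead processes return
    %% 'undefined' and are skipped by the pattern in the generator.
    Total = lists:sum([Bytes || Pid <- Pids,
                                {memory, Bytes} <- [erlang:process_info(Pid, memory)]]),
    %% Tell every probe process to terminate.
    lists:foreach(fun(Pid) -> Pid ! stop end, Pids),
    [{count, N}, {spawn_micros, Micros}, {avg_bytes, Total div N}].

%% An idle process: block until told to stop.
idle() ->
    receive
        stop -> ok
    end.
```

Calling spawn_probe:run(20000) from the shell gives a first data point. On releases whose default process limit is lower than the target count, the +P emulator flag raises it.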
Outside of a few supporting applications such as os_mon or a TCP/HTTP server, the vast majority of the work will be done by one worker type, hence the requirement for one type of supervisor handling up to 40k of the same type of worker with a lot of churn. This is also going to facilitate, at some point, distributing the workers across multiple physical nodes with an eye to linear scalability and redundancy. There's more, but those are, I think, all the primary requirements that are going to shape the way most of the system is structured.
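For reference, the "one supervisor, many identical workers" shape can be sketched with the standard simple_one_for_one strategy. This is only a baseline to measure against, not the custom process-management layer suggested in the quoted reply. The module name worker_sup, the placeholder worker loop, and the restart intensity of 10 restarts in 10 seconds are all assumptions made for illustration.

```erlang
-module(worker_sup).
-behaviour(supervisor).

-export([start_link/0, start_worker/1, count_workers/0]).
-export([init/1, worker_start_link/1]).

%% Start the supervisor under a locally registered name.
start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Start one more worker; Arg is appended to the child spec's argument list.
start_worker(Arg) ->
    supervisor:start_child(?MODULE, [Arg]).

%% Number of currently running children.
count_workers() ->
    proplists:get_value(active, supervisor:count_children(?MODULE)).

init([]) ->
    %% Assumed restart intensity: at most 10 restarts within 10 seconds.
    SupFlags = {simple_one_for_one, 10, 10},
    %% 'temporary' children are never restarted, which suits workers that
    %% come and go with a lot of churn.
    ChildSpec = {worker,
                 {?MODULE, worker_start_link, []},
                 temporary, 5000, worker, [?MODULE]},
    {ok, {SupFlags, [ChildSpec]}}.

%% Placeholder worker: a linked process that holds its state until stopped.
worker_start_link(Arg) ->
    {ok, proc_lib:spawn_link(fun() -> worker_loop(Arg) end)}.

worker_loop(State) ->
    receive
        stop -> ok;
        _Other -> worker_loop(State)
    end.
```

Starting 40k workers through worker_sup:start_worker/1 and timing it is one way to see whether a single standard supervisor is actually the bottleneck before replacing it with something custom.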
If anyone has any thoughts, wants to tell me I'm crazy, knows of some resources, or feels they just wasted two minutes of their life reading that and wants them back (Hey, I told you that you could stop reading! I don't do refunds.), feel free.