Advantages of a large number of threads cf other approaches?

Shawn Pearce spearce@REDACTED
Wed Feb 18 01:51:49 CET 2004


Joachim Durchholz <joachim.durchholz@REDACTED> wrote:
> AFAIK the overhead of a Unix process over a thread is stuff like the 
> command line, environment variables, and a table of file descriptors. I 
> don't know how the exact layout is, but this shouldn't amount to more 
> than a handful of KBytes.
> That's still more than the CPU context that must be swapped to RAM when 
> switching threads, but not too much - modern processors typically take 
> several hundred bytes or more.

This is quite correct, but really the process overhead goes into:

- file descriptor table
- resource limit management and tracking
- environment (which is in userland not in the kernel)
- command line (also in userland, not in kernel)
- page tables (in kernel, and can be a memory hog)
- private data pages holding global variables

When you have just a handful of globals, the data page which holds them can be
mostly empty.  This sucks up memory quickly when you spawn a large number of
copies of the same process.

The CPU context is associated with the kernel level thread, so it's going to
be there no matter which you are using, a process or a kernel thread.  Interestingly,
one of the big performance killers during process context switches is the loss
of the TLB cache.  The TLB caches the most frequently accessed part of the page
tables, and modern processors just don't have enough TLB space.  Some
require the entire TLB to be flushed whenever the kernel switches processes;
some can keep entries around in case the process comes back quickly.  Either
way, I've seen that TLB misses can easily kill your performance.

One reason Erlang (like any other very tight user space implementation) runs
so well is that it eliminates all of those memory overheads, letting an
entire process fit into just a cache line or two in the processor's data cache.
When an entire Erlang process can be swapped in with just one or two memory
loads, and has no TLB miss penalties, etc, life can be very, very good.
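
As a rough illustration (the exact figure depends on the emulator release,
word size and minimum heap settings, so treat the numbers as ballpark only),
you can ask the emulator itself what an idle process costs:

	%% Spawn N idle processes and report the average number of bytes
	%% the emulator charges to each one.
	report(N) ->
	    Pids = [spawn(fun() -> receive stop -> ok end end)
	            || _ <- lists:seq(1, N)],
	    Bytes = [element(2, process_info(P, memory)) || P <- Pids],
	    io:format("~w processes, ~w bytes each on average~n",
	              [N, lists:sum(Bytes) div N]),
	    [P ! stop || P <- Pids],
	    ok.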

Also, due to the relatively small i-cache on some processors, a well written
emulator can in fact run faster than native code on very big applications.
If the application has very poor cache locality in its instruction stream
(randomly jumping between functions), the i-cache can have a hard time keeping
up.  But if you move all of the code into the much larger d-cache (as happens
with an emulator's bytecode), the emulator can fly, because a much higher
percentage of the application fits in the d-cache.

Now this is easily worked around by using tools to reorder the native code
functions in the huge application so that they occur in execution order.  I've
seen that give a 40% performance boost (or more!) on x86 processors.  If
you do this, you should easily beat the emulator.  :-)

> >You end up filling up the memory too quickly and soon start deterioating
> >performance.
> 
> I don't think so - if you take a server with 1 GB RAM, the process vs. 
> thread overhead will cut the number of processes you can serve from tens 
> of billions to billions (under the worst-case assumption that threads 
> don't use any local data on their own).
> As soon as every thread allocates a KByte of memory, the memory overhead 
> diminishes to a factor of two, and it decreases further as each thread 
> uses more memory.
> But even in the worst case, I suspect that the true bottleneck is the 
> CPU, not RAM.

It's more like RAM bandwidth.  Your CPU is most likely stalling on all of the
context switches due to the amount of data it must keep swapping on and off
of the chip core.  Cycling through a bunch of registers ain't cheap.  And
whacking your pipeline on a very deeply pipelined processor is no walk in the
park either.  Then take into account the d-cache, i-cache and TLB misses you
incur on each switch, and things go downhill very fast.

Of course, there are applications (like Mozilla!) that will just consume all
memory on your machine, and then some, so you better not run multiple copies
of them at once.  :-)

> The quoted IBM paper gave numbers for a concrete test run: a server with 
> 9 GB of RAM and eight 700-MHz processors had a near-100% CPU usage, but 
> just 36% memory usage (when serving 623 pages per second).
> More interesting is that server performance increased by a factor of
> six (!). Given that Yaws performance was ahead of Apache by a factor of 
> roughly 2.5 (at least on the benchmarks posted by Joe), it would be very 
> interesting to see how much Yaws profits from the new Linux kernel.

Well, given that erts is bound to a single processor, you would need to
create a cluster of erts nodes, all running yaws, with some type of load
balancing front end.  This is one area Apache really shines in, as it makes
this easy to set up: because Apache is multi-process already, it can easily
share the single TCP server socket with all of its siblings and decide who
gets the next request.
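
A very crude sketch of such a front end in Erlang follows; it's a toy (raw
byte relay, invented port numbers, no error handling), not how I'd build a
real one, but it shows the shape of the thing:

	%% Toy round-robin TCP front end: listen on one port and relay each
	%% connection to one of several yaws nodes on the same box.  The
	%% backend ports are made up for the example.
	start() ->
	    {ok, Listen} = gen_tcp:listen(8080, [binary, {packet, 0},
	                                         {active, false},
	                                         {reuseaddr, true}]),
	    accept_loop(Listen, [8081, 8082, 8083, 8084]).

	accept_loop(Listen, [Port | Rest]) ->
	    {ok, Client} = gen_tcp:accept(Listen),
	    Pid = spawn(fun() -> relay(Port) end),
	    ok = gen_tcp:controlling_process(Client, Pid),
	    Pid ! {go, Client},
	    accept_loop(Listen, Rest ++ [Port]).

	relay(Port) ->
	    %% Wait until we own the client socket before turning it active.
	    receive {go, Client} -> ok end,
	    {ok, Server} = gen_tcp:connect("localhost", Port,
	                                   [binary, {packet, 0}, {active, true}]),
	    ok = inet:setopts(Client, [{active, true}]),
	    shuffle(Client, Server).

	%% Copy bytes in both directions until either side closes.
	shuffle(Client, Server) ->
	    receive
	        {tcp, Client, Data} ->
	            gen_tcp:send(Server, Data),
	            shuffle(Client, Server);
	        {tcp, Server, Data} ->
	            gen_tcp:send(Client, Data),
	            shuffle(Client, Server);
	        {tcp_closed, _} ->
	            gen_tcp:close(Client),
	            gen_tcp:close(Server)
	    end.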

Does anyone think it might be possible to modify gen_tcp in such a way that
we could use multiple nodes on the same system, all bound to the same TCP port,
with some sort of accept lock between them?  I'd think it could look
something like this:

	% Set up a server socket, but let it be shared by this Erlang node and all
	% other processes on this box.
	gen_tcp:accept(... [shared])

	% Have this node take over accepting all new connections.  This just pokes
	% the socket into the erts event loop.
	gen_tcp:enable_accept(Port)

	% Have this node stop accepting new connections.  This just removes the
	% socket from the erts event loop.
	gen_tcp:disable_accept(Port)

It might be necessary, however (for performance reasons), to let the low level
C driver also perform its own accept locking, using a SysV IPC semaphore, flock,
fcntl, etc., on top of the Erlang managed enable and disable.  If the socket is
enabled, the driver should attempt to grab the accept lock, and only
when it wins the lock does it put the socket into the erts event loop.  Clearly
this might be difficult, as the driver cannot block while trying to grab the
accept lock.

Note that I believe Linux doesn't require the accept lock... I think its accept
hands each new connection to only one of the waiting processes.  But I'm not
positive.
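
For completeness: within a single node something like this already works, as
far as I know, since several Erlang processes can sit in gen_tcp:accept/1 on
the same listen socket and each incoming connection goes to exactly one of
them.  Roughly:

	%% Pool of acceptors sharing one listen socket inside one node.
	%% The handler below is just a toy that answers and hangs up.
	start(NumAcceptors) ->
	    {ok, Listen} = gen_tcp:listen(8080, [binary, {active, false},
	                                         {reuseaddr, true}]),
	    [spawn(fun() -> acceptor(Listen) end)
	     || _ <- lists:seq(1, NumAcceptors)],
	    ok.

	acceptor(Listen) ->
	    {ok, Sock} = gen_tcp:accept(Listen),
	    gen_tcp:send(Sock, <<"HTTP/1.0 200 OK\r\n\r\nhello\r\n">>),
	    gen_tcp:close(Sock),
	    acceptor(Listen).

What the shared/enable/disable proposal would buy us is the same behaviour
across separate erts nodes, so each node could sit on its own processor.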

-- 
Shawn.

  Velilind's Laws of Experimentation:
  	(1) If reproducibility may be a problem, conduct the test only once.
  	(2) If a straight line fit is required, obtain only two data points.


