Advantages of a large number of threads cf other approaches?

Joachim Durchholz joachim.durchholz@REDACTED
Wed Feb 18 12:15:13 CET 2004


Shawn Pearce wrote:

> Joachim Durchholz <joachim.durchholz@REDACTED> wrote:
> 
>> AFAIK the overhead of a Unix process over a thread is stuff like
>> the command line, environment variables, and a table of file
>> descriptors. I don't know how the exact layout is, but this
>> shouldn't amount to more than a handful of KBytes. That's still
>> more than the CPU context that must be swapped to RAM when
>> switching threads, but not too much - the CPU context of modern
>> processors typically runs to several hundred bytes or more.
> 
> This is quite correct, but really the process overhead goes into:
> 
> - file descriptor table
> - resource limit management and tracking
> - environment (which is in userland, not in the kernel)
> - command line (also in userland, not in kernel)

The above is the true per-process overhead.
The following applies whether each Erlang process is mapped to a full
OS process or to a thread:

> - page tables (in kernel, and can be a memory hog)

Agreed, but code sharing would (could?) apply to both. No difference here.

> - private data pages holding global variables

This is entirely independent of whether you're doing threads or processes.

There's another potential overhead: the run-time library. To make sure
that it doesn't take up memory in every Unix process, it's probably
best to put it into a shared library.

> When you have just a handful of globals, the data page which holds
> them can be mostly empty.  This really sucks up memory quickly when
> you spawn a large number of the same processes.

Agreed.
If all threads are reasonably active, this is a small problem - the CPU
will become a bottleneck long before RAM hits any limits. This applies
to typical web servers like Yaws.
Of course, if you have 100,000s of threads just waiting for a signal,
then things are different. I don't know whether this situation is
typical - anybody able to provide data points in this direction?
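
In case anybody wants to gather such a data point for the Erlang side,
here's a minimal sketch (module name and approach are my own invention,
a starting point rather than a benchmark) that spawns N idle processes
and asks the VM what they cost:

-module(idle_procs).
-export([measure/1]).

%% Spawn N processes that do nothing but wait for a message, then
%% report how much memory the VM attributes to processes overall.
measure(N) ->
    Before = erlang:memory(processes),
    Pids = [spawn(fun() -> receive stop -> ok end end)
            || _ <- lists:seq(1, N)],
    After = erlang:memory(processes),
    io:format("~p idle processes: ~p bytes total, ~p bytes each~n",
              [N, After - Before, (After - Before) div N]),
    [P ! stop || P <- Pids],
    ok.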

> The CPU context is associated with the kernel level thread, so it's
> going to be there no matter which you are using, a process or a
> kernel thread.  Interestingly, one of the big performance killers
> during process context switches is the loss of the TLB cache.  The
> TLB caches the most frequently accessed part of the page tables, and
> modern processors just don't have enough TLB cache space.  Some
> require the entire TLB to be tossed whenever the kernel switches
> processes, some can cache parts of the TLB in case the process comes
> back quickly.  But thus far I've seen that TLB misses can easily kill
> your performance.

TLB issues enter the picture iff you consider running the Erlang
interpreter in multiple OS processes.
When comparing "interpreter with its own scheduler" to "machine code
with OS scheduler", the TLB is just one of the things that make machine
code execution faster - TLB misses eat up part of the performance
advantage of machine code, but machine code cannot actually end up
slower (well, it could in really pathological circumstances). (I'm
wondering whether CPUs that drop the entire TLB on every context
switch should be considered mainstream or broken. Such a policy would
certainly hurt interrupt performance.)

> One reason Erlang (and any other very tight user space
> implementation) runs so well is that it can eliminate all of the
> memory overheads, making an entire process fit into just a cache line
> or two in the processor's data cache.  When an entire Erlang process
> can get swapped in by just one or two memory loads, and has no TLB
> miss penalties, etc., life can be very, very good.

Yup.
However, if an emulator can beat machine code merely because machine
code pays for CPU context switches, the CPU design is very, very fishy :-)

> Also due to the effects of the relatively small i-cache on some
> processors, a well written emulator can in fact run faster than
> native code on very big applications.  If the application has very
> poor cache locality in its instruction stream (randomly jumping to
> different functions) the i-cache can have a hard time keeping up.
> But if you move all of the code into the much larger d-cache (as in
> the case of an emulator) you can make the emulator fly as a much
> higher percentage of the application can be stored in the d-cache.

That's exactly the "fishy" case :-)

>>> You end up filling up the memory too quickly and soon start
>>> deteriorating performance.
>> 
>> I don't think so - if you take a server with 1 GB RAM, the process
>> vs. thread overhead will cut the number of processes you can serve
>> from millions to hundreds of thousands (under the worst-case
>> assumption that threads don't use any local data of their own). As
>> soon as every thread allocates a few KBytes of memory, the memory
>> overhead diminishes to a factor of two, and it decreases further as
>> each thread uses more memory. But even in the worst case, I suspect
>> that the true bottleneck is the CPU, not RAM.
> 
> It's more like RAM bandwidth.  Your CPU is most likely stalling on all
> of the context switches due to the amount of data it must keep
> swapping on and off of the chip core.  Cycling through a bunch of
> registers ain't cheap.  And whacking your pipeline on a very deeply
> pipelined processor is no walk in the park either.  Then take into
> account the d-cache, i-cache and TLB misses you incur on each switch,
> and things go downhill very fast.

The numbers given in the paper tell me otherwise.

Let me give a few ballpark figures:
Assume a context switch frequency of 1 kHz, and an execution rate of 1
billion instructions per second. This means that a process gets 1
million instructions before the next context switch.
Assuming that saving and restoring all that CPU state (registers, shadow
registers, pipeline, caches, TLB contents) takes the equivalent of 1,000
instructions of RAM bandwidth, the performance hit is 0.1 percent -
nothing that I'd worry much about.
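
Written out as Erlang shell input, the same back-of-envelope sum:

1> SwitchHz = 1000.             % context switches per second
1000
2> InstrPerSec = 1000000000.    % instructions per second
1000000000
3> SwitchCost = 1000.           % instructions' worth of saved state
1000
4> SwitchHz * SwitchCost / InstrPerSec.
0.001                           % i.e. 0.1 percent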

Of course, these figures are just inventions of mine; CPU state
switching is probably around the order of magnitude that I have given,
but I have no idea of a typical time slice length in a modern OS.
OTOH it's probably safe to assume that the OS programmers have a good
idea about the context switch overhead and will adjust time slice size
so that the context switch overhead is small.

In an Erlang context, things are a bit different again. If every process
does just a tiny bit of processing and goes into a wait long before it
has used up its time slice, then process context switching becomes
important.
Again, I don't know how this works out for typical Erlang applications.
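
For what it's worth, here's a minimal sketch (mine, and not a proper
benchmark) that times the VM's switch-plus-message-pass cost by
bouncing a message between two processes:

-module(pingpong).
-export([run/1]).

%% Time N message round trips between two processes; every receive
%% forces the VM scheduler to switch processes.
run(N) ->
    Echo = spawn(fun loop/0),
    {Micros, ok} = timer:tc(fun() -> ping(Echo, N) end),
    Echo ! stop,
    io:format("~p round trips in ~p us (~.3f us each)~n",
              [N, Micros, Micros / N]).

ping(_Echo, 0) -> ok;
ping(Echo, N) ->
    Echo ! {self(), ping},
    receive pong -> ping(Echo, N - 1) end.

loop() ->
    receive
        {From, ping} -> From ! pong, loop();
        stop -> ok
    end.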

> Of course, there are applications (like Mozilla!) that will just
> consume all memory on your machine, and then some, so you better not
> run multiple copies of them at once.  :-)

I don't know why I would ever want to run multiple copies of Mozilla ;-P
But if you meant Apache: it's indeed running multiple OS processes. I
don't think it's eating up much RAM that way - everything is in shared
libraries, and the OS per-process overhead is small compared to the
memory footprint of all that slice'n-dice functionality built into Apache.

I'd really, really like to see how things work out on Linux 2.6 when
comparing Yaws and Apache - some factors favor Yaws, some favor Apache,
and I simply don't know which of them weighs in more.

>> The quoted IBM paper gave numbers for a concrete test run: a server
>> with 9 GB of RAM and eight 700-MHz processors had a near-100% CPU
>> usage, but just 36% memory usage (when serving 623 pages per
>> second). More interesting is that server performance increased by a
>> factor of six (!). Given that Yaws performance was ahead of Apache
>> by a factor of roughly 2.5 (at least on the benchmarks posted by
>> Joe), it would be very interesting to see how much Yaws profits
>> from the new Linux kernel.
> 
> Well, given that erts is bound to a single processor, you would need
> to create a cluster of erts nodes, all running yaws, with some type
> of load balancing front end.

AFAIK there's already a load balancing package for web services, written
in Erlang. (Sorry I forgot the name...)

It would be interesting to see how things work out on a uniprocessor
machine though. It's a strong selling point if you can say "you don't
need an SMP machine for much higher loads if you use Yaws instead of
Apache".

Regards,
Jo
--
Currently looking for a new job.



