Advantages of a large number of threads cf other approaches?

Ulf Wiger ulf.wiger@REDACTED
Thu Feb 19 17:56:48 CET 2004


On Tue, 17 Feb 2004 19:51:49 -0500, Shawn Pearce <spearce@REDACTED> wrote:


> Also due to the effects of the relatively small i-cache on some
> processors, a well-written emulator can in fact run faster than
> native code on very big applications.  If the application has very
> poor cache locality in its instruction stream (randomly jumping to
> different functions), the i-cache can have a hard time keeping up.
> But if you move all of the code into the much larger d-cache (as
> in the case of an emulator), you can make the emulator fly, as a
> much higher percentage of the application can be stored in the
> d-cache.

Some work has been done on improving these aspects of BEAM, partly
due to Thomas Lindgren's work on low-level profiling of AXD 301
code, compiled with HiPE, JAM and BEAM. Björn G. took some of
Thomas's findings to heart and improved the cache hit ratio.
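
To make the emulator argument concrete, here is a minimal sketch in
Erlang (nothing like BEAM's actual dispatch loop, just an
illustration): the interpreter itself is a small, hot piece of code
that can stay resident in the i-cache, while the program it runs is
plain data and therefore lives in the much larger d-cache:

	%% mini_vm: a toy stack-machine interpreter.  The dispatch
	%% clauses below are tiny (i-cache friendly); the instruction
	%% list they walk is ordinary data (d-cache resident).
	-module(mini_vm).
	-export([run/2]).

	run([{push, X} | Is], Stack)    -> run(Is, [X | Stack]);
	run([add | Is], [A, B | Stack]) -> run(Is, [A + B | Stack]);
	run([], [Top | _])              -> Top.

	%% mini_vm:run([{push,1}, {push,2}, add], []) returns 3.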


> Now this is easily worked around by using tools to reorder your
> native code functions in the huge application such that they occur
> in execution order.  I've seen this easily give a 40% performance
> boost (or more!) on x86 processors.  If you do this, you should
> easily beat the emulator.  :-)

I'm sure, but I've had reason to look at some low-level profiling
of large applications written in Rational RoseRT, with a mix of
generated and hand-written C++. Lots of pipeline stalls and cache
misses.


>
>> > You end up filling up the memory too quickly and soon start
>> > deteriorating performance.
>>
>> I don't think so - if you take a server with 1 GB RAM, the process vs.
>> thread overhead will cut the number of processes you can serve from tens
>> of billions to billions (under the worst-case assumption that threads
>> don't use any local data on their own).
>> As soon as every thread allocates a KByte of memory, the memory overhead
>> diminishes to a factor of two, and it decreases further as each thread
>> uses more memory.
>> But even in the worst case, I suspect that the true bottleneck is the
>> CPU, not RAM.
>
> It's more like RAM bandwidth.  Your CPU is most likely stalling on
> all of the context switches due to the amount of data it must keep
> swapping on and off of the chip core.  Cycling through a bunch of
> registers ain't cheap.  And whacking your pipeline on a very deeply
> pipelined processor is no walk in the park either.  Then take into
> account the d-cache, i-cache and TLB misses you incur on each
> switch, and things go downhill very fast.
>
> Of course, there are applications (like Mozilla!) that will just
> consume all memory on your machine, and then some, so you better
> not run multiple copies of them at once.  :-)
>
>> The quoted IBM paper gave numbers for a concrete test run: a server with
>> 9 GB of RAM and eight 700-MHz processors had a near-100% CPU usage, but
>> just 36% memory usage (when serving 623 pages per second).
>> More interesting is that server performance increased by a factor of
>> six (!). Given that Yaws performance was ahead of Apache by a factor of
>> roughly 2.5 (at least on the benchmarks posted by Joe), it would be very
>> interesting to see how much Yaws profits from the new Linux kernel.
>
> Well, given that erts is bound to a single processor, you would
> need to create a cluster of erts nodes, all running yaws, with some
> type of load balancing front end.  This is one area Apache really
> shines in, as it easily allows this to be set up: because Apache is
> multi-process already, it can easily share the single TCP server
> socket with all of its siblings and decide who gets the next
> request.
>
> Does anyone think it might be possible to modify gen_tcp in such a
> way that we could use multiple nodes on the same system all bound
> to the same TCP port, and using some sort of accept lock between
> them?  I'd think this could be done something like this:
>
> 	% Set up a server socket, but let it be shared by this Erlang
> 	% node and all other processes on this box.
> 	gen_tcp:accept(... [shared])
>
> 	% Have this node take over accepting all new connections.  This
> 	% just pokes the socket into the erts event loop.
> 	gen_tcp:enable_accept(Port)
>
> 	% Have this node stop accepting new connections.  This just
> 	% removes the socket from the erts event loop.
> 	gen_tcp:disable_accept(Port)
>
> It might be necessary however (for performance reasons) to let the
> low level C driver also perform its own accept lock using a sys-v
> IPC sem, flock, fcntl, etc, on top of the Erlang managed enable and
> disable.  If the socket is enabled, then the driver should attempt
> to grab the accept lock, and only when it wins does it put the
> socket into the erts event loop.  Clearly this might be difficult,
> as the driver cannot block while trying to grab the accept lock.
>
> Note that Linux doesn't require the accept lock, I believe... I
> think its accept only returns the socket to one process.  But I'm
> not positive.
>
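
As an aside: within a single node I believe you can already get part
of the way there, since several Erlang processes can call
gen_tcp:accept/1 on the same listen socket concurrently.  That does
not address the multi-node case you describe, but it gives you an
acceptor pool for free.  A minimal sketch (module and function names
are my own, purely for illustration):

	-module(acceptor_pool).
	-export([start/2]).

	%% Open one listen socket and hand it to NumAcceptors
	%% processes, each blocking in gen_tcp:accept/1 on it.
	start(Port, NumAcceptors) ->
	    {ok, LSock} = gen_tcp:listen(Port, [binary, {active, false},
	                                        {reuseaddr, true}]),
	    [spawn_link(fun() -> acceptor(LSock) end)
	     || _ <- lists:seq(1, NumAcceptors)],
	    {ok, LSock}.

	acceptor(LSock) ->
	    {ok, Sock} = gen_tcp:accept(LSock),
	    %% Spawn a replacement acceptor, then serve this connection.
	    spawn(fun() -> acceptor(LSock) end),
	    serve(Sock).

	%% Trivial echo loop, just to keep the sketch self-contained.
	serve(Sock) ->
	    case gen_tcp:recv(Sock, 0) of
	        {ok, Data}      -> ok = gen_tcp:send(Sock, Data),
	                           serve(Sock);
	        {error, closed} -> ok
	    end.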



-- 
Ulf Wiger, Senior System Architect
EAB/UPD/S
