20k messages in 4s but want to go faster!

Fri Jul 10 22:59:13 CEST 2009

For testing purposes, if you could run the clients on the same node and
the bot manager on a different node, then the clients would have their
timestamps synced, and it would give you an idea of the latency. (That's
what I do, and I presume distributing the clients would be faster, so at
least I get a worst-case scenario.)

What I do is have all the clients send a message to the test manager
(equivalent to your bot manager) when they are ready. The test manager
then has one client send a message to the group, and as each client
gets the message it records the avg/min/max latency. The test manager
waits a certain time to allow all clients to get the message (I
actually send multiple messages to get a good average), then sends a
done message to each client. When a client gets the done message, it
sends its latency stats to the test manager. Once the test manager has
all the latencies, it prints out the averages and min/max.
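
As a rough sketch of the bookkeeping described above (Python purely for
illustration; LatencyStats, record and summarize are made-up names, and
it assumes each message carries its send timestamp and that sender and
receivers share one clock), each client could keep a running
avg/min/max and the test manager could combine the reports:

```python
from dataclasses import dataclass


@dataclass
class LatencyStats:
    """Running latency stats kept by one client (hypothetical helper)."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = 0.0

    def record(self, sent_at, received_at):
        # Assumes both timestamps come from the same clock (same node).
        latency = received_at - sent_at
        self.count += 1
        self.total += latency
        self.minimum = min(self.minimum, latency)
        self.maximum = max(self.maximum, latency)

    @property
    def average(self):
        return self.total / self.count if self.count else 0.0


def summarize(per_client):
    """Combine the stats each client reports after the done message."""
    reported = [s for s in per_client if s.count > 0]
    if not reported:
        return None
    return {
        "clients": len(reported),
        "avg": sum(s.average for s in reported) / len(reported),
        "min": min(s.minimum for s in reported),
        "max": max(s.maximum for s in reported),
    }
```

Only the receive path touches the stats, so the time spent talking to
the test manager never enters the measurement.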

This works pretty well and times the message latency through the
system and nothing else, not even the communication time with the test
manager. As I time multiple messages, it also smooths out any dynamic
module load times that may occur when getting the first message, etc.
What I am looking for is an accurate latency time once the system is
stable, not the startup time etc.
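
To illustrate how a slow first message affects the average (the numbers
are invented, and mean_latency with its skip_first parameter is a
hypothetical helper, not part of the test setup described here), a
minimal Python sketch comparing the plain average with one that drops
the warm-up sample:

```python
def mean_latency(samples, skip_first=0):
    """Average latency, optionally ignoring the first few warm-up samples."""
    steady = samples[skip_first:]
    if not steady:
        raise ValueError("no samples left after skipping warm-up")
    return sum(steady) / len(steady)


# Invented latencies in seconds: the first message pays a one-off load cost.
samples = [1.0, 0.25, 0.25, 0.25]

overall = mean_latency(samples)                 # includes the slow first message
steady = mean_latency(samples, skip_first=1)    # steady-state messages only
```

The more messages are timed, the less the one slow sample weighs on the
plain average; skipping it removes the effect entirely.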

I don't seem to be able to run 20,000 clients right now, probably
a memory limitation or something (it just hangs with no error dump, I
need to debug that!). However, I get a message latency of around 100ms
worst case for 4,000 clients all running on the same node, and even
on the same machine as the tester. The average latency is much lower,
around 60ms. So I would expect to be able to get well under 1 second
latency for 20,000 clients, especially if they are distributed. One
caveat, though, is that my messages are UDP, not TCP (it's a voice
server), but I can send them over TCP. I'll give it a try and see if
there is much difference.

On Jul 10, 1:17 pm, Joel Reymont <joe...@REDACTED> wrote:
> On Jul 10, 2009, at 9:07 PM, Jim Morris wrote:
>
> > I am wondering if you are not including in
> > your timing the process death and processing of the DOWN message as
> > well, which could be a lot longer than sending/receiving the message.
>
> I _am_ including all that in the timing, which probably skews my stats.
> That overhead would not be present when sending to a browser-based  
> client.
>
> > I suspect you are only interested in the latency between the bot
> > sending the message and the 20k clients getting that message, not
> > process exit time and associated garbage cleanup etc.
>
> Absolutely right.
>
> > In your case if the bots can communicate with the bot manager and all
> > are running on the same node,
>
> Bots are running on different servers to spread the load.
>
> > maybe you could send the timestamp of
> > when each bot gets the message to the bot manager, and the timestamp
> > when the message was sent,
>
> I like the idea and I thought about it. I'm not sure if the EC2  
> instances are time-synced, though.
>
> There's the same monitoring and DOWN message overhead involved in  
> firing the timer in the bots since I need to broadcast when all bots  
> are ready and that's when bots start their timers.
>
> A process exits when 'bumped' by the required number of bots. Both the  
> broadcasting bot and the  client bots listen for DOWN from this single  
> process to publish and start timing.
>
>         Thanks, Joel
>
> ---
> Mac hacker with a performance bent
> http://www.linkedin.com/in/joelreymont
>
> ________________________________________________________________
> erlang-questions mailing list. See http://www.erlang.org/faq.html
> erlang-questions (at) erlang.org