[erlang-questions] large scale deployments and netsplits

Wed Sep 16 20:02:14 CEST 2009

This is with the default net_ticktime (60 seconds).

=INFO REPORT==== 16-Sep-2009::17:44:55 ===
     module: stats
     elapsed: 20.014404
     {flash_errors,1}: 24350
     {flash_total_connected,1}: 479678
     {flash_total_fails,1}: 493159
     {flash_total_started,1}: 468819
     {flash_total_sub_ack,1}: 444984
     {flash_total_sub_req,1}: 448042
     {flash_total_tcp_errors,1}: 802
     {flash_connected,1}: 290320
     {flash_started,1}: 253302
     {flash_sub_ack,1}: 270553
     {flash_sub_req,1}: 270999
     {flash_tcp_errors,1}: 280
     {"flash_connected/sec",1}: 14505
     {"flash_started/sec",1}: 12655
     {"flash_sub_ack/sec",1}: 13517
     {"flash_sub_req/sec",1}: 13540
     {"flash_tcp_errors/sec",1}: 13

I have 468,819 bots started on 100 small EC2 instances,
1 VM per instance. 479,678 bots have connected. The connected
count is higher than the started count because a bot can
connect more than once, e.g. after an error. 290,320 bots
connected in just the last 20s.
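
As a sanity check on the report above, the "/sec" figures look like the
interval counters divided by the elapsed time and truncated. That is
only a reading of the numbers, not something taken from the stats
module itself. In the Erlang shell:

```erlang
%% Interval counters and elapsed time copied from the report above.
Elapsed = 20.014404.
trunc(290320 / Elapsed).   % flash_connected/sec  -> 14505
trunc(253302 / Elapsed).   % flash_started/sec    -> 12655
trunc(270553 / Elapsed).   % flash_sub_ack/sec    -> 13517
trunc(270999 / Elapsed).   % flash_sub_req/sec    -> 13540
trunc(280 / Elapsed).      % flash_tcp_errors/sec -> 13
```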

=ERROR REPORT==== 16-Sep-2009::17:45:15 ===
** Node 'janus@REDACTED' not responding **
** Removing (timedout) connection **
** at node janus@REDACTED **

=INFO REPORT==== 16-Sep-2009::17:45:15 ===
netsplit: down: 'janus@REDACTED', latency: 3.0330ms
** at node janus@REDACTED **

I added a piece of code that tracks the nodes that are up,
pings them every 15s and records the latency of each pong.
When a node goes down, I print the latency recorded for it.
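
A minimal sketch of such a monitor might look like the following. It is
an illustration under assumptions, not the code that produced the report
above: the module name netsplit_watch and the helpers measure/2 and
format/1 are hypothetical, and it assumes the VM is already running as a
distributed node. It uses only net_kernel:monitor_nodes/1,
net_adm:ping/1 and timer:tc/3 from OTP.

```erlang
-module(netsplit_watch).            % hypothetical module name
-export([start/0, init/0]).

-define(PING_INTERVAL_MS, 15000).   % ping every 15s, as described above

start() ->
    spawn(?MODULE, init, []).

init() ->
    %% Subscribe to {nodeup, Node} and {nodedown, Node} messages.
    %% Assumes a distributed node; otherwise this match fails.
    ok = net_kernel:monitor_nodes(true),
    %% Seed the table with the nodes that are already connected.
    Latencies = maps:from_list([{N, undefined} || N <- nodes()]),
    erlang:send_after(?PING_INTERVAL_MS, self(), ping_all),
    loop(Latencies).

loop(Latencies) ->
    receive
        {nodeup, Node} ->
            loop(Latencies#{Node => undefined});
        {nodedown, Node} ->
            Last = maps:get(Node, Latencies, undefined),
            error_logger:info_msg("netsplit: down: ~p, latency: ~s~n",
                                  [Node, format(Last)]),
            loop(maps:remove(Node, Latencies));
        ping_all ->
            erlang:send_after(?PING_INTERVAL_MS, self(), ping_all),
            loop(maps:map(fun measure/2, Latencies))
    end.

%% Time one net_adm:ping/1 round trip; keep the old value on pang.
measure(Node, Old) ->
    case timer:tc(net_adm, ping, [Node]) of
        {Micros, pong}  -> Micros / 1000;   % microseconds -> milliseconds
        {_Micros, pang} -> Old
    end.

format(undefined) -> "unknown";
format(Ms)        -> io_lib:format("~.4fms", [Ms]).
```

Note that a monitor like this only reports the last successful round
trip, which can be up to 15s old by the time the nodedown message
arrives. The pings are also sequential and blocking, so with many nodes
the nodedown handling can lag behind.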

What I see is a latency of just 3ms (within EC2 of course)
when the node splits. What could be causing this?

	Thanks, Joel

---
fastest mac firefox!
http://wagerlabs.com