[erlang-questions] The odd behaviour of net_ticktime
Wed Jul 3 11:28:22 CEST 2013
I have observed - anecdotally via the m/l and Google, and through reading the code in the kernel application - that net_kernel's tick mechanism appears to do nothing to prioritise tick messages over other traffic on the dist port. So when the connection between two nodes is very busy, tick messages can lag behind other traffic, leading to unnecessary disconnects.
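For context, the knob in question (the tick-interval-is-TickTime/4 detail is from the kernel docs, as I read them):

```erlang
%% sys.config - as I understand it, ticks go out roughly every
%% net_ticktime/4 seconds, and a node is considered unreachable if
%% nothing at all arrives within net_ticktime (default 60) seconds:
[{kernel, [{net_ticktime, 60}]}].
```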
Firstly, have I understood this correctly? I couldn't find anything in the VM code that indicates special handling of these messages, nor have I found any statements from the OTP team on the m/l indicating otherwise.
Secondly, if I'm right, then why is net_kernel doing things this way? If there is traffic on a dist port (i.e., socket), then clearly we've not actually experienced a netsplit, and reporting one seems like the wrong thing to do.
What I'm contemplating at the moment is setting net_ticktime to a very high value and implementing user-level heartbeats, combined with out-of-band checks (using getstat) on the dist ports in use, to verify that traffic is (or isn't) actually flowing before firing a 'DOWN' event. I don't really want to put in all that effort if it isn't necessary, though.
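For what it's worth, here is a rough sketch of the kind of check I mean. The module name, the idea of passing the dist Port in from outside, and the notify_down behaviour are all my own assumptions - there is no public API for getting hold of the dist socket, so treat this as a sketch rather than working code:

```erlang
-module(hb_check).
-export([start/3]).

%% Watch a known dist Port for Node: if no bytes have been received
%% since the last check, send an explicit ping before deciding the
%% node is really unreachable. Obtaining the dist Port itself is the
%% awkward part (no supported API), so it is taken as an argument.
start(Node, Port, IntervalMs) ->
    spawn(fun() -> loop(Node, Port, IntervalMs, 0) end).

loop(Node, Port, Interval, LastIn) ->
    timer:sleep(Interval),
    case inet:getstat(Port, [recv_cnt]) of
        {ok, [{recv_cnt, In}]} when In > LastIn ->
            loop(Node, Port, Interval, In);        % traffic is flowing
        {ok, [{recv_cnt, In}]} ->
            case net_adm:ping(Node) of             % quiet link: probe it
                pong -> loop(Node, Port, Interval, In);
                pang -> notify_down(Node)          % genuinely unreachable
            end;
        {error, _} ->
            notify_down(Node)                      % port closed underneath us
    end.

notify_down(Node) ->
    %% force the local monitors / 'DOWN' messages to fire
    net_kernel:disconnect_node(Node).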
Another question too, if I may: assuming I do decide to implement a user level heartbeat/keep-alive mechanism, I'm undecided about /how/ to indicate a detected netsplit to the processes in my application(s). Because most of these work on monitors, 'DOWN' messages and/or net_kernel:monitor_nodes already, I'm inclined to call net_kernel:disconnect_node/1 in response to seeing a netsplit in my user-level monitoring code - I would expect that to trigger all the relevant local monitors in a timely fashion. Does that sound like a reasonable way to go?
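That is how I'd expect it to behave, at least - disconnect_node/1 tears the connection down locally, so nodedown and monitor 'DOWN' messages should fire. A quick shell check of that expectation (node names are mine, assuming 'b@host' is already connected):

```erlang
%% on a@host, with b@host connected:
net_kernel:monitor_nodes(true),
net_kernel:disconnect_node('b@host'),
receive
    {nodedown, 'b@host'} -> got_nodedown
after 5000 -> timeout
end.
```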