[erlang-questions] How to detect or be alerted of faults in distributed Erlang systems?

Stefan Hellkvist hellkvist@REDACTED
Thu May 19 12:12:18 CEST 2016


Hello, 

Does anyone have any recommendations of tools or ways of working to share about how you detect or be alerted of faults on a global level in a distributed Erlang system? 

The reason why I ask is that I had a trivial fault in a gen_server where one of the handle_cast clauses had a missing {noreply, State} at the end of the clause, so whenever this message arrived at the gen_server the gen_server crashed because of the bad return value from the handle_cast clause. Unfortunately, due to the automatic restart of the gen_server, this went unnoticed for some time because all the actions in the clause was executing well and the message in particular was not very frequent, so the tests run on system level passed (although the restart did have some slight performance effects when occurring).

This should’ve of course been caught in a unit tests but this clause was sadly not unit tested. The error was of course clearly reported in the report.log of the node in question but since there were so many nodes in the system and since I am obviously not aware of the process for collecting errors like this (and report them somewhere) I failed to see this in this particular report.log. Until today...

So what I am looking for is advice or tools (perhaps it is even part of OTP?) on how to detect such failures on any node at some central point. All the nodes, even if they are all Erlang nodes, are not necessarily connected with every other node though, so it would be good to have a tool that does not depend on the nodes being connected. 

Stefan




More information about the erlang-questions mailing list