[erlang-questions] How to detect or be alerted of faults in distributed Erlang systems?

Thu May 19 12:29:07 CEST 2016

On Thursday 19 May 2016 12:12:18 Stefan Hellkvist wrote:
> Hello, 
> 
> Does anyone have recommendations for tools or ways of working for detecting, or being alerted to, faults on a global level in a distributed Erlang system?
> 
> The reason I ask is that I had a trivial fault in a gen_server where one of the handle_cast clauses had a missing {noreply, State} at the end of the clause, so whenever this message arrived at the gen_server the gen_server crashed because of the bad return value from the handle_cast clause. Unfortunately, due to the automatic restart of the gen_server, this went unnoticed for some time because all the actions in the clause were executing well and the message in particular was not very frequent, so the tests run on system level passed (although the restart did have some slight performance effects when occurring).
> 
> This should of course have been caught by a unit test, but this clause was sadly not unit tested. The error was of course clearly reported in the report.log of the node in question, but since there were so many nodes in the system, and since I am obviously not aware of the process for collecting errors like this (and reporting them somewhere), I failed to see it in this particular report.log. Until today...
> 
> So what I am looking for is advice or tools (perhaps it is even part of OTP?) on how to detect such failures on any node at some central point. All the nodes, even if they are all Erlang nodes, are not necessarily connected with every other node though, so it would be good to have a tool that does not depend on the nodes being connected. 

Hi, Stefan.

I think you'll probably get a million good specific recommendations here -- but I'll explain the two most frequent general cases I've seen:

1- Your organization's sysop people already have some monitoring framework for errors and your system just isn't sending data to them. (Like Sentry or whatever.)
2- Your organization has some log analysis jobs that extract some aggregate usage history in addition to highlighting error conditions reported in log messages -- but errors produced by your application just aren't in the rules yet.

For systems running in a pre-existing infrastructure or any significantly large system one (or both) of these has always been the case for me -- so after a discussion with a sysop the task becomes one of integration, not standing up a new service.

For smaller systems, though, or totally self-contained ones, you might not even see the logs for months and may have no way to analyze them. In this case (assuming the systems are networked, at least intermittently) having them pass at least their error logs to a central logging service can be a big benefit (here I'm thinking specifically of some meteorological systems I once dealt with that were scattered around a large geographical area, but did actually have some low-bandwidth network access).
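
To make the "pass at least error logs to a central logging service" idea concrete, here is a minimal sketch, not a recommendation of any particular tool. It assumes OTP 21 or later (the `logger` application, which is newer than this thread) and a collector that accepts plain UDP datagrams; the module name `central_error_forwarder` and the one-line message format are made up for illustration. Because it uses UDP rather than distributed Erlang, the nodes do not need to be connected to each other.

```erlang
-module(central_error_forwarder).
-export([install/2, log/2]).

%% Register this module as a logger handler that only receives events
%% of level error or above. Host and Port identify the central
%% collector (hypothetical; whatever your log service listens on).
install(Host, Port) ->
    logger:add_handler(?MODULE, ?MODULE,
                       #{level => error,
                         config => #{host => Host, port => Port}}).

%% logger handler callback. It runs in the process that emitted the
%% event, so it must stay cheap and must not crash (a crashing
%% handler is removed by logger).
log(#{level := Level, msg := Msg},
    #{config := #{host := Host, port := Port}}) ->
    %% Illustrative format only: node name, level, raw message term.
    Line = io_lib:format("~p ~p ~p~n", [node(), Level, Msg]),
    %% Opening a socket per event keeps the sketch short; a real
    %% handler would likely keep one socket open in its own process.
    case gen_udp:open(0) of
        {ok, Socket} ->
            _ = gen_udp:send(Socket, Host, Port, Line),
            gen_udp:close(Socket);
        {error, _Reason} ->
            ok
    end.
```

Each node would call something like `central_error_forwarder:install("logs.example.internal", 5140)` at startup (host and port are placeholders). With the default SASL-compatible logging in place, a gen_server crash such as the bad handle_cast return described above is logged at error level, so it would show up at the collector instead of only in the local report.log.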

As far as a single go-to recommendation ("Install The All New WhizzlePop Error Genie App (FOR ERLANG!!1!) and all your troubles will be a distant memory!")... I don't have one. There has always been some log analysis utility already sitting around somewhere, and I've only needed to make sure my system was integrating with it (a task I'll be dealing with next week all over again, in fact).

-Craig