[erlang-questions] How to detect or be alerted of faults in distributed Erlang systems?

Thu May 19 13:51:42 CEST 2016

If a gen_server dies, error_logger is called. You need to find a way
to get hold of those error reports in a timely fashion. In our system,
we do this by using lager (https://github.com/basho/lager), which
installs an error_logger hook and turns these error reports into
error-level logging (alongside ordinary calls to lager:error).

Then we use the lager_syslog backend to forward everything to our
syslog / Logstash / Elasticsearch infrastructure, where we *could*
slice and dice the reports for spotting these kinds of things.
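
For illustration, a sys.config fragment along these lines enables the
syslog backend. It is only a sketch, not our actual configuration: the
identity string "myapp" and the local0 facility are placeholders, and
lager and lager_syslog are assumed to be dependencies of the release.

```erlang
%% sys.config (sketch; "myapp" and local0 are placeholder values)
[
 {lager, [
   {handlers, [
     %% {lager_syslog_backend, [Identity, Facility, Level]}
     %% Forward error-level (and more severe) messages to syslog.
     {lager_syslog_backend, ["myapp", local0, error]}
   ]}
 ]}
].
```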

However, that's not what we *actually* do: we also have a custom lager
backend that publishes error-level messages to a RabbitMQ queue. The
queue consumer groups the error reports together and then periodically
emails them to a designated address. By regularly checking the
relevant folder in my mail client, I can triage the error reports and
file the appropriate bugs in our issue tracker.

Obviously you could use your own error_logger hook to do this, rather
than using lager.
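
A minimal sketch of such a hook, assuming the classic error_logger
event format: it is an ordinary gen_event handler. The module name
crash_report_handler and the Sink fun are made up for this example;
the fun stands in for whatever transport you prefer (syslog, a message
queue, email, ...).

```erlang
-module(crash_report_handler).
-behaviour(gen_event).

-export([init/1, handle_event/2, handle_call/2, handle_info/2,
         terminate/2, code_change/3]).

%% The handler state is a fun of arity 1 that receives each error
%% as {Kind, Pid, Text}. Hypothetical: replace it with your transport.
init(Sink) when is_function(Sink, 1) ->
    {ok, Sink}.

%% Formatted errors, e.g. "** Generic server ... terminating".
handle_event({error, _GroupLeader, {Pid, Format, Args}}, Sink) ->
    Sink({error, Pid, safe_format(Format, Args)}),
    {ok, Sink};
%% Structured reports, e.g. crash_report and supervisor_report.
handle_event({error_report, _GroupLeader, {Pid, Type, Report}}, Sink) ->
    Sink({Type, Pid, safe_format("~p", [Report])}),
    {ok, Sink};
%% Ignore info and warning events.
handle_event(_Other, Sink) ->
    {ok, Sink}.

handle_call(_Request, Sink) -> {ok, ok, Sink}.
handle_info(_Info, Sink) -> {ok, Sink}.
terminate(_Reason, _Sink) -> ok.
code_change(_OldVsn, Sink, _Extra) -> {ok, Sink}.

%% A handler that crashes is removed by gen_event, so never let a bad
%% format string take it down.
safe_format(Format, Args) ->
    try
        lists:flatten(io_lib:format(Format, Args))
    catch
        _:_ -> lists:flatten(io_lib:format("~p ~p", [Format, Args]))
    end.
```

It would be installed on each node with something like
error_logger:add_report_handler(crash_report_handler, Sink), where
Sink is a fun such as fun(Msg) -> my_forwarder ! Msg end (my_forwarder
being a hypothetical registered process). Since each node pushes its
own reports, the nodes do not need to be connected to each other.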

On 19 May 2016 at 11:12, Stefan Hellkvist <hellkvist@REDACTED> wrote:
> Hello,
>
> Does anyone have any recommendations of tools or ways of working to share about how you detect, or get alerted to, faults on a global level in a distributed Erlang system?
>
> The reason I ask is that I had a trivial fault in a gen_server where one of the handle_cast clauses had a missing {noreply, State} at the end of the clause, so whenever this message arrived at the gen_server, the gen_server crashed because of the bad return value from the handle_cast clause. Unfortunately, due to the automatic restart of the gen_server, this went unnoticed for some time because all the actions in the clause were executing correctly and this particular message was not very frequent, so the tests run at system level passed (although the restart did have some slight performance effects when it occurred).
>
> This should of course have been caught by a unit test, but this clause was sadly not unit tested. The error was of course clearly reported in the report.log of the node in question, but since there were so many nodes in the system and since I am obviously not aware of the process for collecting errors like this (and reporting them somewhere), I failed to see it in this particular report.log. Until today...
>
> So what I am looking for is advice or tools (perhaps it is even part of OTP?) on how to detect such failures on any node at some central point. The nodes, even though they are all Erlang nodes, are not necessarily connected to every other node, so it would be good to have a tool that does not depend on the nodes being connected.
>
> Stefan