[erlang-questions] How to detect or be alerted of faults in distributed Erlang systems?

Fri May 20 07:32:48 CEST 2016

Thank you all for your suggestions and recommendations. I will look into WombatOAM and see what that is about. I do not have any existing, larger OAM system to integrate with so I will have to roll my own somehow. The nodes use lager and rabbitmq already so a solution similar to what Roger suggested might be what I will attempt. 

Thanks
Stefan

> On 19 May 2016, at 14:06, Zsolt Laky <zslaky@REDACTED> wrote:
> 
> Hi Stefan,
> 
> In addition to what Roger mentioned, you may consider a monitoring tool like WombatOAM. It can be a big help in running a distributed environment. It works out of the box, so there is no need to implement sophisticated alert systems yourself.
> 
> Best,
> 
> Zsolt
>> On May 19, 2016, at 1:51 PM, Roger Lipscombe <roger@REDACTED> wrote:
>> 
>> If a gen_server dies, error_logger is called. You need to find a way
>> to get hold of those error reports in a timely fashion. In our system,
>> we do this by using lager (https://github.com/basho/lager), which
>> installs an error_logger hook and turns these error reports into
>> error-level logging (alongside ordinary calls to lager:error).
>> 
>> Then we use the lager_syslog backend to forward everything to our
>> syslog / logstash / Elasticsearch infrastructure, where we *could*
>> slice and dice the reports to spot these kinds of things.
>> 
>> However, that's not what we *actually* do: we also have a custom lager
>> backend that publishes error-level messages to a rabbit queue. The
>> queue consumer groups the error reports together and then periodically
>> emails them to a designated address. By regularly checking the
>> relevant folder in my mail client, I can triage the error reports and
>> file the appropriate bugs in our issue tracker.
>> 
>> Obviously you could use your own error_logger hook to do this, rather
>> than using lager.
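
A minimal sketch of such a custom error_logger hook, for illustration only. The module name alert_handler and the ForwardFun callback are hypothetical; the fun stands in for whatever transport is used to reach the central point (for example a publish to a message queue). It assumes the classic error_logger event formats.

```erlang
-module(alert_handler).
-behaviour(gen_event).

-export([install/1]).
-export([init/1, handle_event/2, handle_call/2, handle_info/2,
         terminate/2, code_change/3]).

%% ForwardFun is a hypothetical fun/1 supplied by the caller. It receives
%% each error report and is expected to ship it off the node.
install(ForwardFun) when is_function(ForwardFun, 1) ->
    error_logger:add_report_handler(?MODULE, ForwardFun).

init(ForwardFun) ->
    {ok, ForwardFun}.

%% Formatted errors, e.g. the "Generic server ... terminating" message.
handle_event({error, _GL, {Pid, Format, Data}}, ForwardFun) ->
    forward(ForwardFun, {node(), Pid, error, {Format, Data}}),
    {ok, ForwardFun};
%% Structured error reports, e.g. crash and supervisor reports.
handle_event({error_report, _GL, {Pid, Type, Report}}, ForwardFun) ->
    forward(ForwardFun, {node(), Pid, Type, Report}),
    {ok, ForwardFun};
%% Ignore info and warning events.
handle_event(_Other, ForwardFun) ->
    {ok, ForwardFun}.

handle_call(_Request, ForwardFun) ->
    {ok, ok, ForwardFun}.

handle_info(_Info, ForwardFun) ->
    {ok, ForwardFun}.

terminate(_Reason, _ForwardFun) ->
    ok.

code_change(_OldVsn, ForwardFun, _Extra) ->
    {ok, ForwardFun}.

%% A handler that crashes is removed from the event manager, so a failure
%% in the transport must not propagate.
forward(ForwardFun, Report) ->
    try ForwardFun(Report) catch _:_ -> ok end.
```

Including node() in each report lets the central consumer tell which node the fault came from, without requiring the nodes to be connected to each other.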
>> 
>> On 19 May 2016 at 11:12, Stefan Hellkvist <hellkvist@REDACTED> wrote:
>>> Hello,
>>> 
>>> Does anyone have any recommendations of tools or ways of working to share about how you detect, or get alerted to, faults on a global level in a distributed Erlang system?
>>> 
>>> The reason I ask is that I had a trivial fault in a gen_server: one of the handle_cast clauses was missing the {noreply, State} at the end, so whenever that message arrived the gen_server crashed because of the bad return value from handle_cast. Unfortunately, because the gen_server was restarted automatically, this went unnoticed for some time. All the actions in the clause executed correctly and the message was not very frequent, so the system-level tests passed (although the restarts did have a slight performance impact when they occurred).
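
For illustration, a fault of this kind can be reproduced with a minimal sketch like the one below. The module name cast_demo and the message shapes are made up for the example and are not taken from the actual system.

```erlang
-module(cast_demo).
-behaviour(gen_server).

-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
    {ok, 0}.

handle_call(get, _From, State) ->
    {reply, State, State}.

%% BUG: the clause ends with its side effect, so it returns 'ok' (the
%% result of io:format/2) instead of {noreply, State}. The side effect
%% still happens, then gen_server terminates with the reason
%% {bad_return_value, ok} and the supervisor restarts it.
%% The fix is to end the clause with {noreply, State}.
handle_cast({log, Msg}, _State) ->
    io:format("got ~p~n", [Msg]);
%% Correct clause, for comparison.
handle_cast({add, N}, State) ->
    {noreply, State + N}.
```

Because the side effect completes before the crash, tests that only check the outcome of the action will pass, which matches the behaviour described above.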
>>> 
>>> This should of course have been caught by a unit test, but sadly this clause was not unit tested. The error was clearly reported in the report.log of the node in question, but since there are so many nodes in the system, and since I am obviously not aware of any process for collecting errors like this and reporting them somewhere, I failed to spot it in that particular report.log. Until today...
>>> 
>>> So what I am looking for is advice or tools (perhaps this is even part of OTP?) for detecting such failures on any node at some central point. Although they are all Erlang nodes, the nodes are not necessarily connected to every other node, so it would be good to have a tool that does not depend on the nodes being connected.
>>> 
>>> Stefan
>>> 
>>> _______________________________________________
>>> erlang-questions mailing list
>>> erlang-questions@REDACTED
>>> http://erlang.org/mailman/listinfo/erlang-questions
>