[erlang-questions] Handling Crash Reports at scale

Tue Jun 4 15:15:24 CEST 2013

I often start out by writing the code naively. The naive way of writing code is to assume that everything goes fine and that nothing can go wrong. Then the first crash happens and you go into the code with the knowledge of "this can happen, what do I wanna do?". In many cases the error is on the caller's side, so just return a proper error to the caller. In some cases you need to fix your own code because your naive approach didn't work. And so on. Fixing the error cleans up the error log for that case.
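As a rough illustration of that cycle (the module and function names here are made up for the example, not taken from any real code base): the first version only handles the good path, and the error clause is added once the crash has actually been seen in the log.

```erlang
-module(naive_parse).
-export([parse_age/1]).

%% First, naive version (happy path only):
%%
%%   parse_age(Bin) -> {ok, binary_to_integer(Bin)}.
%%
%% It crashes with badarg as soon as a client sends <<"abc">>.

%% Revised version, written after that crash showed up in the log.
%% The fault is on the caller's side, so hand an error back to the
%% caller instead of crashing. Anything else (for example a
%% non-binary argument) still crashes, because we have not seen it
%% happen and do not yet know what we want to do about it.
parse_age(Bin) when is_binary(Bin) ->
    try
        {ok, binary_to_integer(Bin)}
    catch
        error:badarg -> {error, not_an_integer}
    end.
```

Only the one failure that occurred in practice is handled; the rest of the function stays as naive as before.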

In effect, you are building your software on the good path only and then you sporadically handle the few error cases that will occur in practice. The code you add is only defensive if you are in doubt it will ever get executed.
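For the crashes that remain and turn out to be uninteresting, the counting handler Peer suggests in the quoted message below could look roughly like this. It is a minimal sketch, not a tested production handler: the module name, the categories in classify/1 and the `counts` call are invented for the example, and it assumes crash reports arrive as error_logger `crash_report` events whose first element is a property list containing `error_info` (on newer OTP releases error_logger is a legacy interface, so check how your release delivers these events).

```erlang
-module(crash_counter).
-behaviour(gen_event).

-export([init/1, handle_event/2, handle_call/2, handle_info/2,
         terminate/2, code_change/3]).
-export([classify/1]).

%% State is a map from category to number of crashes seen.
init(_Args) ->
    {ok, #{}}.

%% Count crash reports by category; leave every other event alone.
handle_event({error_report, _GL, {_Pid, crash_report, [Own | _]}}, Counts)
  when is_list(Own) ->
    case proplists:get_value(error_info, Own) of
        {_Class, Reason, _Stack} ->
            Cat = classify(Reason),
            {ok, maps:update_with(Cat, fun(N) -> N + 1 end, 1, Counts)};
        _ ->
            {ok, Counts}
    end;
handle_event(_Other, Counts) ->
    {ok, Counts}.

%% Hypothetical query: return the current counts.
handle_call(counts, Counts) ->
    {ok, Counts, Counts};
handle_call(_Request, Counts) ->
    {ok, {error, unknown_call}, Counts}.

handle_info(_Info, Counts) ->
    {ok, Counts}.

terminate(_Reason, _Counts) ->
    ok.

code_change(_OldVsn, Counts, _Extra) ->
    {ok, Counts}.

%% Example categories only; replace with the reasons you actually see.
classify({bad_client_data, _}) -> bad_client_data;
classify(timeout)              -> timeout;
classify({timeout, _})         -> timeout;
classify(_)                    -> unclassified.
```

It would be installed with error_logger:add_report_handler(crash_counter) and queried with gen_event:call(error_logger, crash_counter, counts). Note that this handler only counts. It does not stop other handlers from logging the same reports, so keeping them out of the log means applying the same match in whichever handler writes the log. The counts could also be forwarded to a metrics library such as folsom instead of being kept in the handler state.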

Jesper Louis Andersen
  Erlang Solutions Ltd., Copenhagen

On Jun 4, 2013, at 1:48 PM, Peer Stritzinger <peerst@REDACTED> wrote:

> I'm not sure what you mean by defensive code, but just to be sure:
> 
> You should never let error reporting influence how you handle faults in your software.  Defensive code is a code smell in Erlang.
> 
> What you could do is have an error logger handler that matches certain frequently occurring errors and ignores them.  Even better, it should count them (the count of something uninteresting is often interesting).
> 
> For counting and other metrics you might want to have a look at folsom https://github.com/boundary/folsom
> 
> -- Peer
> 
> 
> On 2013-06-03 17:29:54 +0000, ANTHONY MOLINARO said:
> 
> I'd recommend addressing the other crashes with "defensive code".  By capturing and categorizing certain types of errors and clearing those from your logs you'll be able to better see true errors.  I usually have a two phase approach.  First phase, I capture and log the types of errors.  Second, once I see that certain types are regular, I replace those with metrics (via mondemand).  Once you have a stream of errors you can then monitor the rates, plus uncaught errors will make it into your logs, which should remain very sparse.
> 
> -Anthony
> 
> On Jun 3, 2013, at 9:11 AM, Ransom Richardson <ransomr@REDACTED> wrote:
> Are there tools/procedures that are recommended for processing crash reports from a service running at scale?
> 
> Currently we have a limited deployment and I look through all of the crash reports by hand. Some are very useful for finding actual bugs in our code. But other crashes are the result of clients sending bad data, strange timing issues (mostly not in our code), etc., and are not actionable. As we prepare to scale up our service, I'm wondering how to continue to get the value from the interesting crash reports without having to look through all of the uninteresting ones.
> 
> I haven't found rb to be very useful for finding the new/interesting crashes. Are there effective ways that people are using it?
> 
> Are there other tools for parsing and grouping crash reports to make it easy to find new/interesting ones?
> 
> thanks,
> Ransom
> 
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
> 