[erlang-questions] Handling Crash Reports at scale

ANTHONY MOLINARO <>
Tue Jun 4 18:31:56 CEST 2013


Well, I almost didn't use the term defensive programming, but to some, handling any fault would be considered defensive.

As an example, let's say I'm parsing the value of an HTTP cookie.

On the first pass I just assume the cookie is always good, and code for the correct case (least defensive).
If the cookie always parses correctly, I'm done. But let's say the parsing code starts failing and I see exceptions in my logs.  I then want to determine whether the parsing is failing because the parser is broken (a software defect) or because of random corruption of the HTTP stream.  So usually I'll put in enough defensive code (e.g., code which captures exceptions, instead of 'letting it fail') to log the error cases and allow processing to continue.
If the logging shows a defect, it can be fixed; if it shows general random corruption, the defensive code can be left in place or replaced with a counter.
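A minimal sketch of what that second, more defensive pass might look like (the module, function names, and placeholder parser are all hypothetical — substitute your real cookie parser):

```erlang
%% Hypothetical sketch: wrap the real parser so a bad cookie is
%% logged and reported as an error instead of crashing the caller.
-module(cookie_guard).
-export([safe_parse/1]).

safe_parse(CookieBin) ->
    try parse_cookie(CookieBin) of
        Parsed -> {ok, Parsed}
    catch
        Class:Reason ->
            %% Log enough context to distinguish a software defect
            %% from random corruption of the HTTP stream, then let
            %% processing continue.
            error_logger:warning_msg(
              "cookie parse failed: ~p:~p input=~p~n",
              [Class, Reason, CookieBin]),
            {error, bad_cookie}
    end.

%% Placeholder parser standing in for the real one: expects
%% exactly one "key=value" pair and fails on anything else.
parse_cookie(Bin) when is_binary(Bin) ->
    [K, V] = binary:split(Bin, <<"=">>),
    {K, V}.
```

So `cookie_guard:safe_parse(<<"sid=abc123">>)` would return `{ok, {<<"sid">>, <<"abc123">>}}`, while corrupted input yields `{error, bad_cookie}` plus a log line, rather than a crash report.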

I use mondemand for counters because I can get graphs easily and have systems which use the data streams to alert based on specific changes.

I think that defensive code is necessary in many cases to handle faults at the extremities of your system (and sometimes internally, depending on the boundaries).  So calling defensive code a code smell seems like an overgeneralization.
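For the counting variant Peer suggests below, a rough sketch of an error_logger handler that counts known-noisy errors could look like this (the module name and the matching rule are hypothetical; the event shapes follow error_logger's gen_event protocol):

```erlang
%% Hypothetical sketch: an error_logger report handler that keeps a
%% per-format count of known-uninteresting errors. Other handlers
%% (e.g. the default tty/file handlers) still see every event.
-module(noise_counter_h).
-behaviour(gen_event).
-export([init/1, handle_event/2, handle_call/2, handle_info/2,
         terminate/2, code_change/3]).

init(_Args) -> {ok, #{}}.

handle_event({error, _Gleader, {_Pid, Format, _Data}}, Counts)
  when is_list(Format) ->
    case is_known_noise(Format) of
        true  ->
            %% Bump the counter for this format string.
            {ok, maps:update_with(Format, fun(N) -> N + 1 end, 1, Counts)};
        false ->
            {ok, Counts}
    end;
handle_event(_Event, Counts) ->
    {ok, Counts}.

handle_call(get_counts, Counts) -> {ok, Counts, Counts};
handle_call(_Req, Counts) -> {ok, ignored, Counts}.

handle_info(_Info, Counts) -> {ok, Counts}.
terminate(_Arg, _Counts) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.

%% Hypothetical matching rule: treat cookie-parse failures as noise.
is_known_noise(Format) ->
    string:find(Format, "cookie parse failed") =/= nomatch.
```

It would be installed with `error_logger:add_report_handler(noise_counter_h, [])`, and the counts fetched via `gen_event:call(error_logger, noise_counter_h, get_counts)` for feeding into whatever metrics system you use.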

-Anthony

On Jun 4, 2013, at 4:48 AM, Peer Stritzinger <> wrote:

> I'm not sure what you mean by defensive code, but just to be sure:
> 
> You should never let error reporting influence how you handle faults in your software.  Defensive code is a code smell in Erlang.
> 
> What you could do is to have an error logger handler that matches certain often-occurring errors and ignores them.  Even better, it should count them (the count of something uninteresting is often interesting).
> 
> For counting and other metrics you might want to have a look at folsom https://github.com/boundary/folsom
> 
> -- Peer
> 
> 
> On 2013-06-03 17:29:54 +0000, ANTHONY MOLINARO said:
> 
> I'd recommend addressing the other crashes with "defensive code".  By capturing and categorizing certain types of errors and clearing them from your logs, you'll be able to better see true errors.  I usually take a two-phase approach.  First, I capture and log the types of errors.  Second, once I see that certain types are regular, I replace them with metrics (via mondemand).  Once you have a stream of errors you can monitor the rates, and uncaught errors will still make it into your logs, which should remain very sparse.
> 
> -Anthony
> 
> On Jun 3, 2013, at 9:11 AM, Ransom Richardson <> wrote:
> Are there tools/procedures that are recommended for processing crash reports from a service running at scale?
> 
> Currently we have a limited deployment and I look through all of the crash reports by hand. Some are very useful for finding actual bugs in our code. But other crashes are the result of clients sending bad data, strange timing issues (mostly not in our code), etc., and are not actionable. As we prepare to scale up our service, I'm wondering how to continue to get value from the interesting crash reports without having to look through all of the uninteresting ones.
> 
> I haven't found rb to be very useful for finding the new/interesting crashes. Are there effective ways that people are using it?
> 
> Are there other tools for parsing and grouping crash reports to make it easy to find new/interesting ones?
> 
> thanks,
> Ransom
> 
> _______________________________________________
> erlang-questions mailing list
> 
> http://erlang.org/mailman/listinfo/erlang-questions

