[erlang-questions] Handling Crash Reports at scale

Peer Stritzinger peerst@REDACTED
Tue Jun 4 23:04:18 CEST 2013


We obviously mean something different by defensive coding.

Apparently there is a range of definitions for the word:

From:

http://c2.com/cgi/wiki?DefensiveProgramming   "Defensive programming 
defends against the currently impossible."

To:

http://en.wikipedia.org/wiki/Defensive_programming  defines it as 
basically "Secure Programming", a context in which I hadn't heard it 
before (their first example, strcpy to a fixed buffer vs. strncpy, is a 
rather bad example; it's more incorrect programs vs. correct programs)

When I used the word I certainly didn't mean checking at the user/world 
facing borders of systems.   Also I didn't mean using assert() in C or 
pattern matching as an implicit assert in Erlang.

In the sense I meant it, it's:

%% not defensive, since it's a bug in my system if this doesn't work
{ok, F} = file:open("/var/log/mylogfile", [append])

vs.

%% defensively handling potentially buggy behaviour
case file:open("/var/log/mylogfile", [append]) of
    {ok, F} ->
        proceed_normally(F);
    {error, Reason} ->
        dont_know_what_to_do_now_really_trying_to_fix_the_situation_somehow()
end

Might not be the perfect example.

But by all means check your inputs; if it is a "this can't happen" 
situation, feel free to crash the subsystem.  If it does show up in 
your logs, either fix the bug or update your misconception of "this 
can't happen" if it occurs every 5 minutes ;-)  <-- and if it would 
otherwise fill your logs, handle it and count it instead
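
As a rough sketch of that handle-it-and-count-it variant (the enospc 
case, bump_counter/1 and the counters ets table are all made up for 
illustration):

open_log(File) ->
    case file:open(File, [append]) of
        {ok, F} ->
            F;
        {error, enospc} ->                       %% the "every 5 minutes" case
            bump_counter({open_failed, enospc}), %% count it instead of logging it
            retry_later;
        {error, _} = Other ->
            erlang:error(Other)                  %% anything else is still a bug, so crash
    end.

bump_counter(Key) ->
    %% assumes a public ets table named 'counters' created at startup
    case ets:insert_new(counters, {Key, 1}) of
        true  -> ok;
        false -> ets:update_counter(counters, Key, 1), ok
    end.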

-- Peer

On 2013-06-04 16:31:56 +0000, ANTHONY MOLINARO said:

> Well, I almost didn't use the term defensive programming, but to some, 
> handling any fault would be considered defensive.
> 
> As an example, let's say I'm parsing the value of an HTTP cookie.
> 
> On the first pass I just assume the cookie is always good, and code for 
> the correct case (least defensive).
> If the cookie always parses correctly, I'm done, but let's say the 
> parsing code starts failing and I see exceptions in my logs.  I then 
> want to determine whether the parsing is failing because the parser is 
> broken (a software defect) or because of random corruption of the HTTP 
> stream.  So usually I'll add just enough defensive code (e.g., code 
> which captures exceptions instead of 'letting it fail') to log the 
> error cases and allow processing to continue.
> If the logging shows a defect, it can be fixed; if it shows this is 
> just general random corruption, the code can be left in place or 
> replaced with a counter.
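> 
> For instance, something along these lines (parse_cookie/1 and 
> log_and_count/2 are made-up names standing in for the real parser and 
> whatever records the failure):
> 
> cookie_value(Header) ->
>     try
>         {ok, parse_cookie(Header)}
>     catch
>         Class:Reason ->
>             %% capture the exception, record it, keep serving the request
>             log_and_count(cookie_parse_failure, {Class, Reason, Header}),
>             {error, bad_cookie}
>     end.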
> 
> I use mondemand for counters because I can get graphs easily and have 
> systems which use the data streams to alert based on specific changes.
> 
> I think that defensive code is necessary in many cases to handle faults 
> at the extremities of your system (and sometimes internally depending 
> on the boundaries).  So calling defensive code a code smell seems to 
> overly generalize it.
> 
> -Anthony
> 
> On Jun 4, 2013, at 4:48 AM, Peer Stritzinger <peerst@REDACTED> wrote:
> I'm not sure what you mean by defensive code, but just to be sure:
> 
> You should never let error reporting influence how you handle faults in 
> your software.  Defensive code is a code smell in Erlang.
> 
> What you could do is have an error logger handler that matches certain 
> frequently occurring errors and ignores them.  Even better, it should 
> count them (the count of something uninteresting is often interesting).
> 
> For counting and other metrics you might want to have a look at folsom 
> https://github.com/boundary/folsom
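> 
> For instance, a bare-bones report handler along these lines, added with 
> error_logger:add_report_handler(noise_filter_h, []), counting via a 
> folsom counter instead of printing (the module name and is_known_noise/1 
> are just placeholders):
> 
> -module(noise_filter_h).
> -behaviour(gen_event).
> -export([init/1, handle_event/2, handle_call/2,
>          handle_info/2, terminate/2, code_change/3]).
> 
> init([]) ->
>     folsom_metrics:new_counter(noisy_crashes),
>     {ok, no_state}.
> 
> %% Count the known-noisy crash reports ...
> handle_event({error_report, _GL, {_Pid, crash_report, Report}}, State) ->
>     case is_known_noise(Report) of
>         true  -> folsom_metrics:notify({noisy_crashes, {inc, 1}});
>         false -> ok   %% not ours to handle; other handlers still see it
>     end,
>     {ok, State};
> %% ... and leave everything else alone.
> handle_event(_Event, State) ->
>     {ok, State}.
> 
> is_known_noise(_Report) ->
>     false.   %% placeholder: match whatever you consider uninteresting
> 
> handle_call(_Req, State) -> {ok, ok, State}.
> handle_info(_Info, State) -> {ok, State}.
> terminate(_Reason, _State) -> ok.
> code_change(_OldVsn, State, _Extra) -> {ok, State}.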
> 
> -- Peer
> 
> 
> On 2013-06-03 17:29:54 +0000, ANTHONY MOLINARO said:
> 
> I'd recommend addressing the other crashes with "defensive code".  By 
> capturing and categorizing certain types of errors and clearing those 
> from your logs you'll be able to better see true errors.  I usually 
> have a two phase approach.  First phase, I capture and log the types of 
> errors.  Second, once I see certain types are regular, I replace those 
> with metrics (via mondemand).  Once you have a stream of errors you can 
> then monitor the rates, plus uncaught errors will make it into your 
> logs which should remain very sparse.
> 
> -Anthony
> 
> On Jun 3, 2013, at 9:11 AM, Ransom Richardson <ransomr@REDACTED> wrote:
> Are there tools/procedures that are recommended for processing crash 
> reports from a service running at scale?
> 
> Currently we have a limited deployment and I look through all of the 
> crash reports by hand. Some are very useful for finding actual bugs in 
> our code. But other crashes are the result of clients sending bad 
> data, strange timing issues (mostly not in our code), etc., and are not 
> actionable. As we prepare to scale up our service, I'm wondering how to 
> continue to get the value from the interesting crash reports without 
> having to look through all of the uninteresting ones. 
> 
> I haven't found rb to be very useful for finding the new/interesting 
> crashes. Are there effective ways that people are using it?
> 
> Are there other tools for parsing and grouping crash reports to make it 
> easy to find new/interesting ones?
> 
> thanks,
> Ransom
> 