[erlang-questions] Handling Crash Reports at scale
Mon Jun 3 19:29:54 CEST 2013
I'd recommend addressing the other crashes with "defensive code". By capturing and categorizing certain types of errors and clearing them from your logs, you'll be able to see the true errors more clearly. I usually take a two-phase approach. In the first phase, I capture and log each type of error. In the second, once I see that certain types occur regularly, I replace their log entries with metrics (via mondemand). Once you have a stream of error metrics you can monitor their rates, while uncaught errors still make it into your logs, which should remain very sparse.
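A rough sketch of what such a wrapper could look like is below. The module name, the category names, and the OnKnown callback are all assumptions made for illustration; this is not code from a production system and it does not use the mondemand API. In phase one OnKnown would log the category, and in phase two it would bump a counter instead.

```erlang
%% Illustrative sketch only: names and categories are hypothetical.
%% Uses the Class:Reason:Stack catch syntax, so it needs OTP 21 or later.
-module(crash_triage).
-export([classify/2, guarded/2]).

%% Map an exception to a coarse category. Extend the clauses as
%% recurring, non-actionable errors show up in the logs.
classify(throw, {bad_request, _Detail})          -> bad_client_data;
classify(exit, {timeout, {gen_server, call, _}}) -> call_timeout;
classify(exit, {noproc, {gen_server, call, _}})  -> peer_gone;
classify(_Class, _Reason)                        -> unknown.

%% Run Fun. A recognized error is passed to OnKnown (log it, or count it)
%% and returned as {error, Category}. Anything unrecognized is re-raised
%% with its original stack trace so it still produces a crash report.
guarded(Fun, OnKnown) when is_function(Fun, 0), is_function(OnKnown, 1) ->
    try
        {ok, Fun()}
    catch
        Class:Reason:Stack ->
            case classify(Class, Reason) of
                unknown ->
                    erlang:raise(Class, Reason, Stack);
                Category ->
                    OnKnown(Category),
                    {error, Category}
            end
    end.
```

A call site might then look like crash_triage:guarded(fun() -> handle(Req) end, fun(Cat) -> error_logger:warning_msg("known error: ~p~n", [Cat]) end), where handle/1 and Req stand in for your own request handler. Re-raising the unknown case is the important part: it keeps new or unexpected crashes visible in the logs.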
On Jun 3, 2013, at 9:11 AM, Ransom Richardson <ransomr@REDACTED> wrote:
> Are there tools/procedures that are recommended for processing crash reports from a service running at scale?
> Currently we have a limited deployment and I look through all of the crash reports by hand. Some are very useful for finding actual bugs in our code. But other crashes are the result of clients sending bad data, strange timing issues (mostly not in our code), etc., and are not actionable. As we prepare to scale up our service, I'm wondering how to continue to get the value from the interesting crash reports without having to look through all of the uninteresting ones.
> I haven't found rb to be very useful for finding the new/interesting crashes. Are there effective ways that people are using it?
> Are there other tools for parsing and grouping crash reports to make it easy to find new/interesting ones?