[OT] This is sad

David N Murray dmurray@REDACTED
Wed May 12 03:21:17 CEST 2010


http://blog.reddit.com/2010/05/reddits-may-2010-state-of-servers.html

"After finally recovering from the Cassandra failure and preparing to head
off for some much-needed sleep, our internal message bus (rabbitmq) died,
which added about an hour to the downtime. It dies like this pretty often
at 2am or at other especially bad times. Usually it doesn't cause any
data-loss, just sleep-loss (its queues are persisted and the apps just
build up their own queues until it comes back up), but in this case it
decided to crash in a way that corrupted its database of persisted queues
beyond repair. rabbitmq accounts for the only unrecoverable data-loss
incurred, which was about 400 votes. As far as we can tell, these were
entirely unlinked events. Coincidentally, rabbitmq crashed twice more that
day and a few more times into the weekend. For now we've upgraded to the
latest version of Erlang (rabbitmq is written in Erlang) since R13B-4 is
rumoured to have significantly better memory management which can act as a
temporary stopgap for the apparent reasons for some of the crashes, but
not all of them. Things have improved thus far, but replacing rabbitmq is
at the top end of our extremely long list of things to do."

Not good publicity.

:-(


