[erlang-questions] Nested Case Statements v.s. multiple functions

Tue Sep 26 18:57:18 CEST 2017

On Tuesday, 2017-09-26 11:43:54 you wrote:
> Thank you all for your replies, I went ahead and changed my function names and added guards, and it now looks far cleaner.
> 
> One thing that I keep seeing in this thread though is some variant of “crash early, crash often,” and this is a little troubling. What if you are writing a program that is receiving messages and has a queue awaiting action? If the program dies, those messages will be lost, and if the calling processes made ‘casts’ then those messages won’t ever be delivered or processed. Also, it takes a bit of time for the supervisor to re-initialize the process, and this could be bad.
> 
> Is it always a good idea to “crash often” when bad input is received?

Yes, it is.

To imagine that a queue is building up is to imagine that we allow processes that are at the edge of the crashability matrix (that is, workers the farthest out on the supervision tree, farthest from the crash kernel of the system) to accumulate important state.

If something has the job of accumulating a queue of messages, its sole job is probably accumulating that queue. This is a simple job, generally speaking, and so the odds of that process crashing are VERY LOW. But it could happen. Most of the memory you will ever be exposed to is non-ECC, for example, so a cosmic ray will likely cause data corruption eventually (not to mention hardware errors and random acts of malintentioned deities; also, nothing ever works right on Christmas or Tuesday). Do you want to continue on with randomly bad data or crash and recover to a KNOWN STATE?

Hint: you're never going to actually figure out what was wrong with the bad data, only that it was bad in an ambiguous way.

Since most of your Heisenbugs are going to be random acts of nature or system states that are practically impossible to replicate, crashing is by far the more favorable option. In fact, you've really only got one option, and that is to restart *in a known state*. Would you prefer that this cost you 10k LoC in some byzantine braid of exception handling code interleaved with your business logic that buys you next to nothing in terms of understanding the problem (but costs a ton in terms of development time and money), or would you prefer that 99% of it was already part of a framework that handles things like this by design and was totally separate from your "happy path" business logic code?

That's the tradeoff addressed in Erlang system design, and is dealt with in the general case by OTP.
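
To make the "happy path" point concrete, here is a minimal sketch. The module name and the {order, Id, Qty} message shape are made up for illustration. The function handles only the input it understands; anything else raises function_clause, and the process dies and leaves recovery to whatever supervises it.

```erlang
-module(happy_path).
-export([parse_order/1]).

%% Hypothetical message shape: {order, Id, Qty}.
%% Only the valid shape is handled. Any other term raises function_clause,
%% which crashes the calling process instead of limping on with bad data.
parse_order({order, Id, Qty}) when is_integer(Id), is_integer(Qty), Qty > 0 ->
    {ok, #{id => Id, qty => Qty}}.
```

There is no error-handling branch here at all; that part is expected to live in the supervision structure, not in this function.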

So let's return to that queue accumulator process that we now know has the sole job of accumulating things to process. 

Because it is simple it is more likely not to crash. Simple == more stable and easier to glance at and prove it has few, if any, bugs. Generally speaking, of course. What else might we want it to do if it is an accumulator?

Perhaps it should know some things about the state of the runtime like load and memory use -- so it can determine whether it should have fewer or more processes doing processing jobs just then. And, perhaps most critically, it should probably know how to shed load when the queue is just unreasonably huge. That could mean telling external connections to throttle, flagging an unavailable state in the system to wave clients off from making requests, or just silently shedding load and logging it while it keeps chugging along.
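
As a rough sketch of the load-shedding idea, assuming a plain gen_server and a fixed length limit (the module name shed_queue, the MaxLen parameter and the {error, overloaded} reply are all invented for this example, not anything prescribed by OTP), a queue holder might refuse new jobs once it is full:

```erlang
-module(shed_queue).
-behaviour(gen_server).
-export([start_link/1, submit/2, len/1]).
-export([init/1, handle_call/3, handle_cast/2]).

%% MaxLen is an assumed limit; a real one would come from measurement.
start_link(MaxLen) ->
    gen_server:start_link(?MODULE, MaxLen, []).

%% Returns ok if the job was queued, {error, overloaded} if it was shed.
submit(Pid, Job) ->
    gen_server:call(Pid, {submit, Job}).

len(Pid) ->
    gen_server:call(Pid, len).

init(MaxLen) ->
    {ok, #{max => MaxLen, len => 0, q => queue:new()}}.

handle_call({submit, _Job}, _From, State = #{max := Max, len := Len})
  when Len >= Max ->
    %% Shed load: log it and tell the caller so it can throttle or retry.
    logger:warning("queue full (~p), shedding job", [Len]),
    {reply, {error, overloaded}, State};
handle_call({submit, Job}, _From, State = #{len := Len, q := Q}) ->
    {reply, ok, State#{len := Len + 1, q := queue:in(Job, Q)}};
handle_call(len, _From, State = #{len := Len}) ->
    {reply, Len, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

Note that the process does nothing clever with the jobs themselves. It only holds them and protects itself from overload; dispatching to workers is left out of this sketch.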

The optimal strategies are up to you, but the basic tools to implement those strategies are baked into Erlang and OTP.

But what you don't see here is any discussion about in-depth, complex business logic. None. Zero. If that process' job is to track a stateful queue then that's what it does. And it protects that queue from bad things like overload. It delegates the work, though, to subordinate workers (the spawning of which might adhere to any of a handful of core strategies). Those workers are single-threaded programs that work in an isolated memory space and they do complex things, and certainly might crash. And that's totally OK in the context of a system handling a bajillion messages.

And... since the message was dispatched by the queue manager, we could decide to retain the message and react to the crash of the worker (which could very well be a Heisenbug which won't pop up on a second run) by running a retry, and if that crashes also decide to give up.
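
One way this could look, as a sketch only: the dispatcher keeps the message, runs the work in a monitored worker process, and tries once more if the worker dies. The module name retry_dispatch and the two-attempt limit are assumptions for the example, and a real queue manager would do this asynchronously rather than block in a receive.

```erlang
-module(retry_dispatch).
-export([run/2]).

%% run(Fun, Msg) hands Msg to a fresh worker process and waits for it.
%% If the worker crashes, the retained Msg is tried one more time
%% before giving up.
run(Fun, Msg) ->
    run(Fun, Msg, 2).

run(_Fun, _Msg, 0) ->
    {error, gave_up};
run(Fun, Msg, AttemptsLeft) ->
    Parent = self(),
    Ref = make_ref(),
    {Pid, MRef} = spawn_monitor(fun() -> Parent ! {Ref, Fun(Msg)} end),
    receive
        {Ref, Result} ->
            %% Worker finished; drop the monitor and any pending 'DOWN'.
            erlang:demonitor(MRef, [flush]),
            {ok, Result};
        {'DOWN', MRef, process, Pid, _Reason} ->
            %% Worker died before replying; Msg is still in hand, so retry.
            run(Fun, Msg, AttemptsLeft - 1)
    end.
```

The caller never sees the crash itself, only {ok, Result} or {error, gave_up}, and the worker's failure cannot corrupt the dispatcher's state because the two share no memory.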

That is somewhat similar to what we see already built into OTP. You customize this case when you need to, but you don't just build up a ton of state in a single process and then lose everything when it dies. Whenever you have started to do that is when you've started to build a crash kernel that is way too big.

You may be wondering what "crash kernel" means. Every program has a crash kernel. In the single-threaded world the entire thing is part of the crash kernel, so identifying this concept is a useless exercise. In Erlang, though, we have to figure out what part of the system's state is sort of ephemeral and can be tried again or just flat out lost without defeating the purpose of the program, and what part of the system's state is so central that if it is lost the entire system should be brought down and restarted to cope with it. The latter is the crash kernel. A primary design goal of a robust system is to actively work toward a system architecture where the crash kernel is as tiny as possible.

Some tasks are more amenable to this than others, but as you move through your Erlang life you'll find that most problems that involve automated systems are just not that critical to get right EVERY time (hardware and transmission errors tend to outnumber software ones), and that when dealing with humans you can generally rely on them to re-try stuff that didn't work -- because humans are just fantastically trainable in the skill of frantically re-clicking and retyping things (so you may want to train yourself in the skill of writing idempotent functions...).
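
For what it's worth, here is one small sketch of an idempotent operation, assuming the client attaches a request ID to each request (the module idem, the ReqId argument and the account map are all hypothetical). Applying the same request twice leaves the state exactly as it was after the first time, so a frantic re-click does no harm.

```erlang
-module(idem).
-export([new/0, deposit/3]).

new() ->
    #{balance => 0, seen => #{}}.

%% ReqId is a hypothetical client-supplied request identifier.
%% A request that has already been applied is ignored on repeat.
deposit(ReqId, Amount, Acct = #{balance := Bal, seen := Seen}) ->
    case maps:is_key(ReqId, Seen) of
        true ->
            Acct;  % duplicate: already applied, change nothing
        false ->
            Acct#{balance := Bal + Amount, seen := Seen#{ReqId => true}}
    end.
```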

-Craig