[erlang-questions] Reliable kernels?

Tue Mar 28 07:41:55 CEST 2017

On 03/27/2017 09:22 PM, Richard A. O'Keefe wrote:
> I was introduced to the concept of a "million-year bug" today.
> That's a bug in a program that, if you were running it on a
> single CPU, would be expected to show up once in a million years.
>
> With a couple of thousand million Linux kernels around the world
> in phones &c, we can expect a million year bug in the kernel to
> show up tens of times a day.
>
> There's been a spate of problems with 911 in Dallas being overloaded
> with bogus calls apparently sent autonomously by certain mobile
> phones, to the point where a man and a child are thought to have
> died because genuine 911 callers were put on hold for a long time.
> That's probably *not* a million-year bug, but a million-year bug
> might do that kind of thing.
>
> Now me, I'm still happily pottering away on 4-core and 16-core
> machines (even a 1-core machine that sees a lot of use because it
> Just Keeps Working).  But I'm talking to people who want to use
> unbelievable amounts of computing power.
>
> I understand the Erlang Way:  write your software as lots of
> small things communicating through narrow protocols, *expect*
> failure and deal with it.  I believe!  Praise Joe, I believe!
>
> That's not the way the people I'm talking to think.  They've got
> a somewhat resilient data flow scheme they're proud of that has
> thousands and tens of thousands of nodes hooked up through
> Python, where the protocol between the nodes is Pyro.  Not,
> "uses Pyro", "IS Pyro".
>
> I'm supposed to tell these people what would be a good stripped
> down kernel to use (or what would be a good kernel to strip
> further), and I'm tempted to start by saying "strip away Python"
> which certainly won't make me popular (;-).  But thinking about
> error rates and million-year bugs has me thinking harder.
>
> It seems as if we have to think in terms of *expecting* the
> kernel itself to be unreliable-at-scale, so that something like
> JailHouse might be the right level to start.
>
> Does anyone have any experience with trying to harden a large system
> against faults in the distribution layer and in the kernel, and have
> any advice they'd care to share?
>
> None of this is Erlang-specific, it's just that I think the Erlang
> community are likely to have more relevant experience than most.
>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
>

The expectation seems to be that microkernels would eventually replace monolithic kernels and the change would provide better reliability and security.  The two most accessible, well-known, and complete approaches appear to be:

1) seL4 which is open-source and formally verified (https://en.wikipedia.org/wiki/L4_microkernel_family) available at https://github.com/seL4/seL4
2) MINIX 3 (https://en.wikipedia.org/wiki/MINIX_3) available at http://www.minix3.org/

For normal UNIX use of an operating system you would likely need to go with MINIX 3 and use its ability to install from the NetBSD ports tree.  MINIX has a history of mainly being used for teaching, so this approach is likely not something most people would agree with immediately.  There are other attempts to pursue microkernels but it doesn't appear like they have been able to receive much attention and keep their SLOC low.

A more typical choice would be FreeBSD instead of Linux (though both are monolithic kernels) since FreeBSD is perceived as being more reliable (with the explanation that features are added to Linux at a much quicker rate which includes the addition of bugs).