[erlang-questions] Control-Z as a scheduler test ... provokes gen_leader bug?

Scott Lystig Fritchie fritchie@REDACTED
Wed May 7 06:16:05 CEST 2008


Hi, all.  I've got a question about the fairness of using Control-Z as a
brute-force method of disrupting the VM's scheduler.

I used Control-Z and "kill -STOP {pid}" to simulate Erlang VM failures.
It's easier than yanking Ethernet cables and plugging them back in.
(Especially when all nodes are running on the same physical machine.)
After all, such suspension simulates a *really* slow/overloaded/
unresponsive-but-not-yet-dead Erlang VM.
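
Aside: the same suspend/resume can be scripted instead of leaning on the
shell's job control.  A minimal sketch, assuming (as in my setup) that all
three nodes share one machine and that a helper shell can rpc to the victim
node before freezing it:

    %% From a helper node (any node other than 'a@REDACTED'): grab the
    %% victim's OS pid, then freeze/thaw it with kill(1).
    OsPid = rpc:call('a@REDACTED', os, getpid, []),   % a string, e.g. "12345"
    os:cmd("kill -STOP " ++ OsPid),
    %% ... wait for the survivors to time out 'a@REDACTED' and re-elect ...
    os:cmd("kill -CONT " ++ OsPid).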

I'm looking for opinions on a few questions, actually.

1. Is this a failure case that can happen in The Real World?  Crazier
   things have happened, such as:
      a. "killall -STOP beam"
      b. Suspending an entire machine by sending a BREAK to a Sun RS-232
         serial console (or name-your-hardware/OS-equivalent).
      c. Virtual memory thrashing causing the OS scheduler to avoid
         scheduling the beam process for many seconds.

2. When it comes to long-term/persistent data storage (and other apps),
   people can get pretty pissed off even when one-in-a-million events
   happen.  (Er, I dunno if there's a question here. :)

3. Hrm, it looks like using the heart app would avoid this.
   What strategies do you use, if your app isn't running as "root"
   and so "whole OS reboot" isn't an option?
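
   For the non-root case in #3, the sort of thing I have in mind is letting
   heart restart just the node rather than the whole OS.  A rough sketch,
   where the restart script path is made up and the node must have been
   started with the -heart flag for any of this to matter:

       %% Node started as, say:  erl -sname a -heart -env HEART_BEAT_TIMEOUT 30
       %% Then tell heart what to run if this VM stops answering heartbeats:
       heart:set_cmd("/path/to/restart_node_a.sh").

   Heart's external monitor process would then run that script when the beam
   stops sending heartbeats, which a SIGSTOPped beam does.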

Below is the set of steps to demonstrate the problem with a 3-node
gen_leader system.  Is it possible that the bug lies within gdict.erl or
test_cb.erl instead?

-Scott
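
For reproduction, the three shells below are assumed to be distributed
nodes that can reach each other (started with "erl -sname a" and so on).
A quick sanity check from any one of them before step 1:

    net_adm:ping('b@REDACTED'),   % should return pong
    net_adm:ping('c@REDACTED'),
    nodes().                      % the other two nodes should be listed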

1. In the shell of each of the three nodes:

    gdict:new(foo, ['a@REDACTED', 'b@REDACTED', 'c@REDACTED'], []).

2. On any one of the three nodes:

    gdict:store(k1, 1, foo).
    gdict:store(k2, 2, foo).
    gdict:store(k3, 3, foo).

3. In the shell of the elected node, press Control-z to suspend the VM.
   For example's sake, we'll assume it's node 'a@REDACTED'.

4. Wait for the other two nodes to notice the timeout of 'a@REDACTED'.
   Then wait for the election announcement on one of the surviving nodes.
   Call the newly elected node X.

5. Type "fg" to resume node 'a@REDACTED'.
   I see no election notice, so node X is still the leader.

6. On one of the other two nodes, 'b@REDACTED' or 'c@REDACTED':

    gdict:store(k4, 4, foo).

7. Run the following on each node:

    gdict:fetch_keys(foo).

8. The results that I see are:

   node 'a@REDACTED'    node 'b@REDACTED'    node 'c@REDACTED'
   [k1,k2,k3]           [k1,k2,k3,k4]        [k1,k2,k3,k4]

9. On node 'a@REDACTED':

    gdict:store(k8, 88888, foo).

10. Run the following on each node:

    gdict:fetch_keys(foo).

11. The results that I see are (a scripted check across all three nodes is
    sketched below):

   node 'a@REDACTED'    node 'b@REDACTED'    node 'c@REDACTED'
   [k1,k2,k3,k8]        [k1,k2,k3,k4]        [k1,k2,k3,k4]
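
To save eyeballing three shells, the same comparison can be collected from
any one node with a quick rpc sweep (nothing here beyond stock OTP plus the
gdict calls already shown above):

    Nodes = ['a@REDACTED', 'b@REDACTED', 'c@REDACTED'],
    [{N, rpc:call(N, gdict, fetch_keys, [foo])} || N <- Nodes].

If the nodes agreed, every tuple would carry the same key list; instead I
get the split shown in steps 8 and 11.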



