disk IO kills network IO, +A doesn't work as I hoped

Wed Aug 17 01:21:04 CEST 2005

I'm seeking advice from experts on the +A erl flag and the IO thread  
architecture. The issue I'm finding is that a blocking disk operation  
(fsync in my case) results in a network IO degradation of 2 orders of  
magnitude despite me using multiple IO threads.

My architecture contains a node per CPU that does nothing but handle  
a transaction log, trying to write to it and making sure the data is  
sync'd to disk. There's an erlang process in that node that will  
fsync the log file descriptor to disk and gather more log data from  
other processes on other nodes while the fsync is pending. Once the  
fsync is complete, the batch of log data was received while sync'ing  
will be sync'd next while a new data batch is accumulated.

There's lots of tricks that one can play to work around the sync time  
etc., but ultimately that time is in the order of milliseconds. That  
means that there's a number of erlang log-messages I can send to the  
log node while the log node is sync'ing to disk.

My little experiment between two distinct Linux hosts shows that  
erlang ping roundtrip times (no payload to serialize) drop by a  
factor of 100 once the receiving log node starts to do the disk IO  
(the write/fsync, write/fsync, ...). The disk IO data size is varied  
between 1 byte and a few kilobytes, and since I currently don't do  
any of the tricks alluded to above, an fsync requires in the order of  
15ms on the HW I'm using. Plenty of time to receive and process  
erlang messages sent from other nodes, I had hoped.

This is so despite me using the +A flag (I've tried 2 to 10 threads,  
assuming that 2 would be logical, i.e. 1 for disk and 1 for network IO).

Is this explicable by the thread architecture involved in IO?  Why  
does using 3 threads (in the example below) not allow the VM to  
sustain the ping times (roughly)?  I'm very puzzled about this since  
the writes are very small, and the fsync is really not using up any  
CPU time, or erlang reductions to speak of, i.e. there should be VM  
reduction cycles to process the network calls.

Below is the src and the command lines and erlang commands to  
reproduce the issue should you have 2 boxes available (2 boxes to  
make sure we're not fooled by context switching between the 2 VMs  
involved).

Any workarounds someone can suggest?  Have you seen similar issues in  
the past?

Thanks,
- Reto

reto@REDACTED 9% erl -pa . -sname probe
Erlang (BEAM) emulator version 5.4.3 [source] [hipe]

Eshell V5.4.3  (abort with ^G)
(probe@REDACTED)1> eflush:pingtest(probe@REDACTED).

...
8042 0.141000 ms   <--- simple ping rpc round trip time, fast network!
8043 0.144000 ms
8044 0.144000 ms
8045 0.151000 ms
8046 0.144000 ms
8047 0.144000 ms
8048 0.142000 ms
8049 17.6220 ms    <--- node telco02 started to write to disk, ping  
times increase by *100
8050 2.49300 ms
8051 239.361 ms
8052 234.661 ms
8053 167.171 ms
8054 111.985 ms
8055 411.450 ms
8056 285.551 ms
8057 278.745 ms
8058 315.564 ms
8059 278.830 ms
8060 309.407 ms
8061 248.964 ms
...

telco02$ erl +A3 -sname disk
Erlang (BEAM) emulator version 5.4.3 [source] [hipe]

Eshell V5.4.3  (abort with ^G)
(disk@REDACTED)1> eflush:run().

1 15.1836 ms 65.8606
129 15.3112 ms 65.3118
257 14.7000 ms 68.0272
385 15.1941 ms 65.8152
513 14.8843 ms 67.1850
641 15.3734 ms 65.0474
769 15.4631 ms 64.6702
897 15.1325 ms 66.0829
...

-module(eflush).

-export([run/0, run/1, pingtest/1]).

-export([ping/0, callout/1]).
-export([run_kernel/3, close/1]).

run() ->
     lists:map(
       fun(Length) ->
           run(Length)
%      end, lists:seq(0, 65536*3, 64)),
%      end, lists:seq(130000, 150000, 64)),
       end, lists:seq(1,8196,128)),
     ok.

run(Length) ->
     Duration = 10, % number of test runs
     N = 50, % number of times write+flush is averaged over

     L = lists:duplicate(Length, 65), % write Length bytes
     B = list_to_binary(L),
     R = lists:map(
       fun(I) ->
           {ok, H} = file:open("test", [raw, write, binary]),
           {T1, ok} = timer:tc(?MODULE, run_kernel, [H, N, {B}]),
           close(H),
           {I, T1/(N*1000)}
       end, lists:seq(1, Duration)),
     RF = lists:map(fun({_, X}) -> X end, R),
     AvgF = lists:sum(RF)/length(RF),
     io:format("~p ~p ms ~p~n", [Length, AvgF, 1000/AvgF]),
     ok.

close(H) ->
     file:close(H).

run_kernel(H, N, Parameter) ->
     iteration(H, Parameter, N).

iteration(_, _, 0) ->
     ok;
iteration(H, {Binary} = Parameter, N) ->
     file:write(H, Binary),
     file:sync(H),
     iteration(H, Parameter, N-1).

pingtest(Node) ->
     R = lists:map(
       fun(I) ->
           {D, ok} = timer:tc(?MODULE, callout, [Node]),
           io:format("~p ~p ms~n", [I, D/1000]),
           {I, D/1000}
       end, lists:seq(1, 99999)).

callout(Node) ->
     rpc:call(Node, ?MODULE, ping, []).

ping() ->
     ok.