disk IO kills network IO, +A doesn't work as I hoped

Reto Kramer kramer@REDACTED
Wed Aug 17 01:25:46 CEST 2005


And there are two typos, of course ;-)

The ping roundtrip times to the "disk node" increase (not drop) by a  
factor of 100 when disk IO kicks in.
The command on the telco04 node (the one that pings the disk node)  
should be: eflush:pingtest(disk@REDACTED).

- Reto

On Aug 16, 2005, at 4:21 PM, Reto Kramer wrote:

> I'm seeking advice from experts on the +A erl flag and the IO  
> thread architecture. The issue I'm seeing is that a blocking disk  
> operation (fsync in my case) degrades network IO by two orders of  
> magnitude even though I use multiple IO threads.
>
> My architecture contains one node per CPU that does nothing but  
> handle a transaction log: it writes to the log and makes sure the  
> data is sync'd to disk. An Erlang process in that node fsyncs the  
> log file descriptor to disk while gathering more log data from  
> processes on other nodes. Once the fsync completes, the batch of  
> log data that was received while sync'ing is sync'd next, while a  
> new batch is accumulated.
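>
> A minimal sketch of that batching loop (hypothetical module glog and  
> names of my own choosing, just to illustrate the group-commit idea,  
> not the code I actually run):
>
> -module(glog).
> -export([start/1, append/2]).
>
> %% Spawn a logger process that owns the raw file descriptor.
> start(File) ->
>     {ok, H} = file:open(File, [raw, write, binary]),
>     spawn(fun() -> loop(H, []) end).
>
> %% Synchronous append: returns once the data has been fsync'd.
> append(Logger, Bin) ->
>     Logger ! {append, self(), Bin},
>     receive {Logger, synced} -> ok end.
>
> %% Drain whatever accumulated in the mailbox, then commit the
> %% whole batch with one file:sync/1 and ack all waiting senders.
> loop(H, Waiting) ->
>     receive
>         {append, From, Bin} ->
>             ok = file:write(H, Bin),
>             loop(H, [From | Waiting])
>     after 0 ->
>             case Waiting of
>                 [] ->
>                     receive
>                         {append, From, Bin} ->
>                             ok = file:write(H, Bin),
>                             loop(H, [From])
>                     end;
>                 _ ->
>                     ok = file:sync(H),
>                     [From ! {self(), synced} || From <- Waiting],
>                     loop(H, [])
>             end
>     end.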
>
> There are lots of tricks one can play to work around the sync  
> time etc., but ultimately that time is on the order of  
> milliseconds. That means I can send a number of Erlang log  
> messages to the log node while the log node is sync'ing to  
> disk.
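>
> (Back of the envelope: with an fsync of roughly 15 ms and round  
> trips of roughly 0.14 ms, on the order of a hundred messages should  
> fit into a single sync window.)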
>
> My little experiment between two distinct Linux hosts shows that  
> Erlang ping round-trip times (no payload to serialize) drop by a  
> factor of 100 once the receiving log node starts to do the disk IO  
> (the write/fsync, write/fsync, ...). The disk IO data size is  
> varied between 1 byte and a few kilobytes, and since I currently  
> don't do any of the tricks alluded to above, an fsync takes on  
> the order of 15 ms on the HW I'm using. Plenty of time to receive  
> and process Erlang messages sent from other nodes, I had hoped.
>
> This happens despite my using the +A flag (I've tried 2 to 10  
> threads, on the assumption that 2 would be the logical minimum,  
> i.e. 1 for disk and 1 for network IO).
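>
> (To rule out the pool simply not being enabled, the VM can be asked  
> for its size; as far as I know, erlang:system_info(thread_pool_size)  
> reports the value set with +A, so with erl +A3 it should return 3:)
>
> (disk@REDACTED)1> erlang:system_info(thread_pool_size).
> 3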
>
> Can the thread architecture involved in IO explain this?  Why  
> does using 3 threads (in the example below) not allow the VM to  
> (roughly) sustain the ping times?  I'm very puzzled by this,  
> since the writes are very small and the fsync uses essentially no  
> CPU time and no Erlang reductions to speak of, i.e. there should  
> be plenty of VM reduction cycles left to process the network calls.
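>
> (One diagnostic I can think of, untested: run the same pingtest from  
> a process on the disk node itself while eflush:run() is going, e.g.
>
>   spawn(fun() -> eflush:pingtest(node()) end).
>
> If local round trips stall as well, the emulator itself is blocking  
> on the fsync, rather than the network driver merely being starved.)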
>
> Below are the source, the command lines and the Erlang commands to  
> reproduce the issue, should you have 2 boxes available (2 boxes to  
> make sure we're not fooled by context switching between the 2 VMs  
> involved).
>
> Any workarounds someone can suggest?  Have you seen similar issues  
> in the past?
>
> Thanks,
> - Reto
>
>
> reto@REDACTED 9% erl -pa . -sname probe
> Erlang (BEAM) emulator version 5.4.3 [source] [hipe]
>
> Eshell V5.4.3  (abort with ^G)
> (probe@REDACTED)1> eflush:pingtest(probe@REDACTED).
>
> ...
> 8042 0.141000 ms   <--- simple ping rpc round trip time, fast network!
> 8043 0.144000 ms
> 8044 0.144000 ms
> 8045 0.151000 ms
> 8046 0.144000 ms
> 8047 0.144000 ms
> 8048 0.142000 ms
> 8049 17.6220 ms    <--- node telco02 started to write to disk, ping times increase by x100
> 8050 2.49300 ms
> 8051 239.361 ms
> 8052 234.661 ms
> 8053 167.171 ms
> 8054 111.985 ms
> 8055 411.450 ms
> 8056 285.551 ms
> 8057 278.745 ms
> 8058 315.564 ms
> 8059 278.830 ms
> 8060 309.407 ms
> 8061 248.964 ms
> ...
>
>
> telco02$ erl +A3 -sname disk
> Erlang (BEAM) emulator version 5.4.3 [source] [hipe]
>
> Eshell V5.4.3  (abort with ^G)
> (disk@REDACTED)1> eflush:run().
>
> 1 15.1836 ms 65.8606    <--- bytes written, mean write+fsync time, fsyncs/second
> 129 15.3112 ms 65.3118
> 257 14.7000 ms 68.0272
> 385 15.1941 ms 65.8152
> 513 14.8843 ms 67.1850
> 641 15.3734 ms 65.0474
> 769 15.4631 ms 64.6702
> 897 15.1325 ms 66.0829
> ...
>
>
>
>
> -module(eflush).
>
> -export([run/0, run/1, pingtest/1]).
>
> -export([ping/0, callout/1]).
> -export([run_kernel/3, close/1]).
>
> %% Sweep the write+fsync benchmark over a range of write sizes.
> run() ->
>     lists:map(
>       fun(Length) ->
>           run(Length)
> %      end, lists:seq(0, 65536*3, 64)),
> %      end, lists:seq(130000, 150000, 64)),
>       end, lists:seq(1, 8196, 128)),
>     ok.
>
> %% Time N write+fsync cycles of Length bytes, repeated Duration
> %% times, then print the mean latency and the implied fsyncs/second.
> run(Length) ->
>     Duration = 10, % number of test runs
>     N = 50, % number of write+flush cycles averaged over
>
>     L = lists:duplicate(Length, 65), % write Length bytes (of $A)
>     B = list_to_binary(L),
>     R = lists:map(
>       fun(I) ->
>           {ok, H} = file:open("test", [raw, write, binary]),
>           {T1, ok} = timer:tc(?MODULE, run_kernel, [H, N, {B}]),
>           close(H),
>           {I, T1/(N*1000)} % mean ms per write+fsync in this run
>       end, lists:seq(1, Duration)),
>     RF = lists:map(fun({_, X}) -> X end, R),
>     AvgF = lists:sum(RF)/length(RF),
>     io:format("~p ~p ms ~p~n", [Length, AvgF, 1000/AvgF]),
>     ok.
>
> close(H) ->
>     file:close(H).
>
> run_kernel(H, N, Parameter) ->
>     iteration(H, Parameter, N).
>
> %% The measured kernel: N sequential write+fsync cycles.
> iteration(_, _, 0) ->
>     ok;
> iteration(H, {Binary} = Parameter, N) ->
>     ok = file:write(H, Binary),
>     ok = file:sync(H),
>     iteration(H, Parameter, N-1).
>
> %% RPC-ping Node repeatedly and print each round-trip time in ms.
> pingtest(Node) ->
>     lists:map(
>       fun(I) ->
>           {D, ok} = timer:tc(?MODULE, callout, [Node]),
>           io:format("~p ~p ms~n", [I, D/1000]),
>           {I, D/1000}
>       end, lists:seq(1, 99999)).
>
> callout(Node) ->
>     rpc:call(Node, ?MODULE, ping, []).
>
> ping() ->
>     ok.
>
>



