disk IO kills network IO, +A doesn't work as I hoped
Reto Kramer
kramer@REDACTED
Wed Aug 17 01:25:46 CEST 2005
And there are two typos in my message below, of course ;-)
The ping roundtrip times to the "disk node" increase (not drop) by a
factor of 100 when disk IO kicks in.
The command on the telco04 node (the one that pings the disk node)
is: eflush:pingtest(disk@REDACTED).
- Reto
On Aug 16, 2005, at 4:21 PM, Reto Kramer wrote:
> I'm seeking advice from experts on the +A erl flag and the IO
> thread architecture. The issue I'm finding is that a blocking disk
> operation (fsync in my case) degrades network IO by two orders of
> magnitude even though I am using multiple IO threads.
>
> My architecture contains a node per CPU that does nothing but
> handle a transaction log, writing to it and making sure the
> data is sync'd to disk. There's an erlang process in that node that
> will fsync the log file descriptor to disk and gather more log data
> from other processes on other nodes while the fsync is pending.
> Once the fsync is complete, the batch of log data that was received
> while sync'ing will be sync'd next, while a new data batch is
> accumulated.
>
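> In other words, the log process drains its mailbox into a batch,
> writes and syncs the batch, and repeats; anything arriving during
> the sync simply queues up for the next round. A minimal sketch of
> that loop (the {log, Data} message shape is my simplification
> here, not the real protocol):
>
> log_loop(Fd, Batch) ->
>     receive
>         {log, Data} ->
>             %% Accumulate everything already waiting in the mailbox.
>             log_loop(Fd, [Data | Batch])
>     after 0 ->
>         case Batch of
>             [] ->
>                 %% Nothing buffered: block until the next message.
>                 receive {log, Data} -> log_loop(Fd, [Data]) end;
>             _ ->
>                 %% Write the whole batch, then force it to disk.
>                 %% Messages arriving during the sync queue up in
>                 %% the mailbox and become the next batch.
>                 ok = file:write(Fd, lists:reverse(Batch)),
>                 ok = file:sync(Fd),
>                 log_loop(Fd, [])
>         end
>     end.
>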
> There are lots of tricks one can play to work around the sync
> time etc., but ultimately that time is on the order of
> milliseconds. That means that there are a number of erlang log
> messages I can send to the log node while the log node is sync'ing
> to disk.
>
> My little experiment between two distinct Linux hosts shows that
> erlang ping roundtrip times (no payload to serialize) drop by a
> factor of 100 once the receiving log node starts to do the disk IO
> (the write/fsync, write/fsync, ...). The disk IO data size is
> varied between 1 byte and a few kilobytes, and since I currently
> don't do any of the tricks alluded to above, an fsync takes on
> the order of 15ms on the HW I'm using. Plenty of time to receive
> and process erlang messages sent from other nodes, I had hoped.
>
> This is so despite my using the +A flag (I've tried 2 to 10
> threads; my assumption was that 2 would be the logical choice,
> i.e. 1 for disk and 1 for network IO).
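>
> (As an aside, one can double-check at runtime that the emulator
> really got the async pool it was asked for; thread_pool_size is
> the system_info key that reports the +A value, e.g. with +A3:
>
> 1> erlang:system_info(thread_pool_size).
> 3
> )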
>
> Is this explicable by the thread architecture involved in IO? Why
> does using 3 threads (in the example below) not allow the VM to
> sustain the ping times (roughly)? I'm very puzzled about this,
> since the writes are very small and the fsync uses up hardly any
> CPU time or erlang reductions to speak of, i.e. there should be
> plenty of VM reduction cycles left to process the network calls.
>
> Below are the source, the command lines, and the erlang commands
> to reproduce the issue, should you have 2 boxes available (2 boxes
> to make sure we're not fooled by context switching between the 2
> VMs involved).
>
> Any workarounds someone can suggest? Have you seen similar issues
> in the past?
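>
> (One idea I'm toying with, untested: push the blocking write/fsync
> out of the emulator entirely, into a small external helper driven
> through a port, so that only that OS process ever blocks. The
> helper program logsyncd below is hypothetical; it would read
> {packet, 4}-framed chunks on stdin, append them to the log, fsync,
> and write back an "ok" frame:
>
> sync_via_port(Port, LogData) ->
>     %% Port was opened once, elsewhere, e.g.:
>     %%   open_port({spawn, "./logsyncd"}, [binary, {packet, 4}])
>     port_command(Port, LogData),
>     receive
>         {Port, {data, <<"ok">>}} -> ok
>     after 5000 -> {error, timeout}
>     end.
> )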
>
> Thanks,
> - Reto
>
>
> reto@REDACTED 9% erl -pa . -sname probe
> Erlang (BEAM) emulator version 5.4.3 [source] [hipe]
>
> Eshell V5.4.3 (abort with ^G)
> (probe@REDACTED)1> eflush:pingtest(probe@REDACTED).
>
> ...
> 8042 0.141000 ms <--- simple ping rpc round trip time, fast network!
> 8043 0.144000 ms
> 8044 0.144000 ms
> 8045 0.151000 ms
> 8046 0.144000 ms
> 8047 0.144000 ms
> 8048 0.142000 ms
> 8049 17.6220 ms <--- node telco02 started to write to disk; ping
> times increase by a factor of 100
> 8050 2.49300 ms
> 8051 239.361 ms
> 8052 234.661 ms
> 8053 167.171 ms
> 8054 111.985 ms
> 8055 411.450 ms
> 8056 285.551 ms
> 8057 278.745 ms
> 8058 315.564 ms
> 8059 278.830 ms
> 8060 309.407 ms
> 8061 248.964 ms
> ...
>
>
> telco02$ erl +A3 -sname disk
> Erlang (BEAM) emulator version 5.4.3 [source] [hipe]
>
> Eshell V5.4.3 (abort with ^G)
> (disk@REDACTED)1> eflush:run().
>
> 1 15.1836 ms 65.8606
> 129 15.3112 ms 65.3118
> 257 14.7000 ms 68.0272
> 385 15.1941 ms 65.8152
> 513 14.8843 ms 67.1850
> 641 15.3734 ms 65.0474
> 769 15.4631 ms 64.6702
> 897 15.1325 ms 66.0829
> ...
>
>
>
>
> %% eflush: measure raw write+fsync latency (run/0, run/1) and
> %% erlang rpc ping roundtrip times to a node (pingtest/1).
> -module(eflush).
>
> -export([run/0, run/1, pingtest/1]).
>
> -export([ping/0, callout/1]).
> -export([run_kernel/3, close/1]).
>
> %% Sweep the write size from 1 byte up to ~8 KB in 128-byte steps
> %% and report the average write+fsync time for each size.
> run() ->
>     lists:map(
>       fun(Length) ->
>               run(Length)
>       % end, lists:seq(0, 65536*3, 64)),
>       % end, lists:seq(130000, 150000, 64)),
>       end, lists:seq(1, 8196, 128)),
>     ok.
>
> run(Length) ->
>     Duration = 10,  % number of test runs to average over
>     N = 50,         % write+fsync iterations per test run
>
>     L = lists:duplicate(Length, 65),  % Length bytes of $A
>     B = list_to_binary(L),
>     R = lists:map(
>           fun(I) ->
>               {ok, H} = file:open("test", [raw, write, binary]),
>               %% Time N write+fsync iterations on the raw file.
>               {T1, ok} = timer:tc(?MODULE, run_kernel, [H, N, {B}]),
>               close(H),
>               {I, T1/(N*1000)}  % average ms per write+fsync
>           end, lists:seq(1, Duration)),
>     RF = lists:map(fun({_, X}) -> X end, R),
>     AvgF = lists:sum(RF)/length(RF),
>     %% size, average ms per write+fsync, and syncs per second
>     io:format("~p ~p ms ~p~n", [Length, AvgF, 1000/AvgF]),
>     ok.
>
> close(H) ->
>     file:close(H).
>
> run_kernel(H, N, Parameter) ->
>     iteration(H, Parameter, N).
>
> iteration(_, _, 0) ->
>     ok;
> iteration(H, {Binary} = Parameter, N) ->
>     %% Append the test payload, then force it out to disk.
>     file:write(H, Binary),
>     file:sync(H),
>     iteration(H, Parameter, N-1).
>
> %% Repeatedly rpc-ping Node and print each roundtrip time in ms.
> pingtest(Node) ->
>     lists:map(
>       fun(I) ->
>               {D, ok} = timer:tc(?MODULE, callout, [Node]),
>               io:format("~p ~p ms~n", [I, D/1000]),
>               {I, D/1000}
>       end, lists:seq(1, 99999)).
>
> callout(Node) ->
>     rpc:call(Node, ?MODULE, ping, []).
>
> ping() ->
>     ok.
>
>